Re: fastest way to parse a file; Most efficient way to store the data?

From: Nicholas Paldino [.NET/C# MVP] (mvp_at_spam.guard.caspershouse.com)
Date: 12/14/04


Date: Tue, 14 Dec 2004 10:38:58 -0500

hoopsho,

    What I would do is use a FileStream (perhaps with a BufferedStream on
top to reduce reads to the disk) to read the contents of the file. Since
you say the Split method is too slow, I recommend that you read character by
character, tokenizing the lines yourself.

    If you need to make sure that every line is unique, then you would want
to use a struct. However, storing these in a Hashtable will mean that you
have to box and unbox the value every time you want to access this, and
that's extremely inefficient.

    The reason you would use a structure is that it would provide you with a
hashvalues that are equal if all of the hashvalues are the same.

    To get around this, I would recommend creating a structure that has all
the fields you will populate. Then, create a class wrapper around the
structure, which has a private field of the type of the structure. Override
the GetHashCode to return the hash code from the structure. Override Equals
as well to return true when the hash codes match.

    Then, I would store that in the hashtable. When adding a new row, check
to see if there is a value in the hashtable already. If there is, then
ignore it, if not, then store it.

    Then you have the issue of sorting the data on a specific field. The
hashtable isn't going to give you anything, so then I would recommend a
DataSet, or use an ArrayList that holds references to the same items that
are in the Hashtable. Storing a million row table in a data set is going to
crush your server (unless you have a TON of memory that you can spare), and
generally isn't a good idea. The hashtable has the same drawbacks. If you
don't want to use the DataSet and you know the field(s) you want to sort on,
then I say that as you add the items to the hashtable, perform a binary
search through the ArrayList and insert the new value at the position in the
arraylist where it should go. This is an issue as well. Say you have to
insert a reference at the beginning of the list for the millionth record.
Moving 999,999 records forward is not going to be quick.

    It is this kind of scenario where you might want to consider unsafe
code, allocating a buffer of memory, and moving the records around yourself.

    In the end, I would recommend that you use a RDBMS to store the values
temporarily. The hashtable scheme used before can help here. If you detect
that you don't have a row with those values, then you can insert it,
otherwise, do nothing. Granted, the hashtable is going to incur overhead,
and a lot. However, it would reduce the number of operations against the
database that you have to perform, depending on the number of duplicates you
have (on average).

    Finally, when you want to output the results, you can just perform a
query, selecting the fields you want, and sorting it how you wish.

    You are ultimately going to make your decision based on a number of
factors. The biggest one is the amount of RAM that you have on the machine.
If you have a good amount that could support this much information in your
structure in memory, then by all means, use a DataSet. However, if you do
not, or the number of duplicate records is high, then go with the
Hashtabe/DB solution, as it would be much faster than looping through all of
the items yourself, comparing fields, trying to sort, etc, etc.

    Hope this helps.

--
               - Nicholas Paldino [.NET/C# MVP]
               - mvp@spam.guard.caspershouse.com
"hoopsho" <hoopsho@gmail.com> wrote in message 
news:1103035605.787057.247630@c13g2000cwb.googlegroups.com...
> Hi Everyone,
> I am trying to write a program that does a few things very fast
> and with efficient use of memory...
>
> a)  I need to parse a space-delimited file that is really large,
> upwards fo a million lines.
> b)  I need to store the contents into a unique hash.
> c)  I need to then sort the data on a specific field.
> d)  I need to pull out certain fields and report them to the user.
>
> So my questions are as follows:
>
> o  What is the most effiecient way to pull fields out of a file?
> It does not appear that "tokenizing" the string of each line
> into an array via split is very efficient.... is there another way?
> o  I need to take the id of each field and ensure that it does not
> already exist in my "collection".  Is using the arraylist the best way
> to do this?  Should I take the data from each line of the file and make
> it into an object or a STRUCT?  Which is faster and better from a
> memory standpoint?
> o  Which collection has the best sorting capabilities?
> o  I assume that a for/next type of loop is the best way to
> iterate through this collection to output the data... is there a better
> way?
>
> Thanks for any help.
>