Re: Count Lines in (Huge) Text Files


On Fri, 15 Aug 2008 13:14:31 -0700, NvrBst <nvrbst@xxxxxxxxx> wrote:

> Sorry, I should have elaborated more. For the numbers, the initial (very
> first) load was always about 60 seconds (50-80 seconds) for all cases.

Ah. Right. So, that's the true time it takes to read the file. Unless you are scanning the same file over and over, the other numbers are irrelevant. They are only that fast because the OS has cached the file after the first time.

Note that measuring the difference between RandomAccess and Sequential is also irrelevant except for the first time the file is scanned. The whole point of Sequential is to provide a hint to the OS that it doesn't have to keep around older data in the file as it's being read. Of course, the OS is free to ignore the hint, but from your description it sounds like at least some of the time the OS honors it, so the file data winds up _not_ being cached (hence the slower speeds).

There's some (minimal) overhead to caching, so providing that hint could be important when you really are reading a lot of files, all of them just once through. But that doesn't seem to be what you're doing here. RandomAccess is probably the right choice, but mainly because Sequential might be a lie (since you go back and revisit the file). :)

And finally, note that the actual "reading the file" time is orders of magnitude longer than when it's already in memory. This means that whatever performance difference exists in your code, it's completely hidden by the cost of actually reading the data from the disk. With the exception of any truly awful implementations (if you found a version of the code that was 10x or more slower, for example), there really is no point trying to optimize this section of code. The only time it's slow, it's slow for reasons completely unrelated to (and unfixable by) the code itself.
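To see the cold-versus-cached effect for yourself, a rough sketch like the following may help. It is written in Python purely for brevity (the code in this thread is .NET), and the names count_lines and timed_passes are made up for the illustration. It scans the same file several times and records the time of each pass; on a large file that has not been read recently, the first pass would be expected to dominate.

```python
import os
import tempfile
import time


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes by reading the file in fixed-size chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def timed_passes(path, passes=3):
    """Scan the file `passes` times; return (line_count, seconds_per_pass)."""
    timings = []
    count = 0
    for _ in range(passes):
        start = time.perf_counter()
        count = count_lines(path)
        timings.append(time.perf_counter() - start)
    return count, timings


# Tiny stand-in file so the sketch runs anywhere. A file that was just
# written is already in the OS cache, so every pass here is a "warm" one;
# point sample_path at a real, large log file to see the cold first pass.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line\n" * 1000)
    sample_path = tmp.name

line_count, timings = timed_passes(sample_path)
os.remove(sample_path)
print(line_count, ["%.6f s" % t for t in timings])
```

Only the first pass over an uncached file reflects the real disk cost; the later passes mostly measure the scanning code itself.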

[...]
> What I was doing was adding the Position of the newline to a
> "List<uint>" (counting was just simpler for me to convey the problem),
> and using that Position to get the item in the
> "GetVirtualItem(Index#Requested)" call. With my 300MB log files, the
> List only grows to be about 3MB so far but, potentially, this could be
> bad if say there are only about 10 characters per line, on a 1G file
> (I don't think I will run into this though, the lines usually are very
> long for my files).

I think using a List<T> of indices is fine. Even a 50MB data structure is probably not a huge problem on modern PCs, assuming you don't need to have a lot of them all at once. And of course, it has the benefit of being completely precise.

But it does add even more overhead to the scanning code, making it even more pointless to optimize the actual "look for line breaks" part. :)
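For reference, the index-of-offsets idea can be sketched as below. This is only an illustration in Python, not the code from the thread: build_line_index and read_line are hypothetical names, and the array of 8-byte offsets stands in for the List<uint> described above. One pass records where each line starts; after that, any row can be fetched with a single seek.

```python
import os
import tempfile
from array import array


def build_line_index(path, chunk_size=1 << 20):
    """Scan the file once; return (line_start_offsets, file_size)."""
    offsets = array("Q", [0])  # the first line starts at byte 0
    pos = 0                    # file offset of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                nl = chunk.find(b"\n", start)
                if nl == -1:
                    break
                offsets.append(pos + nl + 1)  # next line starts after "\n"
                start = nl + 1
            pos += len(chunk)
    # An offset equal to the file size marks an empty "line" past the final
    # newline (or an empty file), so drop it.
    if offsets[-1] == pos:
        offsets.pop()
    return offsets, pos


def read_line(path, offsets, file_size, index):
    """Fetch one line by seeking directly to its recorded start offset."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else file_size
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).rstrip(b"\r\n")


# Small demonstration file with mixed line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\r\ngamma")
    demo_path = tmp.name

offsets, size = build_line_index(demo_path)
second = read_line(demo_path, offsets, size, 1)
third = read_line(demo_path, offsets, size, 2)
os.remove(demo_path)
```

As a sanity check on memory: at 4 bytes per entry, a 3MB list corresponds to roughly 750,000 lines. The worst case mentioned above (a 1GB file with about 10 characters per line) would be around 100 million entries, or about 400MB at 4 bytes each, which is where the precise index starts to get expensive.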

Your estimation-based implementation options both obviate any need to scan the file at all, of course. So not only are they memory-lightweight, if you find that scanning the file is an unacceptable cost, that would be the way to go. The scroll bar will wind up behaving funny though, because the control is going to expect a direct mapping between the scroll bar and rows in the data, which you won't have.
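The details of the estimation-based options are elided from the quote above, so the following is just one plausible shape for such an approach, sketched in Python with hypothetical names (estimate_line_count, line_near_row). It assumes the row count is guessed from the average line length in a sample at the start of the file, and that a requested row is mapped proportionally to a byte offset and then snapped forward to the next line start.

```python
import os
import tempfile


def estimate_line_count(path, sample_bytes=64 * 1024):
    """Guess the number of lines from the average line length in a sample."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if newlines == 0:
        return 1  # no line break in the sample; treat the file as one line
    avg_line_len = len(sample) / newlines
    return max(1, round(size / avg_line_len))


def line_near_row(path, row, estimated_rows):
    """Return the first full line at or after the offset proportional to row."""
    size = os.path.getsize(path)
    if size == 0 or estimated_rows <= 0:
        return b""
    target = min(size - 1, (size * row) // estimated_rows)
    with open(path, "rb") as f:
        if target > 0:
            f.seek(target - 1)
            if f.read(1) != b"\n":
                f.readline()  # landed mid-line: skip the rest of that line
        return f.readline().rstrip(b"\r\n")


# Demonstration file: 1000 lines of exactly 10 bytes each, so the estimate
# happens to be exact. Real logs with varying line lengths will not be.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(b"%09d\n" % i)
    est_path = tmp.name

estimated = estimate_line_count(est_path)
row_500 = line_near_row(est_path, 500, estimated)
os.remove(est_path)
```

With uneven line lengths, the row shown for a given scroll position is only approximately the row requested, which is the source of the odd scroll bar behavior described above.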

Pete