Re: Count Lines in (Huge) Text Files
- From: "Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx>
- Date: Thu, 14 Aug 2008 17:22:24 -0700
On Thu, 14 Aug 2008 16:45:40 -0700, NvrBst <nvrbst@xxxxxxxxx> wrote:
[...]
I'm able to count files under 300MB files (100MB in about 2 or 3
seconds, 300MB in about 20 seconds). My problem is that I need to
count lines in really big files (1GB -> 2GB files). Is there any
suggestions to make that possible? Or the above basically as good as
it gets?
On what kind of hard drive are you testing?
Looking at the data at http://www.storagereview.com/, it appears to me that high-end drives top out at around 130MB/s, there's another class at around 90MB/s, and then a third at around 60MB/s. Note that these benchmarks are run using high-end hardware that can ensure that the drive is the only bottleneck.
Depending on what hardware you're actually using, 50MB/s may be about as good as you get. I have no idea why you would be able to read 100MB "in about 2 or 3 seconds" but then take ten times as long to read only 3 times as much data. I'd look into that before I worried about finer implementation details.
As far as your specific comments go...
--Notes--
1. I've tried a lot of combinations of buffSize/streamSize, and 65kb/
65kb seems to work fastest for files above 100MB. Making them higher
doesn't seem to increase speed much, streamSize can be ~8kb without
noticable changes, however, thought it doesn't hurt to keep them the
same.
I would expect best results to come where "streamSize" is set to a value some integral multiple of "buffSize", and where each is some integral multiple of 4096 (the default size of a page in memory).
A few years ago, I was doing some high-throughput disk stuff and my recollection is that I found the same thing you did: larger buffers only helped up to about 8K or so, and past that any improvement was minimal.
2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc, they all are usally a lot slower than the
above.
I don't understand why StreamReader would be "a lot slower". It seems to me that with appropriate settings for its buffer, it should perform better, since it ought to be optimized for line-based i/o. Did you try different buffer sizes for it?
Though, now that I think about it, StreamReader does do a lot of superfluous work, like allocating the string that's returned from ReadLine() and doing character decoding. So maybe if all you care about is the count of lines, a raw scan is best.
3. foreach seems to work a lot faster than the "for(...)" statment
which is why I do both. Cleaner way to do this?
I suppose that depends on why there's the performance difference in the first place. But you could write a single loop like this:
int ib = 0;
foreach (byte b in bArr)
{
if (ib++ < i)
{
if (b == (byte)'\n')
{
numOfLines++;
}
}
else
{
break;
}
}
Assuming what's hurting you in the explicit for() loop is the retrieval of the data and not the counter increment, the above should perform basically as well as a plain foreach() loop.
I'm surprised there's such a significant difference though. After all, foreach() has to allocate an enumerator and repeatedly call into it, where a bunch of branching happens (granted, the CPU will predict most of that correctly), state has to be saved, and the new value has to be returned. That it should cost so much to index the byte array explicitly that some optimization in the enumerator makes up for all the extra work seems really weird to me.
But then, I guess that's why we do performance testing, huh? :)
4. FileOptions.RandomAccess seems to work better than
"FileOptions.SequentialScan" for me, expecially if I call
"CountLines()" twice in a row. IE 300MB files, 20seconds for first,
and ~5seconds for second. Is this normal? Or should SequentialScan
be better for the above? Is disposing whats messing that up?
I would expect SequentialScan to be better for one-time access. After that, I'd be surprised if one was significantly better than the other. The speedup you're seeing is Windows caching the data from the file. It shows that for the first access, your code really isn't the bottleneck; the disk is.
I have to admit, I'm surprised that you are able to alter the first-time-scan performance much if at all by variations in the code. Once the file data's been cached it makes sense that code differences would show up. But for the first-time-scan scenario, only really bad code should be slower than the disk.
Pete
.
- References:
- Count Lines in (Huge) Text Files
- From: NvrBst
- Count Lines in (Huge) Text Files
- Prev by Date: Any Coding Tools or Utilities I Should be Using?
- Next by Date: Re: Any Coding Tools or Utilities I Should be Using?
- Previous by thread: Count Lines in (Huge) Text Files
- Next by thread: Re: Count Lines in (Huge) Text Files
- Index(es):
Relevant Pages
|