Re: Program slow. Garbage Collection
- From: Peter Duniho <no.peted.spam@xxxxxxxxxxxxxxxxxx>
- Date: Mon, 16 Nov 2009 20:23:25 -0800
Lilith wrote:
[...]// Reading one line at a time avoids all that work involved
// in splitting up the data in the file, and avoids having
// a huge amount of data in-memory all at once.
I had tried this as one of my attempts to speed things up but does it
not require recreating the string each time? The results of the
change may have been hidden by my the larger problem. By reading in
the larger string I only had to build one string and call ToLower() on
it once. Would it be possible/feasible to create a static character
array and read the string into it each loop?
Possible? Sure. Anything's possible. Feasible? I believe it would dramatically increase the complexity of the code and not provide any appreciable performance gain. The three different possible approaches discussed so far -- "read the whole file", "read a line at a time as separate strings", "read a line at a time into a pre-allocated buffer" -- all have the bulk of their cost in the same thing: converting characters to lower-case.
Reading a line at a time into separate strings does wind up creating a large number of individual strings. But this really isn't that expensive. The allocation itself is a trivial cost, and the objects are extremely short-lived and so should get collected easily and quickly (as generation 0 objects).
Besides, note that in the implementation you posted, you go ahead and split the file data into an array of strings anyway, and convert each one to lower-case individually. The only difference is that rather than having a small number of objects created and destroyed over and over, that code creates a large number of objects all at once, which then persist for some time.
On the one hand, that does avoid the need to collect the objects created until the very end of processing. On the other hand, each line read winds up with some memory management (and thus GC) activity anyway (because of other operations that involve allocations), and the more objects that you have allocated, the more expensive that activity is (because the object graph has more nodes).
Also, these objects may survive long enough to wind up in the older generations of the heap; on the one hand, that alleviates the cost those objects cause for GC operations, but on the other hand you know for sure those objects are still going to get destroyed soon, so it'd be better for them to not wind up in the older generation where they may remain and cause increased memory pressure for a longer time. And of course, there's simply the overhead cost of having so much extra data in-memory at once regardless of what generation of the heap they wind up in. Smaller memory footprint often translates to better performance, regardless of what else is going on.
Anyway, that's a very long way of saying that it is far from a foregone conclusion that some technique different from simply reading and processing a single line at a time would produce any significant performance improvement, or any improvement at all. The only way to know for sure which is faster would be to code all potential candidates, figure out representative test cases for the code, and measure the actual performance.
Of course, if after the changes I suggest, you still find the performance unacceptable, you might find the exercise worthwhile. I don't _think_ you'd find the alternatives mentioned so far would be significantly better, but at least you'd know for sure. But I wouldn't spend the time on the exercise unless you had a very real performance problem that couldn't be addressed with some simpler improvement.
I suspect with the changes I've suggested, your primary bottleneck will be the disk i/o itself. :) That is, absent data cached already, it probably will take longer to read a line of text from the file than to process it. Even allowing for the likelihood that the i/o will be buffered, the processing probably won't add a significant amount of cost as compared to just reading the file without doing anything to it (*)
Pete
(*) (Which, by the way, suggests a good test before you try to optimize the code much more: see how long it takes just to read the file, and compare that to the cost of reading the file and processing it...you'll never get the whole algorithm to run faster than just reading the file without processing it, so unless doing that is MUCH faster than the existing "read and process" implementation, there's probably not much point in working on optimization. Oh, and of course, note that when testing the file i/o performance that you need to make sure you've eliminated the possibility of any caching effects...Windows will cache the file data even between executions of the program, so unless you prevent that or flush the cache, subsequent tests won't be comparable or even informative).
.
- Follow-Ups:
- Re: Program slow. Garbage Collection
- From: not_a_commie
- Re: Program slow. Garbage Collection
- References:
- Program slow. Garbage Collection
- From: Lilith
- Re: Program slow. Garbage Collection
- From: Peter Duniho
- Re: Program slow. Garbage Collection
- From: Lilith
- Program slow. Garbage Collection
- Prev by Date: Re: Program slow. Garbage Collection
- Next by Date: Find a running executable
- Previous by thread: Re: Program slow. Garbage Collection
- Next by thread: Re: Program slow. Garbage Collection
- Index(es):
Relevant Pages
|