Re: optimizing file i/o

Michael Powe <michael+tbird@xxxxxxxxxxxx> wrote:
> I have written a small app to parse web log files and extract certain
> lines to another file. There is also functionality to count all the
> items that are being filtered out.
>
> I wrote this in c# instead of in perl because the log files are 3-4GB
> and I want faster processing than perl would typically provide. And,
> I'm learning c#.
>
> There are two issues I would like to address: improve the speed of the
> file i/o and control the processing. Right now, this app takes about 20
> min to process a 3GB file on a laptop with a 2Ghz proc and 2GB RAM.
> Processing is implementing a method that both filters and counts. Also,
> it pegs my CPU while it's running.
>
> Below are the filtering and filtering/counting methods.

If the CPU is pegged (which I can understand, given the code), then the
I/O speed isn't the bottleneck: the time is going into processing each
line, not into reading the file.
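
One way to check this is to time a pass over the file that reads every
line and does nothing else. If that takes a small fraction of the 20
minutes, the filtering is the cost. This is only a rough sketch: the
class name ReadTiming and the method CountLines are made up for
illustration, and the log file path comes from the command line.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class ReadTiming
{
    // Reads every line and discards it; returns the number of lines read.
    // (Hypothetical helper, not part of the original poster's code.)
    public static long CountLines(string path)
    {
        long lines = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
            {
                lines++;
            }
        }
        return lines;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: ReadTiming <logfile>");
            return;
        }

        // Time the bare read loop with no regex work at all.
        Stopwatch timer = Stopwatch.StartNew();
        long lines = CountLines(args[0]);
        timer.Stop();

        Console.WriteLine("{0} lines in {1} ms", lines, timer.ElapsedMilliseconds);
    }
}
```

Bear in mind that a second run over the same file may be served partly
from the OS file cache, so the first run is the more honest number.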

Some suggestions:

1) Don't create the regular expressions freshly each time. I don't know
whether you've got a lot of small files or just a few big ones, but it
would make more sense to create them once, as you don't need to change
them.

2) Compile the regular expressions by passing RegexOptions.Compiled when
you create them. For patterns that are run against every line of a 3GB
file, this could improve things enormously.

3) Rather than using a hashtable, consider having an array of ints
along with your array of regular expressions. You could then iterate
through the regular expression array by index rather than by value, and
just increment the relevant int - no hashtable lookup, no unboxing and
then reboxing.

4) If most lines in the file will match one of the filters, try getting
rid of the "all" regular expression, working out the result just by
running all the others. It may not help, but it's worth a try.
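
The suggestions above could be combined along these lines. It's a
sketch, not a drop-in replacement: the three patterns are invented
(the real ones are in the original poster's code), the names LogFilter,
Filters, Counts and IsFiltered are hypothetical, and it assumes a line
should be counted against the first filter it matches only.

```csharp
using System;
using System.Text.RegularExpressions;

static class LogFilter
{
    // Created once and compiled. The patterns are placeholders for illustration.
    static readonly Regex[] Filters =
    {
        new Regex(@"\.gif\b", RegexOptions.Compiled),
        new Regex(@"\.css\b", RegexOptions.Compiled),
        new Regex(@"/robots\.txt", RegexOptions.Compiled),
    };

    // Counts[i] is the number of lines matched by Filters[i].
    // A plain int array: no hashtable lookup, no boxing.
    static readonly int[] Counts = new int[Filters.Length];

    // Returns true if some filter matched the line (and bumps its counter).
    // There is no separate "all" expression; the result comes from the loop.
    public static bool IsFiltered(string line)
    {
        for (int i = 0; i < Filters.Length; i++)
        {
            if (Filters[i].IsMatch(line))
            {
                Counts[i]++;
                return true; // assumption: first match wins
            }
        }
        return false;
    }

    static void Main()
    {
        string[] sample =
        {
            "GET /logo.gif HTTP/1.1",
            "GET /index.html HTTP/1.1",
            "GET /site.css HTTP/1.1",
        };

        int kept = 0;
        foreach (string line in sample)
        {
            if (!IsFiltered(line))
            {
                kept++;
            }
        }

        Console.WriteLine("kept {0}, gif {1}, css {2}", kept, Counts[0], Counts[1]);
    }
}
```

If a line needs to be counted against every filter it matches, drop the
early return and track a bool instead.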

Finally, use using statements for your stream readers and writers -
that way, if an exception is thrown, you'll still close the file
immediately.
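
For example, something like the following, where both files are closed
when the block exits, whether normally or via an exception. The names
CopyLines and Copy and the keep predicate are hypothetical; plug in
whatever filtering test you actually use.

```csharp
using System;
using System.IO;

static class CopyLines
{
    // Copies the lines for which keep(line) is true from inputPath to
    // outputPath and returns how many lines were written.
    public static int Copy(string inputPath, string outputPath, Predicate<string> keep)
    {
        int written = 0;

        // Both using statements dispose (and so close) their stream when
        // the block is left, even if ReadLine or WriteLine throws.
        using (StreamReader reader = new StreamReader(inputPath))
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (keep(line))
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }

        return written;
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: CopyLines <input> <output>");
            return;
        }

        // Placeholder predicate: keep every non-empty line.
        int written = Copy(args[0], args[1], delegate(string l) { return l.Length > 0; });
        Console.WriteLine("{0} lines written", written);
    }
}
```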

--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too