Re: Best way to read from directory with many files
- From: nickdu <nicknospamdu@xxxxxxxxxxxxxxxx>
- Date: Mon, 23 Feb 2009 07:22:10 -0800
Thanks Peter.
Nothing has to be the way it currently is but based on some loose
requirements the current design seems appropriate. Let me explain the
application a bit more which hopefully will shed some more light on the
subject.
We've got files generated from several (maybe a hundred or so) servers.
These files are ETL files which contain trace information. We've got a
processing server which processes these files. Processing involves using the
Win32 APIs OpenTrace()/ProcessTrace()/CloseTrace() (via P/Invoke) to gain
access to the data in these trace files and then inserting appropriate rows
into a trace database. The database is on a different server.
So as you can see we've got a bunch of machines on one end generating
traces, our processing server in the middle, and a DB at the other end. We
could have our processing engine support socket connections allowing the
servers generating the trace to send the data directly to the processing
process. However, this would mean our processing engine would always have to
be online and that's not desirable. So asynchronous behavior is one
requirement. We could use queuing, which is kind of what we have anyway, but
I would rather go with the file system queuing as opposed to MSMQ or MQSeries.
Our application is mostly IO bound. In terms of the large number of files I
mentioned in the directory, this should only occur at times when our
processing engine has to be down for some period of time. However, when this
condition occurs I don't want to pay a huge cost if for some reason the file
system APIs I'm using don't behave nicely in this condition. For instance, we
often see Windows Explorer hang for minutes trying to display a folder with a
huge number of files. Having a more efficient pull model interface like
IEnumFile() (or something like that) might make more sense in this case. So I
was just wondering if opening the directory myself and enumerating the entries
might be more efficient. Of course, as files are processed and removed this
might become unmanageable.
I've also run into issues when opening the files that show up in the
directory we're processing. Sometimes the process creating the file is not
done with it yet, so the processing engine encounters an error trying to open
it exclusively. I'm not sure if there is a prescribed way to handle this type
of workload. I believe some of the Unix-type processes (sendmail maybe) work
this way, by processing files that show up in a directory.
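
One common pattern for this kind of drop directory, sketched here in Python only to keep it short, is for the writer to create the file under a temporary name and rename it once it is complete, so the reader never sees a half-written file. This assumes the writers can be changed, that both names live on the same volume, and that a ".tmp" suffix (a hypothetical convention, as are the function names) marks unfinished files. If the writers cannot be changed, catching the open failure and retrying the file on a later pass is the usual fallback.

```python
import os

def publish(directory, name, data):
    """Write under a temporary name, then rename, so readers never see a partial file."""
    tmp_path = os.path.join(directory, name + ".tmp")  # hypothetical suffix
    final_path = os.path.join(directory, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic when both paths are on the same volume.
    os.replace(tmp_path, final_path)
    return final_path

def ready_files(directory):
    """Return the names of completed files, skipping anything still marked .tmp."""
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file() and not entry.name.endswith(".tmp")
    )
```

The processing engine would then only ever open names returned by ready_files().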
--
Thanks,
Nick
nicknospamdu@xxxxxxxxxxxxxxxx
remove "nospam" change community. to msn.com
"Peter Duniho" wrote:
On Sat, 21 Feb 2009 17:16:01 -0800, nickdu <nicknospamdu@xxxxxxxxxxxxxxxx> wrote:
If I have a multi-threaded application which processes files from a
directory, what might be the best way to divvy those files up to multiple
threads? I don't want the threads to be colliding on the same files. Once a
thread is processing a file I need to make sure another thread doesn't start
processing it.
What's the bottleneck? Is your algorithm CPU-bound: mostly computations,
with a little bit of i/o from the files to feed the computation? Or is
the algorithm i/o-bound: mostly reading from the files, with a little bit
of computation?
Or put another way: if you wanted it to go faster, would you need a faster
disk, or a faster CPU?
The reason I ask is that if you're i/o-bound, then more than one thread is
overkill and likely to reduce performance rather than improve it.
If you're CPU-bound, then you can either partition the work ahead of time,
handing a complete list of files to process to each thread before they
start, or you can have a central data structure that each thread consumes
from as it makes progress.
The former is better if you expect the actual processing to be quick and
having many threads consuming from the same data structure would create
too much contention to allow the algorithm to scale well. The latter is
better if the processing is expected to take some time, and the processing
time for each file is too variable to be able to effectively partition the
work in advance.
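
A minimal sketch of the central-data-structure approach, in Python for brevity: every worker pulls the next path from one thread-safe queue, so no two threads can receive the same file. The names process_all, handler and worker_count are hypothetical, and handler stands in for whatever per-file processing is needed.

```python
import queue
import threading

def process_all(paths, handler, worker_count=4):
    """Run handler(path) once per path, with workers pulling from a shared queue."""
    work = queue.Queue()
    for path in paths:
        work.put(path)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                path = work.get_nowait()  # each path is handed to exactly one worker
            except queue.Empty:
                return  # nothing left to do
            outcome = handler(path)
            with results_lock:
                results.append((path, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because a slow file only ties up the one worker handling it, this shape copes with uneven per-file processing times.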
Note that in either case, you won't want the number of threads to be much
more than the number of CPU cores (ideally, you'd want exactly the same
number, but having a few extra will allow the CPU to stay busy when a
thread gets stuck waiting on the disk).
Also note that in the partition-in-advance scenario, you don't need a
whole new data structure. You can take the array you get from
Directory.GetFiles() and pass it to each thread, along with a start and
end index for each thread to actually use.
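
The partition-in-advance idea might look roughly like this, again in Python as an illustration only: the one list of files is shared, and each thread gets a half-open (start, end) index range that no other thread touches. partition_bounds and process_partitioned are hypothetical names, and handler is a placeholder for the real work.

```python
import threading

def partition_bounds(total, parts):
    """Split range(total) into `parts` contiguous half-open (start, end) ranges."""
    base, extra = divmod(total, parts)
    bounds = []
    start = 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)  # first `extra` ranges get one more
        bounds.append((start, end))
        start = end
    return bounds

def process_partitioned(files, handler, thread_count):
    """Give each thread its own slice of the shared list; no locking is needed."""
    results = [None] * len(files)

    def run(start, end):
        for i in range(start, end):
            results[i] = handler(files[i])  # each index is written by one thread only

    threads = [
        threading.Thread(target=run, args=bounds)
        for bounds in partition_bounds(len(files), thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```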
Also, if there are a million or so files in the directory I'm
thinking that Directory.GetFiles() might not be the best way to access
that list.
A million-element array is big, but even if all the filenames are the
maximum length, you're "only" talking about 256MB worth of storage for
that. Even for 32-bit Windows, that's not a huge problem, and for 64-bit
Windows it should be nothing. And of course, the real-world scenario
probably involves file paths much shorter than that, and so much less
memory usage.
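
For reference, the arithmetic behind that estimate can be written out as below. The 256MB figure matches about 256 characters per name at one byte per character; .NET strings are stored as UTF-16, two bytes per character, so the worst case is closer to double that, and neither number includes per-string object overhead. The function name and the sample name lengths are assumptions for illustration.

```python
def estimated_name_bytes(file_count, chars_per_name, bytes_per_char):
    """Rough size of the character data alone, ignoring per-string overhead."""
    return file_count * chars_per_name * bytes_per_char

# One million maximum-length names, one byte per character.
one_byte = estimated_name_bytes(1_000_000, 256, 1)   # 256,000,000 bytes

# The same names held as UTF-16 strings.
utf16 = estimated_name_bytes(1_000_000, 256, 2)      # 512,000,000 bytes

# A more typical case: 40-character names held as UTF-16.
typical = estimated_name_bytes(1_000_000, 40, 2)     # 80,000,000 bytes
```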
Of course, there's the question as to why you've got directories with a
million files in them. That seems abusive to me. :) But, that's a
separate matter and I'll take as granted it has to be that way.
Can I open the directory as a file and read the entries that way?
If for some reason you find that using Directory.GetFiles() is in fact
simply taking too much of a memory hit and hurting performance, I think
you'll have to use the unmanaged API through p/invoke. You can see the
functions FindFirstFile() and FindNextFile(), which can enumerate
filenames one at a time for you, so that all the filenames don't have to
be in memory all at once.
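
To show the shape of one-at-a-time enumeration, here is a Python sketch rather than the p/invoke code itself. os.scandir() yields directory entries lazily instead of building the whole list first, which plays the same role as a FindFirstFile()/FindNextFile() loop. The helper names and the batch size are hypothetical.

```python
import os
from itertools import islice

def iter_file_paths(directory):
    """Yield file paths one at a time without materializing the full listing."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path

def batches(iterable, size):
    """Group any iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

A consumer could then loop over batches(iter_file_paths(some_directory), 1000) and hold only one batch of names in memory at a time.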
Pete