Re: Best way to read from directory with many files
- From: "Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx>
- Date: Sat, 21 Feb 2009 21:29:49 -0800
On Sat, 21 Feb 2009 17:16:01 -0800, nickdu <nicknospamdu@xxxxxxxxxxxxxxxx> wrote:
If I have a multi-threaded application which processes files from a
directory, what might be the best way to divy those files up to multiple
threads? I don't want the threads to be colliding on the same files. Once a
thread is processing a file I need to make sure another thread doesn't start
processing it.
What's the bottleneck? Is your algorithm CPU-bound: mostly computations, with a little bit of i/o from the files to feed the computation? Or is the algorithm i/o-bound: mostly reading from the files, with a little bit of computation?
Or put another way: if you wanted it to go faster, would you need a faster disk, or a faster CPU?
The reason I ask is that if you're i/o-bound, then more than one thread is overkill and likely to reduce performance rather than improve it.
If you're CPU bound, then you can either partition the work ahead of time, handing a complete list of files to process to each thread before they start. Or you can have a central data structure that each thread consumes from as they make progress.
The former is better if you expect the actual processing to be quick and having many threads consuming from the same data structure would create too much contention to allow the algorithm to scale well. The latter is better if the processing is expected to take some time, and the processing time for each file is too variable to be able to effectively partition the work in advance.
Note that in either case, you won't want the number of threads to be much more than the number of CPU cores (ideally, you'd want exactly the same number, but having a few extra will allow the CPU to stay busy when a thread gets stuck waiting on the disk).
Also note that in the partition-in-advance scenario, you don't need a whole new data structure. You can take the array you get from Directory.GetFiles() and pass it to each thread, along with a start and end index for each thread to actually use.
Also, if there are a million or so files in the directory I'm
thinking that Directory.GetFiles() might not be the best way to access that
list.
A million-element array is big, but even if all the filenames are the maximum length, you're "only" talking about 256MB worth of storage for that. Even for 32-bit Windows, that's not a huge problem, and for 64-bit Windows it should be nothing. And of course, the real-world scenario probably involves file paths much shorter than that, and so much less memory usage.
Of course, there's the question as to why you've got directories with a million files in them. That seems abusive to me. :) But, that's a separate matter and I'll take as granted it has to be that way.
Can I open the directory as a file and read the entries that way?
If for some reason you find that using Directory.GetFiles() is in fact simply taking too much of a memory hit and hurting performance, I think you'll have to use the unmanaged API through p/invoke. You can see the functions FindFirstFile() and FindNextFile(), which can enumerate filenames one at a time for you, so that all the filenames don't have to be in memory all at once.
Pete
.
- Follow-Ups:
- Re: Best way to read from directory with many files
- From: nickdu
- Re: Best way to read from directory with many files
- References:
- Best way to read from directory with many files
- From: nickdu
- Best way to read from directory with many files
- Prev by Date: Best way to read from directory with many files
- Next by Date: CS0587 compiler error during XAML IMPORT/PARSE step
- Previous by thread: Best way to read from directory with many files
- Next by thread: Re: Best way to read from directory with many files
- Index(es):
Relevant Pages
|