Re: Best way to read from directory with many files

Tech-Archive recommends: Repair Windows Errors & Optimize Windows Performance



On Sat, 21 Feb 2009 17:16:01 -0800, nickdu <nicknospamdu@xxxxxxxxxxxxxxxx> wrote:

If I have a multi-threaded application which processes files from a
directory, what might be the best way to divy those files up to multiple
threads? I don't want the threads to be colliding on the same files. Once a
thread is processing a file I need to make sure another thread doesn't start
processing it.

What's the bottleneck? Is your algorithm CPU-bound: mostly computations, with a little bit of i/o from the files to feed the computation? Or is the algorithm i/o-bound: mostly reading from the files, with a little bit of computation?

Or put another way: if you wanted it to go faster, would you need a faster disk, or a faster CPU?

The reason I ask is that if you're i/o-bound, then more than one thread is overkill and likely to reduce performance rather than improve it.

If you're CPU bound, then you can either partition the work ahead of time, handing a complete list of files to process to each thread before they start. Or you can have a central data structure that each thread consumes from as they make progress.

The former is better if you expect the actual processing to be quick and having many threads consuming from the same data structure would create too much contention to allow the algorithm to scale well. The latter is better if the processing is expected to take some time, and the processing time for each file is too variable to be able to effectively partition the work in advance.

Note that in either case, you won't want the number of threads to be much more than the number of CPU cores (ideally, you'd want exactly the same number, but having a few extra will allow the CPU to stay busy when a thread gets stuck waiting on the disk).

Also note that in the partition-in-advance scenario, you don't need a whole new data structure. You can take the array you get from Directory.GetFiles() and pass it to each thread, along with a start and end index for each thread to actually use.

Also, if there are a million or so files in the directory I'm
thinking that Directory.GetFiles() might not be the best way to access that
list.

A million-element array is big, but even if all the filenames are the maximum length, you're "only" talking about 256MB worth of storage for that. Even for 32-bit Windows, that's not a huge problem, and for 64-bit Windows it should be nothing. And of course, the real-world scenario probably involves file paths much shorter than that, and so much less memory usage.

Of course, there's the question as to why you've got directories with a million files in them. That seems abusive to me. :) But, that's a separate matter and I'll take as granted it has to be that way.

Can I open the directory as a file and read the entries that way?

If for some reason you find that using Directory.GetFiles() is in fact simply taking too much of a memory hit and hurting performance, I think you'll have to use the unmanaged API through p/invoke. You can see the functions FindFirstFile() and FindNextFile(), which can enumerate filenames one at a time for you, so that all the filenames don't have to be in memory all at once.

Pete
.



Relevant Pages

  • Re: pcib allocation failure
    ... pcib1: attempting to grow prefetch window for ... attempting to grow memory window for ... cpu0: on acpi0 ... <ACPI PCI bus> on pcib0 ...
    (freebsd-current)
  • Next July 27: boot failure(hang) on x86_64 box.
    ... Freeing unused kernel memory: 1360k freed ... ACPI: PM-Timer IO Port: 0x488 ... CPU: L2 Cache: 1024K ... # AX.25 network device drivers ...
    (Linux-Kernel)
  • [PATCH] Document Linuxs memory barriers [try #3]
    ... The attached patch documents the Linux kernel's memory barriers. ... I've tried to get rid of the concept of memory accesses appearing on the bus; ... barring implicit enforcement by the CPU. ...
    (Linux-Kernel)
  • Oops in 2.6.28-rc9 and -rc8 -- mtrr issues / e1000e
    ... Bios 1.04beta did show correct memory sizing in dmidecode, ... I hope this is as simple as me doing something glaringly wrong in the kernel ... DMI present. ... CPU: L2 cache: 6144K ...
    (Linux-Kernel)
  • Re: read vs. mmap (or io vs. page faults)
    ... not fit in main memory, and there are overheads related to the heuristics ... But because the CPU is underutilized, ... reasonably sized user buffer). ... You have to measure the actual overhead to see what the actual cost is. ...
    (freebsd-questions)