High-performance IO



I would like to obtain the highest possible disk IO throughput
under the following conditions. The program simultaneously reads
and writes several large files, 0.25 GiB or more each. Each file
must be processed in order, so the explicit fine-grained
parallelization provided by IOCP is not possible; each file has
its own processing thread. The files are accessed sequentially
and only once, so caching will not help. Currently I use four
independent memory buffers per file and overlapped IO to separate
the reading/writing phases from the processing phase. The IO
completion notifications are issued through Win32 event objects.
I set the FILE_FLAG_NO_BUFFERING flag to bypass
the filesystem cache and (hopefully) cause the system to
perform explicit DMA transfers directly into the buffers.
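
For reference, a minimal sketch of the setup described above, not the actual program. All names here (`kBuffersPerFile`, `next_slot`, `split_offset`, `open_unbuffered`, `start_read`) are hypothetical and chosen only for illustration; the Win32 part is compiled only when `_WIN32` is defined, and error handling is reduced to a boolean.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical constant: four buffers per file, as described above.
constexpr unsigned kBuffersPerFile = 4;

// Index of the buffer to use after `slot`, cycling through the ring.
inline unsigned next_slot(unsigned slot) {
    assert(slot < kBuffersPerFile);
    return (slot + 1) % kBuffersPerFile;
}

// OVERLAPPED carries the 64-bit file offset as two 32-bit halves.
// Returns {low half, high half}.
inline std::pair<std::uint32_t, std::uint32_t> split_offset(std::uint64_t offset) {
    return { static_cast<std::uint32_t>(offset & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(offset >> 32) };
}

#ifdef _WIN32
#include <windows.h>

// Opens an existing file for unbuffered, overlapped reading.
inline HANDLE open_unbuffered(const wchar_t* path) {
    return CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);
}

// Starts an asynchronous read; completion is signalled through `event`.
// Returns true if the read completed at once or is still pending.
inline bool start_read(HANDLE file, void* buffer, DWORD bytes,
                       std::uint64_t offset, OVERLAPPED& ov, HANDLE event) {
    const auto halves = split_offset(offset);
    ov = OVERLAPPED{};
    ov.Offset = halves.first;
    ov.OffsetHigh = halves.second;
    ov.hEvent = event;
    if (ReadFile(file, buffer, bytes, nullptr, &ov)) return true;
    return GetLastError() == ERROR_IO_PENDING;
}
#endif
```

The processing thread would wait on the event of the oldest outstanding request, process that buffer, reissue a read into it, and move on with `next_slot`.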

But several questions remain open:

1. How big should the data block be to obtain the highest
performance? Currently it is hardcoded to 512 KiB, but this
is an early implementation and is subject to change. If
it is OS/filesystem specific, how can I get the best
value at run time using WinAPI?

2. Do FILE_FLAG_NO_BUFFERING and FILE_FLAG_OVERLAPPED interfere somehow
with FILE_FLAG_SEQUENTIAL_SCAN?
Should I avoid the latter hint when the first two flags
are specified?

3. How should I allocate the memory buffer? A simple
VirtualAlloc will do, but it may optimize the virtual->physical
address mapping (cache coloring, clustering etc.) for memory
access, not for IO, and especially not for large DMA
transfers. Is there a way to allocate a contiguous block of
physical memory? Will it help much under NT/XP? I'm
asking because I don't know the details of NT's low-level
disk IO architecture and implementation.

4. Can I directly read into/write from an AWE buffer?

5. Is there something I forgot to use in order to obtain
the highest throughput? Apart from IOCP, of course. :-)
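
To make the constraints behind questions 1 and 3 concrete, here is a hedged sketch. It does not say which block size is fastest; it only shows the alignment rules that FILE_FLAG_NO_BUFFERING imposes (transfer sizes and file offsets must be multiples of the volume sector size, and buffer addresses must be sector-aligned). The helper names (`round_up`, `is_aligned`, `sector_size`, `alloc_io_buffer`) are hypothetical, and the Win32 part is compiled only when `_WIN32` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `value` up to the nearest multiple of `granule` (granule > 0).
inline std::uint64_t round_up(std::uint64_t value, std::uint64_t granule) {
    assert(granule > 0);
    return (value + granule - 1) / granule * granule;
}

// True if the address `p` is a multiple of `granule` bytes.
inline bool is_aligned(const void* p, std::size_t granule) {
    assert(granule > 0);
    return reinterpret_cast<std::uintptr_t>(p) % granule == 0;
}

#ifdef _WIN32
#include <windows.h>

// Sector size reported for the volume with the given root
// (for example L"C:\\"), or 0 if the query fails.
inline DWORD sector_size(const wchar_t* root) {
    DWORD sectorsPerCluster = 0, bytesPerSector = 0;
    DWORD freeClusters = 0, totalClusters = 0;
    if (!GetDiskFreeSpaceW(root, &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
        return 0;
    return bytesPerSector;
}

// Allocates a buffer whose size is a multiple of `sector` bytes.
// VirtualAlloc returns page-aligned memory; release it with
// VirtualFree(p, 0, MEM_RELEASE).
inline void* alloc_io_buffer(std::size_t wanted, DWORD sector,
                             std::size_t* actual) {
    *actual = static_cast<std::size_t>(round_up(wanted, sector));
    return VirtualAlloc(nullptr, *actual, MEM_RESERVE | MEM_COMMIT,
                        PAGE_READWRITE);
}
#endif
```

With a 4096-byte sector the current 512 KiB block is already a valid size, so `round_up` leaves it unchanged; the helper matters only once the block size becomes configurable.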

Best regards
Piotr Wyderski
