Re: Concurrently streaming a file to HttpResponse and file IO



Anders Borum wrote:
I'm implementing support for disk based caching of binary resources (blobs) residing in a SQL database.

You mean file-based caching, or tiered caching. Calling it "disk based caching" borders on paradoxical -- it's not as if the SQL database isn't stored on disk.

Just for completeness:
http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

Of course, I wouldn't hop on the SQL Server 2008 bandwagon just yet, but when it's a stable solution it might very well solve a few problems like this.

Please make sure you've eliminated possible sources of slowness in the blob solution first, like disk fragmentation, a database server that's running out of memory or I/O bandwidth, network congestion, poorly chosen network settings (like the packet size for the SQL connection), DB drivers with known performance issues and probably other things I'm forgetting. What you're doing in a case like this is essentially second-guessing the system that was designed to handle scalable access to your data. That's the sort of thing that should be completely defensible.

The scenario is a number of concurrent requests (e.g. 10 or 20) asking for a resource (e.g. 32 MB in size) that has not yet been cached on disk.

Again, in a file. Or better yet, "as a file".

Note that, as with all caching strategies, you're facing concurrency and data integrity issues w.r.t. updates to the blob data. The only way you're not facing these issues is if the data is never or hardly ever updated, but that raises the legitimate question of whether you should be storing it as a blob to begin with if you know separate files will give better performance. (Blob storage has the advantage of simplifying backup strategies, but in many cases that's just not important, or not as important as performance.)
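One common way to keep stale files from being served is to put a version stamp in the cached file's name. The sketch below assumes, hypothetically, that a version number or last-modified stamp is stored alongside each blob and is cheap to query; `cache_path`, `lookup` and the `.bin` naming are made up for illustration. It is written in Python only for brevity and is not a claim about how the original application works.

```python
import os


def cache_path(cache_dir, blob_id, version):
    """File name encodes the blob's version, so an update changes the path."""
    return os.path.join(cache_dir, f"{blob_id}.{version}.bin")


def lookup(cache_dir, blob_id, current_version):
    """Return the cached file for the current version, or None if absent or stale."""
    path = cache_path(cache_dir, blob_id, current_version)
    return path if os.path.isfile(path) else None
```

This still costs a round trip to the DB to fetch the current version, and files for old versions have to be cleaned up separately, so it narrows the integrity problem rather than removing it.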

In order to serve each request (in keeping the application responsive),
an approach is to start streaming the resource from DB to the client
request - and simultaneously queue a task to the threadpool that streams
the resource to disk.

But you're caching the thing to get a speedup. While it's being written as a file, every client has to use the direct stream from the DB, which gives you no speedup (in fact, probably a penalty because of the additional I/O for the caching). Subsequent requests from other clients might go faster, but you'd still need a good deal of measuring to find out if this is worth it. I/O generally doesn't parallelize very well unless you plan for it.

Another approach would be to receive the request and check for the file's existence (using a cache for quick lookups). If it is not cached, check whether a "streaming register" contains an entry for the file, meaning it is currently being streamed to disk. If it is not in the "streaming register", queue a task to the thread pool (each worker thread is responsible for registering / unregistering its own streaming process).

The thread pool doesn't sound like a good way of doing this. It's intended for short-lived tasks that preferably can survive indefinite queueing (if the TP is occupied) to maximize the potential for parallelism. Trouble is, sequentially writing a large file to disk is neither short-lived nor parallelizable. This is true even if you split the worker tasks up into asynchronous chunks.

Assuming lots of clients want the same file, which is presumably the case you're optimizing for (as there's little point in optimizing the case where everyone wants a different file), it might make more sense to have a dedicated thread for the caching, writing one file at a time. You could make that X threads if your I/O subsystem can handle it, and you could also still use the thread pool, but then you should probably still keep track of how many requests you're issuing yourself.
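A minimal sketch of the dedicated-thread idea, combined with the "streaming register" from the question, might look like the following. It is Python purely for brevity; `CacheWriter`, `fetch_blob` and `wait_idle` are hypothetical names, and `fetch_blob` stands in for whatever streams the blob out of the database as byte chunks.

```python
import os
import queue
import tempfile
import threading


class CacheWriter:
    """Single background thread that writes cache files one at a time."""

    def __init__(self, cache_dir, fetch_blob):
        self.cache_dir = cache_dir
        self.fetch_blob = fetch_blob   # hypothetical: key -> iterable of byte chunks
        self._jobs = queue.Queue()
        self._in_flight = set()        # the "register": keys queued or being written
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def path_for(self, key):
        return os.path.join(self.cache_dir, key)

    def request(self, key):
        """Queue key for caching; return False if already cached or registered."""
        with self._lock:
            if key in self._in_flight or os.path.exists(self.path_for(key)):
                return False
            self._in_flight.add(key)
        self._jobs.put(key)
        return True

    def _run(self):
        while True:
            key = self._jobs.get()
            try:
                self._write(key)
            except Exception:
                pass  # key simply stays uncached; a real system would log this
            finally:
                with self._lock:
                    self._in_flight.discard(key)
                self._jobs.task_done()

    def _write(self, key):
        # Write to a temp file and rename, so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
        try:
            with os.fdopen(fd, "wb") as out:
                for chunk in self.fetch_blob(key):
                    out.write(chunk)
            os.replace(tmp, self.path_for(key))
        except Exception:
            os.remove(tmp)
            raise

    def wait_idle(self):
        """Block until every queued key has been processed."""
        self._jobs.join()
```

Request handlers would call `request(key)` and carry on streaming from the DB; only the first caller for a given key actually queues work. Going from one writer to X writers means starting X threads over the same queue, which keeps the number of concurrent disk writes an explicit, tunable number.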

What you don't want is your dual quad-core server using 8 threads to pester the hard disk with chunks from 20 different files -- if you like responsiveness that much better than request time, forget the whole caching thing and directly stream from the DB, saving yourself the overhead.

I guess what I'm asking for are guidelines (or "do's" and "don'ts"). Am I working in the right direction? :-)

There's an easy and definite "do" in all this, and that's profile, profile, profile, with realistic and consistent setups, never profiling individual bits, always the whole system. Your intuition on what ought to work better is going to fail you sooner or later (usually sooner) and it's just too easy to waste time on complicated mechanisms that enable new ways of failure without offering concrete improvements.

In that vein, establish exactly what sort of performance baselines you're looking for/expecting before you go off to fight the never-ending battle for truth, justice and improved performance, because it shouldn't be never-ending.
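To make "profile the whole system against a baseline" concrete, a rough harness could time complete requests under concurrent load and compare the summary with the target numbers. The sketch below is Python for brevity; `measure`, `summarize` and `serve_request` are hypothetical names, with `serve_request` standing in for one full end-to-end request against the real setup.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def measure(serve_request, keys, concurrency=20):
    """Time each whole request under concurrent load; return latencies in seconds."""
    def timed(key):
        start = time.perf_counter()
        serve_request(key)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, keys))


def summarize(latencies):
    """Reduce raw latencies to the figures a baseline would be stated in."""
    return {
        "median": statistics.median(latencies),
        "max": max(latencies),
        "count": len(latencies),
    }
```

Running the same key list against the plain DB-streaming version and the caching version, on the same hardware and data, gives numbers that can be checked against the baseline instead of against intuition.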

--
J.
http://symbolsprose.blogspot.com