Re: Decoding strategy



<marcin.rzeznicki@xxxxxxxxx> wrote in message
news:1160509674.674707.253020@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I agree with you. The second point especially is what I struggle to
achieve. I think there is also another advantage, which lies in
explicit access to the "memory buffer". Since I get a pointer (it is unsafe, I
know :-) ) to contiguous memory, I save one copy operation each time I
need to map a portion of the file into memory.

That's true. But since your use of the file is non-trivial, it is likely
that the copying of data from one memory location to another will not
dominate the performance of your program.

In other words, worry about that bridge when you come to it. First step is
to get something that works. :)

Reason being, FileStream, even
though using buffering, does not give me access to it.

It doesn't give you direct access, you're right. But merely by reading from
the file in large chunks at a time, even if it does so in a way opaque to
your own code, performance may well be acceptable.

Keep in mind that if you are not reading from the file in a purely
sequential way, even memory mapping the file may or may not buffer in a way
that optimizes your access to the file.

[...]
Any i/o speed advantage you can get with memory mapping, you can get with
normal file i/o using appropriate techniques.

Not with FileStream I fear.

But your fears might be unfounded. I can't really say for sure one way or
the other without having a full-blown implementation in my hands to look at.
But getting data from the hard disk is going to be a major bottleneck, as
will sifting through it after it's been safely stored in memory. As long as
that data has been buffered somewhere, it may not really matter that it gets
copied one or two extra times once in memory.

[...]
I'm not entirely sure I understand the question. Even using a memory
mapped file, if you jump into a random location in the middle, you can't
tell whether you're at the beginning of a new character or in the middle
of one. You need some point of reference to tell the difference.

Obviously true. I build a character index for myself, which tells me
approximately where to seek for a given character.

How large are these indexes? You might keep in mind that consuming RAM in
the form of an index is likely to interfere with the memory mapped file in
at least a couple of ways: one, by fragmenting your virtual memory space
(thereby limiting the size of the file you can deal with) and two, by
consuming physical RAM to deal with the indexes, so that you may wind up
flushing file data out of physical RAM sooner than you'd like.

The latter issue is a problem whether you're using memory mapping or not, so
I'm not trying to say this is a significant factor in deciding between the
two. My main point is that the indexes are one thing that may cause more
disk i/o to occur, thus further reducing the significance of any
additional memory-to-memory data copies.

[...]
Yes, I agree. That's why I asked Kevin whether he sees some magical way
by which FileStream will get things right. So, I do not think that
using FileStream, or any other i/o strategy for that matter, will help
me with my problem.

Well, one advantage of using the FileStream class is that since you need to
do more explicit handling of the file i/o, it gives you an opportunity to
address the issue you're asking about.

That said, it seems to me that in terms of the specific question you're
asking, memory mapped file i/o is the best solution. It has its
limitations, as you've already pointed out, but if you can live with those
limitations then it's a good solution.

However, that's not how I interpreted the question you asked. My apologies
if I misunderstood, but the way I read it is that you've stated the
limitations of memory mapped file i/o and are looking for a way around
them. The only way around them is to use more conventional file i/o, in the
form of the FileStream class or something similar.

[...]
Right, so here you come to the point where my doubts are born :-)
First of all - what's the best way to create a small buffer - is it
decoder fallback, or would some other strategy do better?

IMHO, the first thing you should do is try just using a FileStream directly.
Give it some reasonably large buffer size to use (at least a handful of file
blocks, which are usually 4K each), and read data from the file as you need
it, even if this means reading just a small number of bytes at a time
(between one and four, depending on where your decoder is and what data is
being processed).

For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you could
post a line or two that demonstrates how you actually access the data, that
would be useful), you could do this instead with a FileStream (let's call it
"fsDecode"):

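// Buffer for the bytes the decoder wants
byte[] rgbDecode = new byte[cbDecode];

// Seek to the requested offset, then read. The return value of Read is
// not checked here (see the side note below).
fsDecode.Seek(ibDecode, SeekOrigin.Begin);
fsDecode.Read(rgbDecode, 0, cbDecode);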

Then you've got your bytes in the byte array ready for processing. There's
no tearing issue, and most of the time the read will come from memory,
buffered by the FileStream object. The biggest problem here would be the
high overhead from calling Seek and Read over and over. But it's a nice
simple approach. :)

(A side note: you may actually find the BinaryReader class more suitable, as
the FileStream.Read method can in theory actually return fewer bytes than
you ask for, even if you don't reach the end of the file...I left out the
return value checking for simplicity, but you might need to include that if
you don't use BinaryReader. BinaryReader.ReadBytes will always return as
many bytes as you ask for, unless it reaches the end of the file and can't).
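
A minimal sketch of the return-value checking mentioned in the side note,
for the case where BinaryReader is not used. The names "StreamHelpers" and
"ReadFully" are hypothetical helpers made up for illustration and are not
part of the .NET API; only Stream.Read is a real call here.

```csharp
using System;
using System.IO;

static class StreamHelpers
{
    // Hypothetical helper: keeps calling Read until 'count' bytes have
    // arrived or the stream ends. Returns the number of bytes actually
    // read, which is less than 'count' only at the end of the stream.
    public static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            // Read may legally return fewer bytes than requested.
            int n = stream.Read(buffer, offset + total, count - total);
            if (n == 0)
                break; // end of stream
            total += n;
        }
        return total;
    }
}
```

With that in place, the Read call in the example above would become
something like "StreamHelpers.ReadFully(fsDecode, rgbDecode, 0, cbDecode)",
and a result smaller than cbDecode means the file ended early.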

Once you've done that, then you've got your worst-case scenario. That's
likely to be the poorest-performing way to read the file, and if it turns
out to be fast enough, you can just stop right there. :)

If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to what
the memory-mapped solution has to do behind the scenes for you anyway). If
that last sentence raises some questions for you, let me know and I can
elaborate.
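
One possible way to handle the straddling problem, sketched under some
assumptions: the file is read front to back in fixed-size chunks, and the
encoding is one that .NET's System.Text.Decoder supports. A Decoder object
keeps any incomplete trailing bytes from one GetChars call and completes
the character on the next call, so a character split across two chunks is
not torn. The names "ChunkedDecoding" and "DecodeInChunks" are made up for
illustration, and random access through an index would need more work than
this shows.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedDecoding
{
    // Hypothetical helper: decodes a whole stream by reading it in chunks
    // of 'chunkSize' bytes (chunkSize must be positive).
    public static string DecodeInChunks(Stream stream, Encoding encoding, int chunkSize)
    {
        // The Decoder is stateful: bytes of a partial character at the end
        // of one chunk are held until the next chunk supplies the rest.
        Decoder decoder = encoding.GetDecoder();
        byte[] bytes = new byte[chunkSize];
        char[] chars = new char[encoding.GetMaxCharCount(chunkSize)];
        StringBuilder result = new StringBuilder();

        int n;
        while ((n = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, n, chars, 0);
            result.Append(chars, 0, charCount);
        }
        return result.ToString();
    }
}
```

The chunk size here only affects how often the underlying stream is read;
the decoded text should come out the same whatever size is chosen.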

Or maybe
I screwed up everything and there is a better solution.
And - is it always possible (keep in mind that some encodings might
not be as nice as the Unicode encodings) to reconstruct a character?

I don't know. That's a somewhat different question and doesn't have much to
do with the file i/o method you use. I don't have a lot of experience with
multibyte character encodings, but as far as I recall from my limited use of
them, an initial byte always looks different from a subsequent byte within a
given character. So you can always work your way backwards to find an
initial byte and start decoding from there.
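
For UTF-8 specifically, the "work your way backwards" idea can be sketched
as below. In UTF-8 every continuation byte has the bit pattern 10xxxxxx and
no initial byte does, so stepping back until a non-continuation byte is
found lands on the start of a character. This sketch assumes valid UTF-8
data already in a byte array; it says nothing about whether other multibyte
encodings allow the same trick. "Utf8Sync" and its methods are hypothetical
names.

```csharp
using System;

static class Utf8Sync
{
    // In UTF-8, continuation bytes always look like 10xxxxxx.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Steps backwards from 'index' to the nearest byte that is not a
    // continuation byte, i.e. the first byte of the character that
    // contains 'index'. Stops at index 0 if nothing else is found.
    public static int FindCharStart(byte[] data, int index)
    {
        int i = index;
        while (i > 0 && IsContinuationByte(data[i]))
            i--;
        return i;
    }
}
```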

I do not
know much about encodings in general, but while pondering on this idea
I decided to check a few encodings and see whether I am right. I came
across *** JIS encoding, which, I fear, can mistake "torn" character
for a different one.

I hope that's not the case, but if it is you have that issue whether you use
memory-mapped file i/o or not. Or alternatively, if you think that
memory-mapped file i/o solves that issue, maybe if you explain why it is you
think that, it would help us understand your question better. :)

Pete

