Re: Decoding strategy
- From: marcin.rzeznicki@xxxxxxxxx
- Date: 11 Oct 2006 13:19:55 -0700
Hi
[...]
It's hard to say for sure without all the details...I'm just pointing out
that these caching issues exist whether you're using a memory-mapped file or
just reading normally.
In general, from what I observed, the most frequent access is random yet
with a locality pattern. So I start with a random part of the file, mess
around not too far away from the beginning of the mapping, and then jump
somewhere else. So, a memory mapped view is sure to be usable for a while,
and I think it pays off to keep that in memory. That characteristic also
assures me that the OS cache can be helpful and performance will not suffer
from misses/disk reads very often.
[...]
Yes, the only advantage of FileStream I see so far is that it enables
file access at any offset, so the tearing problem can be prevented. The
tearing problem arises because you have to map the file at offsets aligned
to an allocation block boundary. But that would not really matter much if
I knew that I could solve the decoding problems reliably.
From this, I think that I may still not fully understand the question.
It is true that a file must be mapped to an aligned memory address. But
this should only affect the virtual address used to locate the file in the
virtual address space. That is, the first byte of the file will be on an
aligned address, but the rest of the file is contiguous from there.
Yes, I know. I was referring to something else. Sorry for being
unclear. Docs say:
"(..)must specify an offset within the file that matches the memory
allocation granularity of the system, or the function fails. That is,
the offset must be a multiple of the allocation granularity".
I know that this is going to be aligned to something in VM, but I do not
care, this is transparent unless you write kernel-mode stuff or,
generally, very low level stuff. What I do care about is that I cannot
choose the FILE offset at which the mapping starts. And that leads to
"tearing".
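To make the constraint concrete, here is a minimal sketch of the offset arithmetic, written in Python rather than C# purely for brevity. The function name aligned_view and the 64 KB default are assumptions for illustration; on Windows the real value is reported by GetSystemInfo (dwAllocationGranularity), which Python exposes as mmap.ALLOCATIONGRANULARITY.

```python
def aligned_view(offset, length, granularity=64 * 1024):
    """Return (map_offset, skip, map_length) for a view covering
    the file range [offset, offset + length).

    map_offset is rounded DOWN to a multiple of the granularity, because
    a view may only start there; skip is how far into the view the
    requested data begins.
    """
    map_offset = (offset // granularity) * granularity
    skip = offset - map_offset
    return map_offset, skip, skip + length


# A request at file offset 70000 has to be served by a view starting at 65536.
print(aligned_view(70000, 100))  # -> (65536, 4464, 4564)
```

Since every view starts on a multiple of the granularity, nothing guarantees that such a multiple coincides with the first byte of a character in a multi-byte encoding. That is where the "tearing" comes from.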
[...]
Judging from this:
That's what I wrote, except for the part "looking for a means around".
Well, depends on what you mean by this, but I'd rather not abandon
memory mapping. So I am not looking for "means around memory mapping"
but: living within memory mapping walls, how can I solve the "tearing"
problem?
I'm guessing that both of those issues are really just the same problem for
you. That is, that you have to address the file using non-contiguous
pointers.
Mm, now I do not understand :-) The memory I map is guaranteed to be
contiguous. It does not span the whole file, but the contents mapped
(the current "page" in my code, if you will) have to start at a specific
offset in the file - that is where I see the root of all evil :-)
Certainly it is :-)
That's how I wanted to implement fallback buffer. Each time I detect
"torn" char I reposition file pointer, probe bytes backward till I find
valid char and provide replacement.
Perhaps you could clarify under what situation you "detect a 'torn' char".
That is, it's unclear to me whether you are referring to simply jumping into
an offset that's not the start of a character, or if this somehow
specifically relates to the sectioning of the file caused by your memory
mapped i/o.
Well, yes, it can be thought of as jumping into a character, which, in
turn, is related to sectioning :-) How do I detect it? Hmm, that's an
interesting question. I am not sure, but if I were to use DecoderFallback
then that detection would happen by means of the decoder itself.
The former would be an issue even if you could map the entire file to a
single contiguous virtual address range. The latter is obviously only an
issue because of the sectioning of the file. I'm confused as to which it
is.
Hope I clarified :-)
[...]
Yeah, nice solution. Even though the performance hit may be noticeable, if
I restrict these operations to fallback times only and extend my index
structure to cache "torn" characters I should not need to execute that
code very often. Seems good to me. Yet ... :-( Can I be sure that the
decoder cannot mistake characters?
Well, as I mentioned...I can't help you with that question. :) That
depends on the nature of the data you're decoding, and I don't know enough
to be able to answer that.
Data is simply plain text file with some human readable text.
If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger
chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to what
the memory-mapped solution has to do behind the scenes for you anyway). If
that last sentence causes you some questions, let me know and I can
elaborate.
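As an illustration of the straddling problem being solvable, here is a small Python sketch (not C#, and not code from this thread) that decodes a file chunk by chunk with a stateful decoder. The helper name decode_chunks and the 3-byte chunk size are made up for the demonstration. As far as I know, .NET offers the same kind of stateful object through Encoding.GetDecoder(), which also remembers an incomplete trailing sequence between calls.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks into one string.

    The incremental decoder keeps the bytes of a character that is split
    across two chunks and finishes it when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder complain if a character is left unfinished.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "zażółć gęślą jaźń".encode("utf-8")
# Deliberately tiny chunks, so that some two-byte characters get split.
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
text = decode_chunks(chunks)
```

Decoding each chunk on its own would fail on the very first one here, because it ends in the middle of a two-byte character. This only covers sequential forward reading; jumping into the middle of the file is a separate matter.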
Well, yes, please. If you are able to show me how to solve that, then I
can mix memory mapping with direct file access at fallback times and be
perfectly happy.
Okay, let's see if this makes sense. First, keep in mind that my comment
was assuming a general solution to the file i/o problem. I think you should
be able to apply it as a "fallback" solution, but it may or may not be
better than just falling back to reading a few bytes at a time if that's
your approach.
That's what I planned to do. Apply it as "fallback", but I think we
somehow disagree on the meaning of "fallback". I used that word in terms
of "decoder fallback": speaking C#, it is an instance of the
DecoderFallback class; speaking more generally, something which provides
replacement chars to the decoder when it cannot, for some reason, decode a
sequence. I somehow suspect that you used it as "another plan". So, I
planned to use your solution as part of a DecoderFallback implementation,
which will read a few bytes back and try to concatenate these with bytes
from the beginning of the mapping.
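A rough sketch of that "read a few bytes back and concatenate" step, under the loud assumption that the file is UTF-8 (the thread never settles on an encoding). It is Python, not a DecoderFallback implementation, and repair_torn_start and its parameters are hypothetical names. In UTF-8 every non-first byte of a character has the bit pattern 10xxxxxx, which is what makes both the detection and the backward walk possible.

```python
def is_continuation(byte):
    """True for a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def repair_torn_start(preceding, window):
    """Rebuild the character that is split at the start of `window`.

    preceding: the (up to 3) bytes located just before the window in the file
    window:    the bytes of the current mapping
    Returns (char, consumed). char is None when the window starts cleanly;
    consumed is the number of leading window bytes that belong to char.
    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    consumed = 0
    while consumed < min(3, len(window)) and is_continuation(window[consumed]):
        consumed += 1
    if consumed == 0:
        return None, 0
    # Walk backward until the lead byte of the torn character is found.
    start = len(preceding)
    while start > 0:
        start -= 1
        if not is_continuation(preceding[start]):
            break
    char = (preceding[start:] + window[:consumed]).decode("utf-8")
    return char, consumed


data = "a€b".encode("utf-8")           # 61 E2 82 AC 62
split = 3                              # the window begins inside the euro sign
print(repair_torn_start(data[:split], data[split:]))  # -> ('€', 1)
```

The rest of the window, from index consumed onward, can then be handed to the normal decoder.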
[...]
That said:
What I meant was that you can read from the file a few blocks at a time,
keeping that buffer centered on where you are currently accessing. You'll
need to keep track of:
-- current file offset
-- an array of blocks read from the file
-- the file offsets those blocks came from
-- the current block
The general idea is to maintain the array of blocks such that there is an
odd number of blocks, at least three, and they are centered on the current
offset within the file you're reading. Normally, you'll be reading from the
middle block. If you skip over to another block, you drop one block from
the far end of the array, and read another adding it to the near end of the
array.
Basically, you're windowing the file in a fixed set of buffers. If you read
new data asynchronously to your use of the data in the buffers you currently
have, then when you drop a block at one end and fill it for use at the other
end, the file i/o can happen while you're still processing the data that you
do have.
Obviously if you jump to a completely different point in the file, you'll
have to wait for the surrounding data to be read, but that's an issue
whether you use memory mapped files or just read directly with a
FileStream.
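The windowing scheme described above could look roughly like this. It is a simplified, synchronous Python sketch: the asynchronous prefetching is left out, and the class name BlockWindow, the radius parameter and the reads counter are all invented for the example.

```python
import io


class BlockWindow:
    """Keep an odd number of file blocks centered on the block last accessed."""

    def __init__(self, stream, block_size=64 * 1024, radius=1):
        self.stream = stream          # any seekable binary stream
        self.block_size = block_size
        self.radius = radius          # blocks kept on each side of the center
        self.center = None            # index of the current block
        self.blocks = {}              # block index -> bytes read from the file
        self.reads = 0                # number of block reads, for illustration

    def _load(self, index):
        self.stream.seek(index * self.block_size)
        self.reads += 1
        return self.stream.read(self.block_size)

    def _recenter(self, center):
        wanted = range(max(0, center - self.radius), center + self.radius + 1)
        # Blocks still inside the window are reused; the ones that fell off
        # the far end are dropped and only the new ones are read.
        self.blocks = {i: self.blocks[i] if i in self.blocks else self._load(i)
                       for i in wanted}
        self.center = center

    def read(self, offset, count):
        """Return up to `count` bytes starting at file offset `offset`."""
        out = bytearray()
        while count > 0:
            index, start = divmod(offset, self.block_size)
            if index != self.center:
                self._recenter(index)
            piece = self.blocks[index][start:start + count]
            if not piece:             # past the end of the file
                break
            out += piece
            offset += len(piece)
            count -= len(piece)
        return bytes(out)


data = bytes(range(256)) * 4
window = BlockWindow(io.BytesIO(data), block_size=100)
first = window.read(250, 10)    # loads blocks 1, 2 and 3
second = window.read(295, 10)   # crosses into block 3, so only block 4 is read
```

Because a read that straddles a block boundary simply continues in the neighbouring block, a character split across two blocks is never a special case for the caller.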
Well, that's very close to what I have now. Let me specify the details.
I read a few "blocks" at a time, namely 4, which is 256 KB of data (a block
for me is the memory allocation granularity, as that is the smallest
addressable part of a file when it comes to memory mapping). I try to
adjust the offset a little, so that I always read the whole data I am
asked for, and immediate reads in the neighbourhood will not cause
remapping, which is close to your idea. But then, how do you know
whether the very first byte of the current "window" is the first byte of
a character?
[...]
After some thought I've found that this is the most significant
question. But let me rephrase what you wrote: it is no problem to find
characters when reading a byte sequence forward, and every sane encoding
must adhere to this in order to be usable. But is it the same case when
looking backward?
I still don't know. :) I suspect that it is, because it was true with the
basic MBCS I've seen. But I also realize that there are a LOT of different
ways to encode text, and some may be context-sensitive.
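To show that the answer really does depend on the encoding, here is a short Python comparison of two encodings picked only as examples (the thread does not say which encoding the file uses). In UTF-8 a single byte tells you whether it starts a character. In Shift-JIS the second byte of a two-byte character can fall in the ASCII range, so a byte looked at in isolation is ambiguous. The helper classify_utf8 is a name invented for this sketch.

```python
def classify_utf8(byte):
    """Classify one byte of UTF-8 text without any surrounding context."""
    if byte < 0x80:
        return "single"          # a complete one-byte character
    if (byte & 0xC0) == 0x80:
        return "continuation"    # never the first byte of a character
    return "lead"                # first byte of a multi-byte character


euro = "€".encode("utf-8")
kinds = [classify_utf8(b) for b in euro]   # one lead byte, then continuations

# Shift-JIS: katakana A is a two-byte character whose second byte, taken
# alone, is also a perfectly valid one-byte character.
katakana_a = "\u30a2".encode("shift_jis")
whole = katakana_a.decode("shift_jis")
tail_alone = katakana_a[1:].decode("shift_jis")
print(kinds, repr(whole), repr(tail_alone))
```

So with UTF-8, stepping back over continuation bytes always lands on the start of a character, while with an encoding like Shift-JIS the bytes near the window boundary are not enough and some known-good starting point (for example an index of character offsets) is needed.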
:-( That's a pain in the ass for me. If I knew that I could always look
back for the missing parts of a single character, then a mix of your
solution with memory mapping would be the best scheme.
Some of this really depends on what you mean by "encoding" and "decoding".
The word "encoding" is applied in a variety of ways. Two that could apply
here are the basic idea of text encoding, which mostly just has to do with
the character set, or some actual conversion of data, which has to do with
compressing the data, or translating it into a more portable format (MIME,
for example). I don't even know which of these meanings you're addressing,
making it even harder for me to know the answer. :)
I meant "basic idea of text encoding" :-)
[...]
So, summing up. I think that the question reduces to one about encoding
characteristics. You showed us a very good solution using FileStream. It
can be extended to mix these two approaches, which may be faster, but I
still do not know whether it is reliable.
Indeed, that is a question you should probably figure out. Earlier rather
than later. :) Sorry I can't be of more help on that front.
Pete, first of all thank you for a wonderful discussion, it was really
helpful. And I hope you'll add something more after reading the code
:-)
Pete