Re: Decoding strategy
- From: marcin.rzeznicki@xxxxxxxxx
- Date: 10 Oct 2006 12:47:54 -0700
Peter Duniho wrote:
<marcin.rzeznicki@xxxxxxxxx> wrote in message
news:1160502641.574989.204890@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I didn't test performance with FileStream, but maybe you can confirm:
does FileStream cache the contents of the file in memory?
FileStream does buffer, which is in a sense a kind of caching. You can
specify the buffer size when you create the FileStream.
I think there is a
slight speedup when using memory mapping, in that I do not have to hit
the disk all the time.
IMHO, the two major benefits to memory mapping are 1) convenience (as long
as your file access fits within the addressable space available to you), and
2) minimal and efficient virtual memory usage (the physical memory storage
of the data can be backed by the file itself, rather than using up swap file
space).
I agree with you. The second point especially is what I am struggling
to achieve. I think there is another advantage as well, which lies in
explicit access to the "memory buffer". Since I get a pointer (it is
unsafe, I know :-) ) to contiguous memory, I save one copy operation
each time I need to map a portion of the file into memory. The reason is
that FileStream, even though it buffers, does not give me access to its
buffer. To perform the subsequent decoding I would have to copy data
from the FileStream into a byte array and pass that to the decoder,
whereas with mapping I pass the pointer to the memory view of the file
directly to the decoder.
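
The "no extra copy" point can be illustrated outside .NET. The following is
only a rough Python analogue, not the code discussed in this thread: the
helper name read_window and the demo file are made up for illustration. It
maps a file and decodes a window of it through a memoryview slice, which
refers to the mapped bytes instead of copying them into a separate byte
array first.

```python
import mmap
import os
import tempfile

def read_window(path, offset, length, encoding="ascii"):
    """Decode `length` bytes at `offset` directly from a mapped view.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            with memoryview(mapping) as whole:
                # Slicing a memoryview yields a window onto the same
                # memory; no intermediate byte array is created.
                with whole[offset:offset + length] as window:
                    return str(window, encoding)

# Demo on a throwaway file filled with a repeating digit pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 100)
demo_path = tmp.name
demo_text = read_window(demo_path, 205, 10)
os.remove(demo_path)
```

The decoded string itself is of course a new object; the saving is only
the intermediate copy of the raw bytes.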
Any i/o speed advantage you can get with memory mapping, you can get with
normal file i/o using appropriate techniques.
Not with FileStream, I fear.
In my solution I simply open a mapping over the whole file and create
views as needed. Anyway, let's say that I did it using FileStream: I can
read some bytes from it, but I still face the same problem. How do I
interpret the first bytes I have read? Are they the beginning of a
character, or maybe the end of the "previous" character?
I'm not entirely sure I understand the question. Even using a memory mapped
file, if you jump into a random location in the middle, you can't tell
whether you're at the beginning of a new character or in the middle of one.
You need some point of reference to tell the difference.
Obviously true. I build a character index for myself, which tells me
approximately where to seek for a given character. When opening the file
I decode each block and ask the decoder how many chars are found in each
and every block. Then I build a data structure like this:
(100, 200, ..., 5000), which means chars 0-99 are in the first block,
100-199 in the second, and so on. Then, when I have to read a string
starting at, let's say, the 250th character, a simple index lookup tells
me that I should start mapping at the third block (index 2, counting
from zero). After mapping I decode the contents and calculate the needed
offset.
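
A minimal sketch of such an index, written in Python rather than .NET and
under stated assumptions: the function names build_char_index and
block_for_char are hypothetical, the whole file is taken as a bytes object
for brevity, and a stateful incremental decoder stands in for "asking the
decoder" how many chars each block holds. The index stores cumulative
counts, exactly like (100, 200, ..., 5000) above, and the lookup is a
binary search.

```python
import bisect
import codecs

def build_char_index(data, block_size, encoding="utf-8"):
    """Return cumulative character counts, one entry per block."""
    decoder = codecs.getincrementaldecoder(encoding)()
    index, total = [], 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        final = start + block_size >= len(data)
        # The decoder keeps incomplete trailing bytes, so a character
        # split across two blocks is counted where it completes.
        total += len(decoder.decode(block, final))
        index.append(total)
    return index

def block_for_char(index, char_pos):
    """Return the zero-based block in which character `char_pos` is counted."""
    return bisect.bisect_right(index, char_pos)

# The index from the post: character 250 falls in block index 2.
post_lookup = block_for_char([100, 200, 300], 250)

# A small real run: "aé" is 3 bytes in UTF-8, blocks are 4 bytes.
demo_index = build_char_index(("aé" * 5).encode("utf-8"), 4)
```

One consequence of counting this way: a character attributed to block k
may have its first bytes at the end of block k-1, which is why the index
only says approximately where to seek.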
If the file is entirely made up of contiguous Unicode characters, and thus
each character always starts on an even offset from the start of the file,
then that's one easy way to tell when you are at the beginning or middle of
a character. If that's the case though, then you could easily preserve that
characteristic even reading the file using FileStream.
Yes, but that's not the case.
On the other hand, if you are dealing with some other multibyte character
set, or it's all Unicode but there's other data that can cause the Unicode
characters to get shifted to odd offsets, then even using memory mapped
files you need to find a good point of reference before you decide whether
you're dealing with the start of a Unicode character.
I am using the index which I described above as that "point of reference".
Basically, I don't see how using the FileStream class versus using memory
mapping alters the underlying issue of determining what the character
boundaries are. You can read sections of the file using FileStream, and as
long as you keep track of what absolute file position those sections come
from, you can always translate the address of a byte from a partial section
back to an absolute file position, giving you the exact same position
information you'd have when using memory mapping.
It *is* true that reading the file into buffers by sections using the
FileStream class, you could wind up with partial data at the beginning or
end of one of these sections. The question there though is not knowing what
you've got (since as I point out above, you can just as easily determine
that whether using FileStream or memory mapping), but rather how to get back
the other part. To deal with that, you'd need an additional layer of
processing that can piece together these data that straddle read boundaries.
Yes, I agree. That's why I asked Kevin whether he sees some magical way
by which FileStream will get things right. So I do not think that
using FileStream, or any other i/o strategy for that matter, will help
me with my problem.
I agree that this is an area in which memory mapped files are more
convenient, but it shouldn't be that hard for you to maintain a small
"workspace" buffer in which this sort of reconstruction can take place. In
the simplest case, it need only be a single "char" in which you pull out one
byte at a time from the buffer read by FileStream and combine them as pairs
into the "char" buffer (that may or may not be efficient, depending on the
level at which you're processing the data...if you have to look at each and
every character anyway, it may not be all that bad).
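
The "workspace" idea can be sketched as follows. This is a Python
illustration under assumptions, not the .NET code in question: the
function name decode_in_chunks is made up, and Python's incremental
decoder plays the role of the small buffer by holding back the bytes of
a character that is split across two reads and joining them with the
next chunk. The chunk size is deliberately odd so that UTF-16 code units
get torn.

```python
import codecs
import io

def decode_in_chunks(stream, chunk_size, encoding="utf-16-le"):
    """Yield decoded text chunk by chunk, carrying split characters over."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Flush; this raises if the stream ended mid-character.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        # Incomplete trailing bytes stay inside the decoder until the
        # next call supplies the rest of the character.
        yield decoder.decode(chunk)

source = "héllo wörld"
stream = io.BytesIO(source.encode("utf-16-le"))
pieces = list(decode_in_chunks(stream, 3))  # 3-byte reads tear 2-byte units
reassembled = "".join(pieces)
```

This only works because the decoder is stateful and every chunk is fed
to it in order; it does not help when decoding starts at an arbitrary
offset in the middle of the file.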
Right, so here you come to the point where my doubts are born :-)
First of all, what's the best way to create the small buffer: a decoder
fallback, or maybe some other strategy that will do better? Or maybe
I screwed up everything and there is a better solution.
And is it always possible to reconstruct a character (keep in mind that
some encodings might not be as nice as the Unicode encodings)? I do not
know much about encodings in general, but while pondering this idea
I decided to check a few encodings and see whether I am right. I came
across the *** JIS encoding, which, I fear, can mistake a "torn"
character for a different one.
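
The worry can be reproduced with a small experiment. The encoding name
above is partly masked, so the sketch below assumes Shift JIS as one
representative encoding of this kind; it uses Python's codec names and is
only an illustration. In Shift JIS the second byte of a two-byte
character can fall in the ASCII range, so decoding that starts one byte
late may succeed silently and produce a different character. UTF-8 is
shown for comparison, where a torn start is rejected.

```python
text = "アイ"

# Shift JIS: each of these katakana takes two bytes.
sjis_bytes = text.encode("shift_jis")
sjis_torn = sjis_bytes[1:]                    # start mid-character
sjis_result = sjis_torn.decode("shift_jis")   # no error is raised
# sjis_result is "Aイ": the orphaned trail byte 0x41 is read as "A".

# UTF-8: a continuation byte cannot start a character, so the same
# mistake is detected instead of being silently mis-decoded.
utf8_torn = text.encode("utf-8")[1:]
try:
    utf8_torn.decode("utf-8")
    utf8_detected = False
except UnicodeDecodeError:
    utf8_detected = True
```

So for an encoding like this, a torn character cannot always be
recognised from the bytes alone; the decoder needs a known-good starting
offset.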
Pete
Thanks for the helpful reply.