Decoding strategy



Hello everyone,

I have a problem choosing the best decoding strategy for a nasty case.
I have to deal with very large files which contain text in various
encodings. They are too large to load into memory in one go, so I
implemented memory mapping using P/Invoke and I load the contents of
each file in chunks. Because the files use different encodings, what I
really do is map a portion of the file into memory and then decode that
portion using System.Text.Encoding.

So far, so good, but there is a serious problem with this approach.
File processing is not, and cannot be, sequential, and memory mapping
also restricts the offsets at which a mapping can start, so a mapping
boundary can "tear" a multi-byte character apart. How should I deal
with this?

I thought of implementing a decoder fallback which would check a few
bytes before the current mapping and try to substitute the unrecognized
characters, but I don't know whether that is feasible. I also don't
know whether the decoder might accidentally mistake a broken character
for a valid one that is different from the expected character. I guess
it depends on the encoding used. What do you think?
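
To make the tearing concrete, here is a minimal sketch. It is written in
Python rather than .NET purely for brevity, and it assumes the data is
UTF-8, where continuation bytes can be recognized by their bit pattern
(10xxxxxx). The names align_utf8_start and decode_utf8_chunk are
hypothetical helpers, not part of any library. The convention assumed is
that a chunk owns every character whose first byte lies inside it. This
kind of resynchronization probably does not carry over to every
encoding: UTF-16 would need even offsets plus a surrogate check, and
legacy multi-byte encodings such as Shift-JIS cannot in general be
resynchronized from an arbitrary offset.

```python
def align_utf8_start(data: bytes, offset: int) -> int:
    """Advance offset past UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


def decode_utf8_chunk(data: bytes, start: int, end: int) -> str:
    """Decode the characters whose first byte lies in [start, end)."""
    # Skip the tail of a character that was torn at the start of the chunk;
    # under the assumed convention it belongs to the previous chunk.
    start = align_utf8_start(data, start)
    # Extend the end so a character that begins inside the chunk is completed.
    end = align_utf8_start(data, end)
    return data[start:end].decode("utf-8")


text = "zażółć gęślą jaźń"
raw = text.encode("utf-8")

# A naive decode of a torn chunk fails: byte 2 starts a two-byte character.
try:
    raw[0:3].decode("utf-8")
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# Decoding fixed-size chunks with boundary alignment loses nothing.
chunk_size = 3
parts = [
    decode_utf8_chunk(raw, i, min(i + chunk_size, len(raw)))
    for i in range(0, len(raw), chunk_size)
]
```

Here `data` stands in for the mapped view. With a real mapping, the view
would have to extend a few bytes past the nominal chunk end so that a
character begun inside the chunk can be completed.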
