Decoding strategy
- From: marcin.rzeznicki@xxxxxxxxx
- Date: 9 Oct 2006 14:57:40 -0700
Hello everyone
I've got a little problem with choosing the best decoding strategy for
a nasty case. I have to deal with very large files which contain text
in various encodings. Their size makes loading a whole file into
memory in a single pass impractical. I solved this by implementing
memory mapping via P/Invoke, and I load the file's contents in chunks.
Since the files use different encodings, what I really do is map a
portion of the file into memory and then decode that part using
System.Text.Encoding. So far, so good, but it's not difficult to
imagine a serious problem with this approach. Since file processing is
not, and cannot be, sequential, and since memory mapping restricts the
offsets at which a mapping can start, a mapping can "tear" a character
apart. How should I deal with this? I thought of implementing a decoder
fallback which would check a few bytes behind the current mapping and
try to substitute the unrecognized chars, but I don't know whether that
is feasible. I do not know whether the decoder might accidentally
mistake a broken char for some valid character different from the
expected one. I guess it depends on the encoding used. What do you
think?
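
To make the "torn character" case concrete, here is a minimal sketch for UTF-8 only, written in Python for brevity rather than against the .NET API. It relies on the fact that UTF-8 continuation bytes always look like 10xxxxxx, so a chunk can be trimmed to whole characters by looking at a few bytes around each boundary. The function names are made up for illustration, and the sketch assumes the input is valid UTF-8 and that a few bytes past the end of the chunk are readable. Encodings that are not self-synchronizing (for example some legacy multi-byte code pages) would most likely need a different approach.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def utf8_char_aligned(data, start: int, end: int):
    """Hypothetical helper: widen/narrow [start, end) to character boundaries.

    The chunk "owns" every character whose first byte lies in [start, end).
    Continuation bytes at the start belong to the previous chunk and are
    skipped; a character torn at the end is completed by reading past `end`.
    """
    n = len(data)
    s = min(start, n)
    while s < n and is_continuation(data[s]):
        s += 1
    e = min(end, n)
    while e < n and is_continuation(data[e]):
        e += 1
    return s, e


def decode_chunk(data, start: int, end: int) -> str:
    """Decode the whole characters owned by the chunk [start, end)."""
    s, e = utf8_char_aligned(data, start, end)
    return bytes(data[s:e]).decode("utf-8")
```

With this ownership rule, neighbouring chunks never decode the same character twice and never drop one, whatever order they are processed in, so the chunk offsets can stay wherever the mapping constraints put them.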