Re: Decoding strategy



I would use a FileStream instance to read the file. The FileStream class
supports random access to files, allowing you to jump around in the file.
You can read as little or as much as you want into memory when you need to.
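
For what it's worth, here is a rough sketch of that idea in C#. The class and method names (ChunkDecoder, DecodeRange) and the parameters are made up for illustration, and it assumes the starting offset already falls on a character boundary. It seeks to an arbitrary byte offset with FileStream and feeds the bytes to a single Decoder obtained from Encoding.GetDecoder(), which holds on to an incomplete trailing byte sequence between calls, so a character split across two reads is not torn.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the framework.
static class ChunkDecoder
{
    // Decodes up to byteCount bytes starting at offset, reading chunkSize bytes at a time.
    // Assumption: offset points at the first byte of a character.
    public static string DecodeRange(string path, Encoding encoding,
                                     long offset, long byteCount, int chunkSize)
    {
        // One Decoder for the whole range: it buffers an incomplete
        // multi-byte sequence at the end of a chunk until the next call.
        Decoder decoder = encoding.GetDecoder();
        StringBuilder result = new StringBuilder();
        byte[] bytes = new byte[chunkSize];
        char[] chars = new char[encoding.GetMaxCharCount(chunkSize)];

        using (FileStream fs = new FileStream(path, FileMode.Open,
                                              FileAccess.Read, FileShare.Read))
        {
            // Random access: no alignment restriction on the offset.
            fs.Seek(offset, SeekOrigin.Begin);

            long remaining = byteCount;
            while (remaining > 0)
            {
                int want = (int)Math.Min((long)chunkSize, remaining);
                int read = fs.Read(bytes, 0, want);
                if (read == 0)
                {
                    break; // end of file
                }
                remaining -= read;

                // flush: false keeps any trailing partial character in the decoder.
                int n = decoder.GetChars(bytes, 0, read, chars, 0, false);
                result.Append(chars, 0, n);
            }
        }

        // A partial character left at the very end of the range is not emitted.
        return result.ToString();
    }
}
```

This only takes care of the end of each chunk. If you jump to an offset that lands in the middle of a character, the decoder has no way to know that, so you would still need to find a character boundary yourself, and how to do that depends on the encoding.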

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Shooter
http://unclechutney.blogspot.com

A man, a plan, a canal, a palindrome that has.. oh, never mind.

<marcin.rzeznicki@xxxxxxxxx> wrote in message
news:1160431060.920574.5670@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hello everyone,
I've got a little problem with choosing the best decoding strategy for
a nasty problem. I have to deal with very large files which contain
text in various encodings. Their size makes loading a whole file into
memory in a single pass impractical. I solved this by implementing
memory mapping via P/Invoke, and I load the file's contents in chunks.
Since the files use different encodings, what I actually do is map a
portion of the file into memory and then decode that portion using
System.Text.Encoding. So far, so good, but it's not difficult to
imagine a serious problem with this approach. Since file processing is
not, and cannot be, sequential, and memory mapping furthermore
restricts the offsets at which a mapping can start, a mapping can
"tear" a character apart. How should I deal with this? I thought of
implementing a decoder fallback which would check a few bytes before
the current mapping and try to substitute for the unrecognized chars,
but I don't know whether that is feasible. I also do not know whether
the decoder might accidentally mistake a broken char for some valid
but unintended character. I guess it depends on the encoding used.
What do you think?