Re: Help with Streams



Jonathan Wood wrote:
Hi Peter,

Hmmm...well, looking at the StreamReader class in Reflector, it looks to me as though it doesn't store any state information regarding the current reader position versus the stream's position. DiscardBufferedData() simply causes any data retrieved from the stream but not yet read as characters to be lost.

Yup, that's how it appears.

Unfortunately, thinking about it more, I think that you are running into a more fundamental issue. Consider why StreamReader might not support this particular kind of usage. In particular, it's actually extremely inconvenient to maintain a mapping between the reader and stream positions, and doing so would perform very poorly in any case, because you would have to decode the bytes to characters one at a time. You could still buffer the stream data into a byte buffer, but even the overhead of having to call the decoder one character at a time would be very noticeable. This is especially problematic for character encodings where you have variable-length characters (e.g. UTF-8, which is the default encoding for StreamReader), since for characters longer than one byte, you'd have to try to decode the one character as many times as there are bytes in the character (i.e. keep trying until you get a completed character).
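
To make the cost concrete, here is a minimal sketch of the byte-at-a-time decoding described above. It is written in Python rather than C# purely as a language-neutral illustration; the function name `decode_with_offsets` is made up for this example, and it is not how StreamReader works internally.

```python
import codecs

def decode_with_offsets(data: bytes, encoding: str = "utf-8"):
    """Yield (byte_offset, char) pairs by feeding the decoder one byte at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    start = 0  # byte offset where the character currently being decoded began
    for i in range(len(data)):
        # The decoder returns "" until it has seen every byte of a character.
        text = decoder.decode(data[i:i + 1])
        for ch in text:
            yield start, ch
        if text:
            start = i + 1

# "a", the euro sign (3 bytes in UTF-8), then "b"
pairs = list(decode_with_offsets("a\u20acb".encode("utf-8")))
print(pairs)  # [(0, 'a'), (1, '€'), (4, 'b')]
```

The euro sign needs three decoder calls before a character comes back, which is the "keep trying until you get a completed character" overhead, paid for every multi-byte character in the file.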

Well, the C libraries handled this, although there were some quirks in text mode, which did stuff like translate \r\n to \n. I assumed the reason it wasn't supported was because the underlying stream might not know where the data started or ended, depending on the source. At any rate, it doesn't support it.

So, you could reimplement a special-purpose reader that supported the functionality you wanted. It wouldn't even be that hard. But, it would be at least a little awkward to write, and could perform very poorly as well. It might perform so poorly that you'd find yourself deciding to just reimplement the decoder too, so that you can combine the stream i/o and decoding into a single operation where there is always direct access to information about the stream position for the current character being decoded.

That might be a little beyond my current knowledge of the .NET frameworks. I'd hate to implement my own buffering and line routines. It'd probably be easier to just open the file twice and have my hash routine figure out where it needs to go.

You aren't specific about how you're using the data, but depending on your needs you might be able to approach the problem from a different angle and achieve the same results. In particular, it's not clear from your question whether you are actually doing something with the characters you read, or if you are just reading them to find a particular spot in the file. If it's the latter, then you could actually encode the search string itself into the bytes representing that string, and then scan the stream bytes for a matching sequence of bytes. That way, you're never actually decoding the bytes from the stream at all.
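
As a rough sketch of that byte-scanning idea (Python used for illustration only; `find_bytes` and the marker string are hypothetical names, not part of any library), the search string is encoded once and the raw stream is scanned in chunks, keeping a short overlap so a match that straddles two chunks is not missed:

```python
import io

def find_bytes(stream, needle: bytes, chunk_size: int = 4096) -> int:
    """Return the byte offset of the first match of needle, or -1 if absent.

    The offset is relative to the stream position when the call started.
    """
    if not needle:
        return 0
    tail = b""      # last len(needle) - 1 bytes carried over from earlier chunks
    consumed = 0    # bytes read from the stream before the current chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        idx = window.find(needle)
        if idx != -1:
            return consumed - len(tail) + idx
        consumed += len(chunk)
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""

needle = "END-OF-HEADER".encode("utf-8")   # encode the search string once
stream = io.BytesIO(b"x" * 5000 + needle + b"rest")
print(find_bytes(stream, needle))  # 5000
```

This is safe for UTF-8 because an encoded string can only match on a character boundary. With a fixed-width multi-byte encoding such as UTF-16, a byte-level match could in principle start in the middle of a character, so the result would need an alignment check.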

The first line is a header, which contains a hash code on the rest of the file. I need to verify the hash code. If the code is good, I need to read the rest of the lines, which I would be doing something with.


In order to verify the hash code, you'll need to read the entire file once anyway.

Then, if it checks out, you decode the bytes in the file after the end of the first line as strings.

You could either:

A)
1. Read the first line of the file, parse the header, and store the position of the end of the first line (EOFL). The hashed body begins at the next byte, the beginning-of-bytes position (BOB = EOFL + 1).
2. Read the bytes of the file starting from BOB, and compute the hash (also including any other hash inputs, such as a salt from the header)
3. Compare expected hash value to actual hash value.
4. If it matches, then read the lines from the file, skipping/discarding the first/header.
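
Method A might look roughly like the following. This is only a sketch in Python under loud assumptions: the header line is assumed to contain nothing but a hex SHA-256 digest of the remaining bytes (your real header format and hash algorithm will differ), the stream is assumed to be binary and seekable, and `verify_and_read` is a made-up name. It seeks back to BOB rather than reopening the file, which amounts to the same thing.

```python
import hashlib
import io

def verify_and_read(stream, encoding: str = "utf-8"):
    """Return the body lines if the header hash matches, else None."""
    header = stream.readline()       # raw bytes up to and including b"\n"
    body_start = stream.tell()       # BOB: first byte after the header line
    expected = header.decode(encoding).strip()  # assumed: hex digest only

    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(65536), b""):
        digest.update(chunk)         # hash the raw bytes; nothing is decoded
    if digest.hexdigest() != expected:
        return None

    stream.seek(body_start)          # second pass starts again at BOB
    return [line.decode(encoding).rstrip("\r\n") for line in stream]

body = "line one\nline two\n".encode("utf-8")
header = hashlib.sha256(body).hexdigest().encode("ascii") + b"\n"
print(verify_and_read(io.BytesIO(header + body)))  # ['line one', 'line two']
```

Because the first pass works purely on bytes, the position of the end of the first line is known exactly, which is the piece of information StreamReader does not expose.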

B)
1. Read the entire file using StreamReader
2. Discard the first line/header
3. Assuming you know the encoding that you just read, and that encoding the string to bytes again will reproduce the original bytes, re-encode the string in the original encoding and compute the hash over those bytes. (This approach will not detect a mismatch if different byte contents of the file decode to the same string.)
4. If the expected hash matches the actual hash, keep the string. Otherwise, discard it.
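
A sketch of method B under the same assumptions as before (Python for illustration, header line assumed to hold only a hex SHA-256 digest of the body, `verify_text` a made-up name). The text is assumed to have been read without any newline translation, otherwise the re-encoded bytes will not match the file:

```python
import hashlib

def verify_text(text: str, encoding: str = "utf-8"):
    """Return the body string if its re-encoded hash matches the header, else None."""
    header, _, body = text.partition("\n")   # discard the first line
    expected = header.strip()                # assumed: hex digest only
    actual = hashlib.sha256(body.encode(encoding)).hexdigest()
    return body if actual == expected else None

body = "line one\nline two\n"
text = hashlib.sha256(body.encode("utf-8")).hexdigest() + "\n" + body
print(verify_text(text) == body)  # True

# The caveat from step 3: with a decoder that substitutes a replacement
# character for invalid input, different bytes can decode to the same string.
print(b"\xff".decode("utf-8", errors="replace") ==
      b"\xfe".decode("utf-8", errors="replace"))  # True
```

The second print shows why the round trip is weaker than hashing the raw bytes: once two different byte sequences have decoded to the same string, no hash of the re-encoded string can tell them apart.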

At this point, I'm thinking the best approach might be to open the file (not using StreamReader) and manually parse out the first line and extract the hash code, and then run the hash on the rest of the file. Then close the file and reopen it using StreamReader. That's not ideal but pretty straightforward.

Method A will open the file twice, sure, but method B will allocate memory for the entire string, which, if it is a large file, might not be what you're looking for. If your hash algorithm is operating on a byte array and not a stream, however, there might not be many disadvantages to method B, as you'll have just read all those bytes into memory for method A anyway.


Thanks!

Jonathan


Cheers,

-- Maxwell