Re: How to read html files AS IS. Encoding seems to change the characters.

Zoro wrote:
Thanks again Goran for your help.

You are writing it back as UTF-8, as you are not specifying any encoding
in the WriteAllText method call.

It looks like I may be able to do it with a string after all.
It looks like the problem with my test was, like you suggested, that
I didn't specify the write encoding. When I do, as long as I use the
same encoding when reading and writing, it works with all 3 encodings you
have suggested (but not with any of the built-in ones, e.g. UTF-n!).

Then you have successfully decoded the file into text, as you are not losing any characters.

If you save the file using UTF-8, all the characters will still be there, as strings are Unicode and UTF-8 can store any Unicode character. The reason your test did not succeed with the Unicode encodings is that the utility you are using doesn't support Unicode. You would need the "Pro" version for that.
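To make that concrete, here is a minimal sketch of the round trip. It is written in Python rather than with the .NET calls discussed in this thread, and the sample text and the choice of Windows-1252 (cp1252) are assumptions for illustration only, not the encodings actually used in the test.

```python
# Illustrative bytes, as they might appear in a Windows-1252 encoded HTML file.
original_bytes = "<p>Göran café €5</p>".encode("cp1252")

# Decoding with the matching encoding gives a Unicode string with nothing lost.
text = original_bytes.decode("cp1252")

# Writing with the same encoding reproduces the original bytes exactly.
same_encoding_bytes = text.encode("cp1252")

# UTF-8 produces different bytes, but every character is still there.
utf8_bytes = text.encode("utf-8")
utf8_round_trip = utf8_bytes.decode("utf-8")
```

The bytes only match the original file when the same encoding is used in both directions; the text itself survives either way.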

I am still not clear on how it's going to work away from the test,
in the database situation, but I am HOPING it would work as
follows:
1. I will use one of these encodings to read the file
2. then store the string in an nvarchar field and add a note informing
users of the encoding I used
3. specify the same encoding when creating the file, after reading the
string from the db.

Do you think this would work?
Thanks again,
zoro.


As you successfully decoded the file to a string, you can store that in an nvarchar/ntext field and you are done. You can also store the encoding used if you want to recreate the file exactly, but you can create a file using any encoding that supports the characters in the text.
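A rough sketch of that store-and-recreate flow, again in Python and under stated assumptions: an in-memory SQLite TEXT column stands in for the nvarchar field, and the table name, column names, sample text and cp1252 encoding are all hypothetical.

```python
import sqlite3

source_encoding = "cp1252"  # assumed encoding of the incoming file
# Stand-in for the bytes read from the HTML file on disk.
file_bytes = "<p>Smörgåsbord</p>".encode(source_encoding)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT, source_encoding TEXT)"
)

# Store the decoded text together with a note of the encoding it came from.
conn.execute(
    "INSERT INTO pages (body, source_encoding) VALUES (?, ?)",
    (file_bytes.decode(source_encoding), source_encoding),
)

body, stored_encoding = conn.execute(
    "SELECT body, source_encoding FROM pages WHERE id = 1"
).fetchone()
conn.close()

# Re-encoding with the stored encoding recreates the file byte for byte.
recreated_bytes = body.encode(stored_encoding)

# Any other encoding that covers the characters also works, with different bytes.
utf8_bytes = body.encode("utf-8")
```

The stored encoding name only matters when the output has to match the original file byte for byte.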

One advantage of specifying the UTF-8 encoding (Encoding.UTF8) when you write the file is that it places a BOM (byte order mark) at the beginning of the file, which can be used to identify the encoding used. If you use File.ReadAllText to read a file that starts with a BOM, it detects the encoding from the BOM and reads the file correctly, even if you specify a completely different encoding.
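The idea behind BOM detection can be sketched like this. This is Python's codecs module, not the logic inside File.ReadAllText, and the helper names detect_bom and decode_with_fallback are hypothetical.

```python
import codecs

def detect_bom(data: bytes):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Prefer the encoding announced by a BOM; otherwise use the caller's choice."""
    return data.decode(detect_bom(data) or fallback)

with_bom = "Göran".encode("utf-8-sig")   # UTF-8 bytes preceded by a BOM
without_bom = "Göran".encode("cp1252")   # no BOM, so the fallback decides
```

Here decode_with_fallback(with_bom, "cp1252") ignores the requested cp1252 and decodes as UTF-8 because of the BOM, while the file without a BOM is decoded with whatever the caller asked for.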

--
Göran Andersson
_____
http://www.guffa.com