Re: How to read html files AS IS. Encoding seems to change the characters.



Zoro wrote:
My task is to read html files from disk and save them into a SQL Server
database field. I have created an nvarchar(max) field to hold them.
The problem is that some characters, particularly html entities, and
French/German special characters are lost and/or replaced by a
question mark.
This is really frustrating. I have tried using StreamReader with ALL
the encodings available and none work correctly.

Have you really tried ALL the available encodings, or just the ones that are predefined in the Encoding class? I counted 140 supported encodings in the documentation, of which 36 are supported on all systems. You can create an Encoding object for any of them.

Examples:

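// Requires: using System.Text;
Encoding windows1252 = Encoding.GetEncoding(1252);        // Western European (Windows)
Encoding dosWesternEuropean = Encoding.GetEncoding(850);  // Western European (DOS)
Encoding iso8859_1 = Encoding.GetEncoding(28591);         // Western European (ISO-8859-1)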

ISO-8859-1 is the default encoding for the MIME type text/html. That is the content type usually used for web pages, so my first guess would be that the files are encoded that way.
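
A minimal sketch of reading a file with an explicitly chosen encoding, assuming the files really are ISO-8859-1 (code page 28591). The class and method names here are made up for illustration; only Encoding.GetEncoding and File.ReadAllText come from the framework.

```csharp
using System.IO;
using System.Text;

public static class HtmlFileReader
{
    // Reads the whole file and decodes its bytes with the given code page.
    // 28591 is ISO-8859-1 (an assumption about the files); pass another
    // code page, such as 1252 or 850, if the files turn out to differ.
    public static string ReadHtml(string path, int codePage = 28591)
    {
        Encoding encoding = Encoding.GetEncoding(codePage);
        return File.ReadAllText(path, encoding);
    }
}
```

If the accented characters come out right with one code page and wrong with another, that is a strong hint about how the files were written.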

Each encoding handles
some characters but loses others. I also tried reading into byte
array, but as soon as I converted the array to string the encoding
ruined the text.
Maybe the solution is not to convert to a string? But then how will I
save it to the database?

Is there a way to get this html text AS IS - with no encoding and no
changes into the database?

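A file on disk is not text at all, only bytes, so there is no such thing as reading it "with no encoding". The only way to turn those bytes into a string is to decode them, and the result is correct only if you decode with the same encoding that was used to write the file.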
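
To illustrate, here is a small sketch that decodes one and the same byte with three different encodings. The class name and the sample byte are chosen for illustration only; 0xE9 is how ISO-8859-1 stores the letter "é".

```csharp
using System.Text;

public static class DecodingDemo
{
    // A single byte, 0xE9: the letter "é" as written by ISO-8859-1.
    public static readonly byte[] Sample = { 0xE9 };

    // Decoded with the encoding it was written in: gives "é".
    public static string AsLatin1()
    {
        return Encoding.GetEncoding(28591).GetString(Sample);
    }

    // 0xE9 on its own is not a valid UTF-8 sequence, so the decoder
    // substitutes the replacement character U+FFFD.
    public static string AsUtf8()
    {
        return Encoding.UTF8.GetString(Sample);
    }

    // ASCII only covers 0x00-0x7F, so the byte is replaced by "?".
    public static string AsAscii()
    {
        return Encoding.ASCII.GetString(Sample);
    }
}
```

The bytes never change; only the interpretation does. A question mark or a replacement character in the result usually means the wrong encoding was used for decoding.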

I could do it in Delphi and many other pre-.NET languages, there must be a way
in C# too - surely?!

Of course there is. You just have to find out what encoding was used to create the file, then you can decode it.
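
If you cannot find out from whoever produced the files, one possible heuristic is sketched below. It is only a guess-based approach under stated assumptions, not a reliable detector: it checks for a byte order mark, then tries strict UTF-8, and finally falls back to ISO-8859-1. The class and method names are hypothetical, the fallback choice is an assumption, and it does not look at any charset declaration inside the HTML itself or handle UTF-32.

```csharp
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: decodes raw file bytes using a simple heuristic.
    public static string Decode(byte[] data)
    {
        // UTF-8 byte order mark: EF BB BF
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8.GetString(data, 3, data.Length - 3);

        // UTF-16 little-endian byte order mark: FF FE
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode.GetString(data, 2, data.Length - 2);

        // UTF-16 big-endian byte order mark: FE FF
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString(data, 2, data.Length - 2);

        try
        {
            // Strict UTF-8: throws on invalid byte sequences
            // instead of silently substituting characters.
            return new UTF8Encoding(false, true).GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // Assumed fallback: ISO-8859-1 maps every byte to a character,
            // so this never fails, but it may still be the wrong guess.
            return Encoding.GetEncoding(28591).GetString(data);
        }
    }
}
```

Read the file with File.ReadAllBytes and pass the array to Decode. Because the fallback accepts any byte sequence, check a few files with accented characters by eye before trusting the result.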

--
Göran Andersson
_____
http://www.guffa.com