Re: How to read html files AS IS. Encoding seems to change the characters.
- From: Göran Andersson <guffa@xxxxxxxxx>
- Date: Sat, 31 Mar 2007 22:21:04 +0200
Zoro wrote:
My task is to read html files from disk and save them onto SQL Server
database field. I have created an nvarchar(max) field to hold them.
The problem is that some characters, particularly html entities, and
French/German special characters are lost and/or replaced by a
question mark.
This is really frustrating. I have tried using StreamReader with ALL
the encodings available and none work correctly.
Have you really tried ALL available encodings, or just the ones that are predefined as properties of the Encoding class? I counted 140 supported encodings in the documentation, of which 36 are supported on all systems. You can create an Encoding object for any of them.
Examples:
Encoding windows1252 = Encoding.GetEncoding(1252);
Encoding dosWesternEuropean = Encoding.GetEncoding(850);
Encoding iso8859_1 = Encoding.GetEncoding(28591);
ISO-8859-1 is the default encoding for the MIME type text/html. That is the content type usually used for web pages, so my first guess would be that the files are encoded that way.
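As a minimal sketch of how one of those Encoding objects could be used to read a file: the class name HtmlFileReader and the method ReadHtml below are made up for illustration and are not from the original post. Note also that on newer .NET versions (Core and later) code pages such as 1252 and 850 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), which is why the usage line sticks to 28591.

```csharp
using System;
using System.IO;
using System.Text;

static class HtmlFileReader
{
    // Hypothetical helper: reads the whole file and decodes its bytes
    // with exactly the encoding that is passed in.
    public static string ReadHtml(string path, Encoding encoding)
    {
        // The third argument (false) turns off byte order mark detection,
        // so the reader does not silently switch to another encoding.
        using (StreamReader reader = new StreamReader(path, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Usage, with a made-up path: string html = HtmlFileReader.ReadHtml(@"C:\pages\index.html", Encoding.GetEncoding(28591));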
Each encoding handles
some characters but loses others. I also tried reading into a byte
array, but as soon as I converted the array to a string the encoding
ruined the text.
Maybe the solution is not to convert to a string? But then how will I
save it to the database?
Is there a way to get this html text AS IS - with no encoding and no
changes into the database?
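No. The file on disk is not text at all, only binary data, so there is no way of turning it into a string without decoding it with some encoding. The characters only survive if that is the same encoding that was used to write the file.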
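To illustrate where the question marks come from, here is a small sketch that decodes the same four bytes in three different ways. The class and method names are invented for the example, and the byte values are just sample data ("caf" followed by 0xE9, which is é in ISO-8859-1).

```csharp
using System;
using System.Text;

static class DecodingDemo
{
    // Hypothetical helper: decodes the same bytes with three encodings
    // to show that the result depends entirely on the encoding chosen.
    public static string[] DecodeEachWay(byte[] data)
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
        return new string[]
        {
            latin1.GetString(data),          // every byte maps to one character
            Encoding.ASCII.GetString(data),  // bytes above 0x7F are replaced by '?'
            Encoding.UTF8.GetString(data)    // invalid sequences become a replacement character
        };
    }
}
```

Calling DecodingDemo.DecodeEachWay(new byte[] { 0x63, 0x61, 0x66, 0xE9 }) gives the intended text only for the first entry; the other two have already lost the last character before anything is sent to the database.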
I could do it in Delphi and many other pre-.NET languages, there must be a way
in C# too - surely?!
Of course there is. You just have to find out what encoding was used to create the file, then you can decode it.
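One possible way of finding out the encoding, sketched under some assumptions: look for a byte order mark first, then for a charset declaration in the first part of the file, and fall back to ISO-8859-1. The class name HtmlEncodingGuesser, the 2048 byte limit and the regular expression are my own choices for the example, not an official detection algorithm, and it does not distinguish UTF-32 from UTF-16.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class HtmlEncodingGuesser
{
    // Hypothetical helper: makes a best-effort guess at the encoding of
    // an HTML file from its raw bytes.
    public static Encoding Guess(byte[] data)
    {
        // 1. Byte order marks for UTF-8 and UTF-16.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian

        // 2. Look for charset=... near the top of the file. ISO-8859-1 maps
        //    every byte to one character, so it is safe for this scan.
        Encoding latin1 = Encoding.GetEncoding(28591);
        string head = latin1.GetString(data, 0, Math.Min(data.Length, 2048));
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);
        if (m.Success)
        {
            try { return Encoding.GetEncoding(m.Groups[1].Value); }
            catch (ArgumentException) { }     // unknown name, use the fallback
            catch (NotSupportedException) { } // not available on this platform
        }

        // 3. Fallback: the default mentioned above for text/html.
        return latin1;
    }
}
```

The guess can then be handed to a StreamReader, or used directly as Guess(bytes).GetString(bytes). If the files come from different sources it is still worth checking a few results by eye, as a wrong or missing declaration will give a wrong guess.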
--
Göran Andersson
_____
http://www.guffa.com