Re: Read UTF8 (mixed byte) file & convert to Unicode


<"=?Utf-8?B?aHVudGVyYg==?=" <Hunter
Beanland@xxxxxxxxxxxxxxxxxxxxxxxxx>> wrote:
> I have a file which has no BOM and contains mostly single byte chars. There
> are numerous double byte chars (Japanese) which appear throughout. I need to
> take the resulting Unicode and store it in a DB and display it onscreen. No
> matter which way I open the file, convert it to Unicode/leave it as is or
> whatever, I see all single bytes ok, but double bytes become 2 separate
> single bytes. Surely there is an easy way to convert these mixed bytes to
> Unicode? Below are 2 (of many) attempts at doing the conversion. I was
> expecting that Encoding.Convert would be able to do this. My HTML charset,
> session codepage, locale, thread culture are all set correctly for Japanese.
> (reading Japanese from a unicode file works).
>
> Attempt 1:
> Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
> FileAccess.Read, FileShare.None)
> Dim bytUTF8(Fs.Length) As Byte
> Fs.Read(bytUTF8, 0, bytUTF8.Length)
> bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
> Response.Write(Encoding.Unicode.GetString(bytUni))
>
> Attempt 2:
> reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
> System.Text.Encoding.UTF8, True)
> bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
> bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
> lblMessage.Text = Encoding.Unicode.GetString(bytUni)
>
> In ASP3 I had to pass the text through ADO to do the conversion which was
> very ugly to do - surely that is not required now?

No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, converting those to
UTF-16 (Unicode) bytes, and finally decoding them into a string again.
That round trip gains you nothing - reader.ReadToEnd() is already giving
you a string, so just use that string!
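
As a minimal sketch of "just use that string" (the helper name
ReadAllTextAs is made up for illustration, and the Using statement
assumes VB 2005 or later), the whole read can be reduced to decoding the
file once with the encoding you believe it is in:

```vb
Imports System.IO
Imports System.Text

Module ReadTextExample
    ' Hypothetical helper: reads the whole file as text, decoding it
    ' once with the given encoding. No byte arrays are involved.
    Function ReadAllTextAs(ByVal path As String, ByVal enc As Encoding) As String
        Using reader As New StreamReader(path, enc)
            ' ReadToEnd already returns a .NET (UTF-16) String.
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

In the page from the question this would be used along the lines of
lblMessage.Text = ReadAllTextAs(Page.MapPath("/mixed_byte-jp.html"), Encoding.UTF8),
with no Encoding.Convert call at all.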

Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so if yours
really are 2 bytes each you may actually have a Shift-JIS file instead.
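
One way to check the byte counts is sketched below. It assumes the .NET
Framework, where the "shift_jis" encoding is available out of the box;
the sample character (hiragana U+3042) is just an arbitrary example.

```vb
Imports System
Imports System.Text

Module ByteCountExample
    Sub Main()
        ' Hiragana "a" (U+3042), built from its code point so that the
        ' encoding of this source file does not matter.
        Dim sample As String = ChrW(&H3042)

        ' Assumption: running on the .NET Framework. On .NET Core and
        ' later, the code-page encodings must be registered first via
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Dim shiftJis As Encoding = Encoding.GetEncoding("shift_jis")

        Console.WriteLine(Encoding.UTF8.GetByteCount(sample))  ' 3
        Console.WriteLine(shiftJis.GetByteCount(sample))       ' 2
    End Sub
End Module
```

If the file does turn out to be Shift-JIS, passing
Encoding.GetEncoding("shift_jis") to the StreamReader constructor instead
of Encoding.UTF8 should decode it correctly.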

You should note that your first attempt isn't guaranteed to read the
whole file, by the way - see
http://www.pobox.com/~skeet/csharp/readbinary.html
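
A rough sketch of a read loop that does get the whole file follows (the
name ReadWholeFile is hypothetical, and files over 2 GB are not handled).
Stream.Read may return fewer bytes than requested, so the call is
repeated until the buffer is full:

```vb
Imports System.IO

Module ReadFullyExample
    ' Hypothetical helper: reads every byte of the file, looping because
    ' a single Stream.Read call may return fewer bytes than asked for.
    Function ReadWholeFile(ByVal path As String) As Byte()
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read)
            ' VB array declarations take the upper bound, not the length,
            ' so Length - 1 gives exactly Length elements.
            Dim data(CInt(fs.Length) - 1) As Byte
            Dim offset As Integer = 0
            While offset < data.Length
                Dim bytesRead As Integer = fs.Read(data, offset, data.Length - offset)
                If bytesRead = 0 Then
                    Throw New EndOfStreamException("File ended before all bytes were read.")
                End If
                offset += bytesRead
            End While
            Return data
        End Using
    End Function
End Module
```

Note that Dim bytUTF8(Fs.Length) in the first attempt allocates one
element more than the file length, which leaves a trailing zero byte. On
.NET 2.0 and later, File.ReadAllBytes covers the same ground in one call.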

For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/debuggingunicode.html


--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too