Re: Unicode/UTF-8 decoding




Bill Nguyen wrote:
I set UTF-8 as the default encoding in MySQL.
I don't really know how this works, but IE or Firefox can decode the text easily.
This is the test:
I put the lines below in an HTML document and viewed it in IE, and it worked (make sure to set the encoding to UTF-8 under View).
I have included test.htm for your testing. (The text is in Vietnamese.)
So I think what I need is a utility that performs the same decoding; one might already be available out there. Any help is greatly appreciated.

Bill

----------------
<html>

<head></head>

<body>



Virginia Hamilton Adair / Lâm Thị Mỹ Dạ
Lấp lánh há»"n thÆ¡ Việt trên sân ga Tokyo chiều cuối năm


</body>

</html>



"Göran Andersson" <guffa@xxxxxxxxx> wrote in message news:%23rhR3M0pHHA.1776@xxxxxxxxxxxxxxxxxxxxxxx
Bill Nguyen wrote:
Below is some text I extracted from a MySQL database. How can I decode it so that I can read it as Unicode?
Thanks

Bill

------------

Virginia Hamilton Adair / Lâm Thị Mỹ Dạ
Lấp lánh hồn thơ Việt trên sân ga Tokyo chiều cuối năm

This text looks as if it has been decoded with a different encoding than the one used to encode it. It might be possible to recreate the data if you know which encodings were used to encode and decode it. Then you might be able to encode it back to its previous state and use the proper encoding to decode it. There is a great risk that some data has been lost, though, and that you can't recreate the original data from this stage.
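As a rough illustration of "encode it back and decode it properly", here is a minimal Python sketch. It assumes the wrong encoding was Windows-1252 (cp1252); the thread never states which encoding was actually involved, so treat that as a guess. The function name repair_mojibake is made up for this example.

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Undo a wrong decode: turn the text back into the bytes it came
    from, then decode those bytes as UTF-8.

    wrong_encoding is an assumption; cp1252 is only a common culprit.
    Raises UnicodeError if the bytes cannot be recovered.
    """
    raw_bytes = garbled.encode(wrong_encoding)  # back to the previous state
    return raw_bytes.decode("utf-8")            # decode with the proper encoding


# Reproduce the kind of garbling seen in the sample, then reverse it.
original = "Lâm Thị Mỹ Dạ"
garbled = original.encode("utf-8").decode("cp1252")  # the wrong decode
repaired = repair_mojibake(garbled)

print(garbled)
print(repaired)
```

This only works when every byte survived the wrong decode; if any were dropped or replaced along the way, the repair raises an error or yields damaged text.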

If you want to store Unicode strings in the MySQL database, it has to be set up to use a Unicode character set.

--
Göran Andersson
_____
http://www.guffa.com

------------------------------------------------------------------------

> Virginia Hamilton Adair / Lâm Thị Mỹ Dạ
> Lấp lánh hồn thơ Việt trên sân ga Tokyo chiều cuối năm

You are doing exactly what I was talking about. If you read the data using the wrong encoding and then save it using the same encoding, you can then open it using the correct encoding, provided that the process hasn't removed any data.

If you have set up your MySQL database to use Unicode and still get the string out in that manner, the error occurred before you even saved the string in the database in the first place. What you have done is basically:

unicode -> bytes -> wrong encoding -> MySQL -> wrong encoding -> html -> bytes -> browser -> unicode

While this gives the correct result for some strings, some byte values used in UTF-8 don't represent a single character by themselves, so if you continue to store mis-decoded strings as Unicode, you will sooner or later experience corrupted strings.
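The following Python sketch shows how that corruption can arise. It again assumes Windows-1252 as the wrong encoding, which the thread does not confirm; round_trip is a hypothetical helper name. In Python's cp1252 codec the byte 0x81 has no character assigned, and the UTF-8 encoding of "ề" (as in "chiều") ends in exactly that byte.

```python
def round_trip(text: str, wrong_encoding: str = "cp1252") -> str:
    """Simulate the lossy pipeline: UTF-8 bytes are decoded with the
    wrong encoding, stored, re-encoded the same way, then read as UTF-8.
    errors="replace" stands in for software that silently substitutes
    characters it cannot map (an assumption about the tools involved).
    """
    garbled = text.encode("utf-8").decode(wrong_encoding, errors="replace")
    stored = garbled.encode(wrong_encoding, errors="replace")
    return stored.decode("utf-8", errors="replace")


# "ơ" is the bytes C6 A1; both map to cp1252 characters, so it survives.
survives = round_trip("thơ")

# "ề" is the bytes E1 BB 81; 0x81 is unmapped, so the word is damaged.
damaged = round_trip("chiều")

print(survives == "thơ")
print(damaged == "chiều")

# A strict decode refuses outright instead of silently losing data.
try:
    "chiều".encode("utf-8").decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

Which characters break depends on the encoding pair and on how each tool handles unmappable bytes, so results with other software may differ from this sketch.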

--
Göran Andersson
_____
http://www.guffa.com