Re: UTF-8 encoding in AJAX web application.



Thanks for your reply Allan, and also thanks for Jon's input.

Actually, the knowledge needed to explain the problem here is specific to
string/text charsets and encodings.


Let me try to answer the questions:


#First, UTF-8, UTF-16 and UCS-2 are all encoding schemes for the Unicode
charset. In other words, a Unicode character string can be encoded into a
binary stream through any of them.
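
As a small illustration of that point, here is a sketch in Python (not the
.NET code from this thread). The helper name encode_all is made up for the
example. Python has no separate UCS-2 codec, but for characters in the Basic
Multilingual Plane the UCS-2 bytes are the same as the UTF-16 bytes.

```python
def encode_all(text: str) -> dict:
    """Encode one Unicode string with several Unicode encoding schemes.

    Hypothetical helper for illustration only. The "-le" codecs produce
    little-endian bytes without a byte order mark (preamble).
    """
    return {name: text.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}


encoded = encode_all("é")  # U+00E9, one Unicode character
for name, data in encoded.items():
    # Same character, different byte sequences.
    print(name, data.hex(" "))

# Every encoding decodes back to the same character string.
for name, data in encoded.items():
    assert data.decode(name) == "é"
```

The character itself never changes; only the bytes used to carry it do.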




Is this done by detecting the UTF-8 preamble? And then the driver converts to
UCS-2? And if so how come the result is still in UTF-8 when I retrieve the
data again?
Or is there no conversion?
======================
First, UTF-8 is the encoding that your web page and the client browser use
to transfer the Unicode characters. Because UTF-8 is a variable-length,
multi-byte encoding scheme, it gives a smaller payload and better transfer
performance when the data is mostly ASCII characters (UCS-2 always uses two
bytes per character, and UTF-16 uses at least two). By the time your .NET
code has received the Unicode string, it has already been converted to
UTF-16, because .NET always uses a two-byte Unicode encoding to represent
characters in memory. When you then use ADO.NET to submit string/character
data to the SQL Server database, it is this Unicode string that gets sent.
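
The conversion described above can be sketched as follows. This is Python
rather than .NET, and transcode_utf8_to_utf16 is a made-up name; it only
imitates what the framework does when it turns UTF-8 bytes from the wire
into a UTF-16 string in memory.

```python
def transcode_utf8_to_utf16(raw: bytes) -> bytes:
    """Decode UTF-8 bytes into characters, then re-encode them as UTF-16.

    Hypothetical helper: the decode step stands in for the web framework
    reading the request, the encode step for the in-memory representation.
    """
    text = raw.decode("utf-8")       # bytes from the browser -> characters
    return text.encode("utf-16-le")  # characters -> two-byte code units


# Mostly-ASCII data is smaller on the wire in UTF-8 than in UTF-16.
wire_bytes = "SELECT name FROM users".encode("utf-8")
memory_bytes = transcode_utf8_to_utf16(wire_bytes)
print(len(wire_bytes), len(memory_bytes))
```

For pure ASCII input the UTF-16 form is exactly twice as large, which is
the size argument for using UTF-8 between browser and server.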

On the SQL Server side, it simply receives the Unicode characters from the
client and stores them in the target table column. A problem may occur here
depending on the column's character type. If it is a Unicode type (e.g.
nvarchar, nchar, ntext ...), SQL Server can store the characters correctly
(persisted in UCS-2 encoding). If the column is not a Unicode character
type, SQL Server uses the column/table/database's current collation
(charset) to encode the Unicode characters into a binary stream, and
characters that this charset cannot represent can be lost.
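
To make the difference between the two column kinds concrete, here is a
rough simulation in Python. It does not talk to SQL Server at all: the
function names are invented, cp1252 is only an assumed example of a
collation's code page, and replacing unrepresentable characters with "?"
is how this sketch models the data loss.

```python
def store_in_varchar(text: str, code_page: str = "cp1252") -> str:
    """Simulate a non-Unicode column: encode with the collation's code page.

    Assumption: code page 1252 as the example charset. Characters the
    code page cannot represent are replaced with "?" and cannot be recovered.
    """
    stored = text.encode(code_page, errors="replace")
    return stored.decode(code_page)


def store_in_nvarchar(text: str) -> str:
    """Simulate a Unicode column: two-byte code units, nothing is lost."""
    stored = text.encode("utf-16-le")
    return stored.decode("utf-16-le")


print(store_in_varchar("café"))     # representable in cp1252, survives
print(store_in_varchar("日本語"))    # not representable, comes back as ???
print(store_in_nvarchar("日本語"))   # survives unchanged
```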







Why is it important that MSSQL only supports UCS-2 unicode if everything
works fine with UTF-8?
I can see that everything works fine when storing a UTF-8 string in an
ntext column, and when I query the data in Query Analyzer the string is
displayed correctly in the result set, how can this be if MSSQL only
supports UCS-2 encoding?
How is the string stored? in UTF-8 or UCS-2?
===========================
UTF-8 is compact when the data mainly contains non-wide chars (ASCII
chars), but it is less efficient to process than UCS-2 because it uses a
different number of bytes for different characters, while UCS-2 always uses
two bytes per character (UTF-16 does too, except for supplementary
characters, which take four). SQL Server 2000 is also an older product, and
UCS-2 was the preferred encoding at that time. In any case, there is no
right or wrong answer on which encoding SQL Server should use here.
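
The width difference is easy to check per character. The following Python
sketch uses a made-up helper, bytes_per_char, and also shows the one nuance
worth knowing: characters outside the Basic Multilingual Plane need four
bytes (a surrogate pair) in UTF-16, and UCS-2 cannot represent them as a
single character.

```python
def bytes_per_char(ch: str, encoding: str) -> int:
    """Return how many bytes one character takes in the given encoding.

    Hypothetical helper for illustration only.
    """
    return len(ch.encode(encoding))


for ch in ("A", "é", "€", "😀"):
    # UTF-8 width varies from 1 to 4 bytes; UTF-16 is 2 bytes except for
    # supplementary characters such as the emoji, which take 4.
    print(repr(ch), bytes_per_char(ch, "utf-8"), bytes_per_char(ch, "utf-16-le"))
```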

Therefore, any Unicode text column is persisted in UCS-2 encoding (in
memory or in the data file). When a client application queries the data, it
can be retrieved and processed correctly as long as the client application
supports Unicode. For example, the .NET Framework handles Unicode
characters correctly and stores them in memory in UTF-16 encoded form.
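
This is presumably also why the data looks like it is "still UTF-8" when
you read it back: the characters are what is preserved, and the page
encodes them as UTF-8 again on output. A minimal Python sketch of the whole
round trip, with an invented function name and UTF-16 little-endian bytes
standing in for the stored form:

```python
def round_trip(utf8_request: bytes) -> bytes:
    """Follow a string from browser to storage and back to the browser.

    Hypothetical sketch: each step imitates one hop described above.
    """
    text = utf8_request.decode("utf-8")     # browser -> application
    stored = text.encode("utf-16-le")       # application -> Unicode column
    loaded = stored.decode("utf-16-le")     # query result -> application
    return loaded.encode("utf-8")           # application -> browser


original = "Ærø 日本語".encode("utf-8")
# The bytes sent back to the browser match the bytes it originally sent.
assert round_trip(original) == original
```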



In addition, here is a good reference on the MS globaldev site that
introduces charsets and encodings:

#Globalization Step-by-Step
http://www.microsoft.com/globaldev/getWR/steps/wrg_codepage.mspx

I'm not sure whether I've missed anything from your earlier reply. If you
have any further specific questions on this, please feel free to post here.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead


This posting is provided "AS IS" with no warranties, and confers no rights.




