Re: DB2 UTF-8 ODBC double conversion
- From: "Mihai N." <nmihai_year_2000@xxxxxxxxx>
- Date: Tue, 24 Nov 2009 00:49:58 -0800
Essentially, from an *information* content perspective,
UTF-8 == UTF-16[LB]E == UTF-32[LB]E
Unicode considers the various UTF flavors completely equivalent.
They are just different encoding forms for the same thing.
Unicode itself is not 32-bit, or 8-bit, or anything else.
It is just a mapping from characters to numbers plus a collection of
character properties.
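
As a small illustration of that equivalence, here is a minimal Python sketch. The sample string is an arbitrary choice for this example, not something from the thread. The same code points are serialized in each UTF flavor, and every form decodes back to the identical string:

```python
# Sample text (hypothetical, chosen for illustration): an ASCII letter,
# a Latin-1 letter, a BMP CJK character, and U+20000, which lies outside the BMP.
text = "a\u00e9\u4e2d\U00020000"

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

# Encode the same sequence of code points in every UTF flavor.
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # The byte sequences differ, but each one decodes back to the same string.
    same = data.decode(name) == text
    print(f"{name:10} {len(data):2} bytes  round-trips: {same}")

# The code points themselves are independent of any encoding form.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
```

Only the byte layout changes between the forms; the information content (the list of code points) does not.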
If I'm interfacing to a database system, I need to know
what *its* native representation of a string is.
If it stores 8-bit strings, I have to
store a UTF-8 encoding if I want to retain the information content
of the original string.
Of course, this means that everyone who is using that database has
to understand that the strings are being stored in UTF-8.
I would argue that I should not have to care about the internal encoding
of the database.
The correct types to use are NCHAR, NVARCHAR and NTEXT.
The public API should take UTF-16 or UTF-32 or UTF-8 and document it.
Any conversion between the public API text representation and the internal
format should be transparent.
Also, the database should be aware that the stored text is Unicode, and not
just a bunch of bytes.
Because otherwise things like sorting (and functions like BETWEEN),
case-insensitive searching, functions like SUBSTRING and REPLACE, and LIKE
with % (zero or more characters) and _ (one character) will not do the
right thing.
Stuff can be moved around without awareness of what is in there, but one has
to be very careful about which operations are safe and which ones are not
(pretty much like storing UTF-8 in a CString).
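
A minimal Python sketch of the problem, using a made-up sample string: the same text is handled once as Unicode characters and once as raw UTF-8 bytes, the way a database would see it if it did not know the column holds Unicode.

```python
# Hypothetical sample: "naive" with a diaeresis, a space, two CJK characters.
text = "na\u00efve \u4e2d\u6587"
raw = text.encode("utf-8")  # what a byte-oriented store actually holds

# Length: characters vs. bytes.
char_len = len(text)
byte_len = len(raw)
print(char_len, byte_len)

# Substring: taking the first 3 characters is safe on the string...
print(text[:3])

# ...but taking the first 3 bytes cuts the 2-byte sequence for U+00EF in half.
try:
    raw[:3].decode("utf-8")
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False
print("byte-wise substring is valid UTF-8:", byte_cut_ok)

# Case-insensitive comparison: works on characters, fails on raw bytes,
# because byte-level lowercasing only knows about ASCII.
upper, lower = "\u00c9", "\u00e9"  # E-acute, upper and lower case
text_match = upper.lower() == lower
bytes_match = upper.encode("utf-8").lower() == lower.encode("utf-8")
print(text_match, bytes_match)
```

Moving `raw` around unchanged is harmless; it is the length, substring and case operations that need to know the bytes are UTF-8.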
So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320
bytes in length.
100% agree.
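
A quick way to check the sizing argument is to measure a few characters directly. This is only a sketch: the sample characters are arbitrary, and worst_case_bytes with its max_code_points parameter is a hypothetical helper, since the character count behind the 320-byte figure is not stated in this message.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Arbitrary samples: ASCII, Latin-1, BMP CJK, and a character beyond the BMP.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]
sizes = {f"U+{ord(ch):04X}": encoded_sizes(ch) for ch in samples}

for cp, per_encoding in sizes.items():
    print(cp, per_encoding)

def worst_case_bytes(max_code_points: int) -> int:
    """Hypothetical helper: bytes to reserve so that any sequence of
    max_code_points code points fits, in UTF-8 or in UTF-16."""
    # A code point beyond the BMP takes 4 bytes in UTF-8 and 4 bytes
    # (one surrogate pair) in UTF-16, so the worst case is the same.
    return 4 * max_code_points
```

For characters beyond the BMP all three forms need 4 bytes per code point, which is why the worst-case field size comes out the same for UTF-8 and for UTF-16 with surrogates.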
Note that it is not just dead languages like
Cuneiform that are in that region, so are you going to try to explain to
someone whose native language is expressed in the range > 65535 that
they can't use your database to represent as many characters as
someone who is using, say, American English?
To make it more real: characters beyond the BMP (Basic Multilingual Plane)
are required in order to support the GB-18030 Chinese National standard.
And the standard is enforced. If you want to sell your software in China,
you have to get a GB-18030 certification, or you don't sell it.
Also, national standards in Japan and Hong Kong require support for
characters above U+FFFF. Although these standards are not enforced like the
Chinese one, supporting them might give you an extra edge in a competitive
market.
So going beyond the BMP is not just about some extinct languages that only
a few archaeologists care about.
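
To make "beyond the BMP" concrete, here is a minimal Python sketch using U+20000, a CJK ideograph from Extension B (the choice of character is mine, for illustration only). It shows the surrogate pair that UTF-16 needs for it and that Python's gb18030 codec encodes it as a four-byte sequence:

```python
ch = "\U00020000"  # CJK Unified Ideographs Extension B, outside the BMP
cp = ord(ch)
print(f"U+{cp:X} is beyond the BMP: {cp > 0xFFFF}")

# UTF-16 represents it with a surrogate pair, computed by hand here.
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
print(f"surrogates: {high:04X} {low:04X}")

# The same pair, as produced by the standard codec.
utf16 = ch.encode("utf-16-be")
print(utf16.hex())

# GB 18030 also has a representation for it (four bytes), and it round-trips.
gb = ch.encode("gb18030")
print(len(gb), gb.decode("gb18030") == ch)
```

Code that treats one UTF-16 code unit as one character would see this ideograph as two separate "characters", which is exactly the kind of bug that support for these standards is meant to rule out.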
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email