Re: DB2 UTF-8 ODBC double conversion


Essentially, from an *information* content perspective,
UTF-8 == UTF-16[LB]E == UTF-32[LB]E

Unicode considers the various UTF flavors completely equivalent.
They are just different encoding forms of the same thing.
Unicode itself is not 32-bit, or 8-bit, or anything.
It is just a mapping from characters to numbers plus a collection of
character properties.
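
As a quick illustration of that equivalence, here is a minimal Python sketch. The sample string is my own choice, not something from the thread. It encodes the same text in five encoding forms and checks that each one decodes back to the identical sequence of code points.

```python
# One string, five encoding forms, identical information content.
# The sample text is arbitrary: Latin, CJK, and one supplementary-plane character.
text = "na\u00efve \u4e2d\u6587 \U00020000"

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # The byte sequences and their lengths differ between encoding forms...
    print(f"{name:10} {len(data):3} bytes  {data.hex()}")
    # ...but each one decodes back to exactly the same code points.
    assert data.decode(name) == text

# The code points themselves are independent of any encoding form.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)
```

Only the byte layout changes; the list of code points printed at the end is the same no matter which form was used for storage or transport.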


If I'm interfacing to a database system, I need to know
what *its* native representation of a string is.
If it stores 8-bit strings, I have to
store a UTF-8 encoding if I want to retain the information content
of the original string.
Of course, this means that everyone who is using that database has
to understand that the
strings are being stored in UTF-8.

I would argue that I should not have to care about the internal encoding
of the database.
The correct types used should be NCHAR, NVARCHAR and NTEXT.
The public API should take UTF-16 or UTF-32 or UTF-8 and document it.
Any conversion between the public API text representation and the internal
format should be transparent.

Also the database should be aware that text stored is Unicode, and not
just a bunch of bytes.
Because otherwise things like sorting (and functions like BETWEEN),
case-insensitive searching, and functions like SUBSTRING, REPLACE and LIKE
with % (zero or more characters) and _ (one character) will not do the right thing.


Stuff can be moved around without awareness of what is in there, but one has to
be very careful about which operations are safe and which ones are not
(pretty much like storing UTF-8 in a CString).
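
To make the "not just a bunch of bytes" point concrete, here is a small Python sketch. It does not use any database API; it only mimics what a byte-oriented engine would do to UTF-8 data, and the sample word is my own. Length, substring and case folding all go wrong at the byte level, while the same operations on the decoded text behave as expected.

```python
name = "Z\u00fcrich"          # "Zürich", 6 characters
raw = name.encode("utf-8")    # what a byte-oriented column would hold

# Length: the byte count is not the character count.
print(len(name), len(raw))

# Substring: cutting after 2 bytes splits the two-byte sequence for "ü",
# leaving data that is no longer valid UTF-8.
try:
    raw[:2].decode("utf-8")
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False
print("byte-level cut still valid UTF-8:", byte_cut_ok)

# Case-insensitive comparison: byte-level upper-casing only touches ASCII,
# so the "ü" is left alone and the result differs from real upper-casing.
byte_upper = raw.upper()
text_upper = name.upper().encode("utf-8")
print(byte_upper == text_upper)

# Copying the bytes around untouched is safe: the round trip is lossless.
assert raw.decode("utf-8") == name
```

Moving the bytes around unchanged is the safe operation; anything that counts, cuts or compares needs to know the encoding.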


So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320
bytes in length.

100% agree.
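
The arithmetic can be checked with a short Python sketch. The 80-character field length is an assumption on my part (320 / 4); the quoted text only gives the byte count. The worst case is a field where every character lies outside the BMP, which costs 4 bytes per character in both UTF-8 and UTF-16.

```python
# ASSUMPTION: an 80-character field. The thread only states the 320-byte figure.
FIELD_CHARS = 80

# Worst case: every character is outside the BMP (U+20000 used as a sample).
worst = "\U00020000" * FIELD_CHARS
utf8_bytes = len(worst.encode("utf-8"))
utf16_bytes = len(worst.encode("utf-16-le"))
print(utf8_bytes, utf16_bytes)

# For comparison, a field full of BMP-only CJK characters (U+4E2D):
bmp_only = "\u4e2d" * FIELD_CHARS
bmp_utf8_bytes = len(bmp_only.encode("utf-8"))
bmp_utf16_bytes = len(bmp_only.encode("utf-16-le"))
print(bmp_utf8_bytes, bmp_utf16_bytes)
```

The two forms differ for typical BMP text, but the worst case that the field has to be sized for is the same.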


Note that it is not just dead languages like
Cuneiform that are in that region, so are you going to try to explain to
someone whose native language is expressed in the range > 65535 that
they can't use your database to represent as many characters as
someone who is using, say, American English?

To make it more real: characters beyond BMP (Basic Multilingual Plane)
are required in order to support the GB-18030 Chinese National standard.
And the standard is enforced. If you want to sell your software in China,
you have to get a GB-18030 certification, or you don't sell it.

National standards in Japan and Hong Kong also require support for
characters above U+FFFF. Although these standards are not enforced like the
Chinese one, supporting them might give you an extra edge in a competitive
market.

So going beyond the BMP is not just about some extinct languages that only
a few archeologists care about.
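
For anyone who wants to see what "beyond BMP" means in practice, here is a minimal Python sketch. The sample character U+20000 (from CJK Unified Ideographs Extension B) is my own pick, and the `gb18030` codec is the one shipped with Python's standard library. It shows the surrogate pair that UTF-16 needs for the character, and that the character survives a round trip through GB 18030.

```python
ch = "\U00020000"   # U+20000, a Han character outside the BMP

# UTF-16 cannot hold it in one 16-bit unit; it takes a surrogate pair.
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"UTF-16 code units: {high:04X} {low:04X}")

# GB 18030 covers the same character with a four-byte sequence.
gb = ch.encode("gb18030")
print(f"GB 18030 bytes: {gb.hex()}")

# The round trip through GB 18030 is lossless.
assert gb.decode("gb18030") == ch
```

Code that treats one 16-bit unit as one character would see two "characters" here, which is exactly the kind of bug that shows up when testing against such requirements.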


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
