Re: get wide character and multibyte character value
- From: Zapanaz <http://joecosby.com/code/mail.pl>
- Date: Sun, 27 Jan 2008 14:17:34 -0800
On Sun, 27 Jan 2008 00:26:00 -0800, George
<George@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> It is what UTF-16 says, not what Windows says...
> How do you think characters that require more than 16 bits in UTF-16
> could be represented by a wide character on Windows, where wchar_t is
> defined as unsigned short (16 bits)? That is why I think Windows is
> limited in representing all UTF-16 characters with wchar_t (unsigned
> short).
> Any comments?
I wrote a long response to this but it was something like ten pages by
the time I got done with it ... I'll try for a shorter version.
When 16-bit Unicode was first being designed, they thought 16 bits
would be sufficient to represent all characters. It wasn't, so later
they reserved a range of the 16-bit values (0xD800-0xDFFF) for
surrogates, which are used in pairs to encode the characters that
don't fit in 16 bits. (I have spent time on the Unicode Consortium web
site, and the total number of different characters they represent is
incredible. I saw a Unicode block for "Royal Aramaic", a script used
in the Middle East thousands of years ago and AFAIK not used commonly
since. As Norbert put it, their goal is to represent every script "on
earth and beyond". There are probably Unicode definitions for Klingon
and the languages from The Lord of the Rings.)
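To make the surrogate mechanism concrete, here is a minimal sketch of how a code point above U+FFFF is split into two 16-bit units. The function name to_surrogates is made up for illustration (it is not a Windows or CRT function), and the code assumes a C++11 compiler rather than Visual C++ 6.0.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair. Hypothetical helper, for illustration only.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // BMP characters need no pair
    std::uint32_t v = cp - 0x10000;           // leaves a 20-bit value
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}
```

For example, U+1D11E (the musical G clef) comes out as the pair 0xD834, 0xDD1E, so it occupies two wchar_t values in a Windows wide string.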
As I understand it, Windows first used plain 16-bit values, then
later added support for surrogate pairs, so what you see could depend
on your version of Windows and your version of Visual Studio. I am
using a fairly old version of the compiler, 6.0, and as far as I know
the compiler doesn't know about surrogate pairs. To be honest, until
I get around to upgrading, I am going to live with that limitation in
my code. That's not ideal, but with nearly 2^16 values covered by
single 16-bit units I think the likelihood of problems is very low.
If my users have a lot of MP3s with Royal Aramaic song titles, I may
run into problems.
But if you want to be absolutely sure your application will work
correctly, you should account for surrogate pairs. As Giovanni and
David said, Windows definitely does use surrogate pairs.
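As a rough idea of what "accounting for surrogate pairs" means in practice, the sketch below counts characters (code points) in a UTF-16 buffer instead of counting 16-bit units. The name count_code_points is hypothetical, and std::uint16_t stands in for the 16-bit wchar_t on Windows so the code stays portable. Treating an unpaired surrogate as one character is a choice made here, not something the thread specifies.

```cpp
#include <cstddef>
#include <cstdint>

// Count code points in a UTF-16 buffer of n 16-bit units.
// A high surrogate followed by a low surrogate counts as ONE character.
// Hypothetical helper, for illustration only.
std::size_t count_code_points(const std::uint16_t* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        bool is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (is_high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate; the pair is a single code point
        }
        ++count;  // an unpaired surrogate is still counted as one unit
    }
    return count;
}
```

Code that assumes one wchar_t per character (truncating at a fixed length, reversing a string, indexing by character) goes wrong exactly where this count differs from the number of 16-bit units.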
UTF-8, on the other hand, was designed from the start to be extensible,
so if my code handles multibyte sequences (byte values >= 2^7)
correctly, it should work on anything.
Most of my code handles text as UTF-8. Anywhere I have to
interact with the OS libraries or the file system, though, I have to
handle 16-bit. If you are going to store data out of your app,
personally I would recommend storing it in UTF-8. All in all, it
seems to be the most widely accepted encoding.
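For reference, this is roughly how a single code point becomes UTF-8 bytes. It is a minimal sketch, not the code from my application: to_utf8 is a made-up name, and it does no validation (it does not reject surrogate values or anything above U+10FFFF). On Windows the real conversion from a wide string is normally done with WideCharToMultiByte and CP_UTF8.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Hypothetical helper; assumes cp is a valid code point, no checking.
std::string to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 7 bits: plain ASCII, one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // up to 11 bits: two bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // up to 16 bits: three bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // up to 21 bits: four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte of a multibyte sequence has its high bit set, which is why ASCII-only code keeps working on UTF-8 and why the interesting cases are the byte values of 0x80 and above.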
I suppose the simplest answer is to program in Java (no language flame
war intended :^) which handles text as Unicode throughout. (Java
strings are UTF-16 internally, so surrogate pairs still exist there,
but the standard library does the conversions for you.) It's pretty
rare you have to think about character encoding at all. But
personally I am leery of writing a commercial application in Java for
a number of reasons, so I am working with Windows.
For applications I am writing for myself or that will be used in-house
though, I would usually use Java instead.
--
Joe Cosby
http://joecosby.com/
I saw the show under unfortunate circumstances:
the curtain was up.
- George S. Kaufman
:: Currently listening to Buena Vista Social Club, 1997, by Buena Vista Social Club, from "Buena Vista Social Club"