Re: get wide character and multibyte character value




On Sun, 27 Jan 2008 00:26:00 -0800, George
<George@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

It is what UTF-16 says, not what Windows says...

How do you think characters that require more than 16 bits in UTF-16
could be represented by a wide character on Windows, where wchar_t is
defined as unsigned short (16 bits)? That is why I think Windows is limited
in representing all UTF-16 characters with wchar_t (unsigned short).

Any comments?



I wrote a long response to this but it was something like ten pages by
the time I got done with it ... I'll try for a shorter version.

When 16-bit Unicode was first being designed, they thought 16 bits
would be sufficient to represent all characters. It wasn't, so later
they reserved a block of the available 16-bit values to indicate
surrogate pairs. (I have spent time on the Unicode Consortium web
site, and the total number of different characters they represent is
incredible. I saw a Unicode block for "Imperial Aramaic", a script used
in the Middle East thousands of years ago and AFAIK not used commonly
since. As Norbert put it, their goal is to represent every script "on
earth and beyond". There have even been proposals for Klingon and for
the scripts from The Lord of the Rings, although as far as I know
neither has actually been encoded.)
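
To make the surrogate pair idea concrete, here is a minimal sketch of the
arithmetic, assuming a compiler recent enough to have <cstdint>. The
function names are made up for illustration and are not part of any
Windows API. A character above U+FFFF has 0x10000 subtracted from it, and
the remaining 20 bits are split across two 16-bit values taken from the
reserved ranges D800-DBFF and DC00-DFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in U+10000..U+10FFFF into a
// UTF-16 surrogate pair (high surrogate first, then low surrogate).
// The caller is assumed to pass a value in that range; nothing is validated.
std::pair<std::uint16_t, std::uint16_t> to_surrogate_pair(std::uint32_t cp)
{
    std::uint32_t v = cp - 0x10000;  // leaves a 20-bit value
    std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}

// Hypothetical helper: recombine a high/low surrogate into the code point.
std::uint32_t from_surrogate_pair(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1D11E (the musical G clef) comes out as the two 16-bit
values D834 and DD1E, which is why one character can occupy two wchar_t
slots on Windows.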

As I understand it, Windows first used plain 16-bit values, then
later added support for surrogate pairs, so what you see could depend
on your version of Windows and your version of Visual Studio. I am
using a fairly old version of the compiler, 6.0, and as far as I know
the compiler doesn't know about surrogate pairs. To be honest, until
I get around to upgrading, I am going to live with that limitation in
my code. That's not ideal, but with nearly 2^16 values covered I think
the likelihood of problems is very low. If my users have a lot of MP3s
with Imperial Aramaic song titles, I may run into problems.

But if you want to be absolutely sure your application will work
correctly, you should account for surrogate pairs. As Giovanni and
David said, Windows definitely does use surrogate pairs.
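
As a rough idea of what "accounting for surrogate pairs" means in
practice, here is a small sketch that counts characters (code points)
rather than 16-bit units. It assumes a C++11 compiler and uses char16_t
so that it behaves the same everywhere; on Windows the same logic applies
to wchar_t, which is 16 bits there. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates occupy D800..DBFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates occupy DC00..DFFF.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points in a UTF-16 string. A well-formed high/low pair counts
// once; an unpaired surrogate is counted as one (malformed) character
// rather than rejected.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

The same test matters anywhere a string is truncated, reversed, or
indexed by position, since cutting between the two halves of a pair
leaves an invalid string.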

UTF-8 on the other hand was designed from the start as a variable-length
encoding, so if my code works on byte values of 2^7 and above (that is,
on multi-byte sequences), it should work on anything.
Most of my code is handling text as UTF-8. Anywhere I have to
interact with the OS libraries or the file system though, I have to
handle 16-bit. If you are going to store data out of your app,
personally I would recommend storing it in UTF-8. All in all, it
seems to be the most widely accepted encoding for stored text.
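
For comparison, here is a minimal sketch of how a single code point is
written out as UTF-8, again assuming a compiler with <cstdint> and using
a made-up function name. Anything below 0x80 is one plain ASCII byte, and
everything else becomes a lead byte plus one to three continuation bytes,
all of which have the high bit set. Characters above U+FFFF simply take
four bytes, with no surrogate mechanism involved.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF and is not a surrogate value (D800..DFFF);
// this sketch does not validate its input.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

On Windows itself, the usual route between UTF-16 and UTF-8 is
WideCharToMultiByte and MultiByteToWideChar with CP_UTF8, rather than
hand-written code like this.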

I suppose the simplest answer is to program in Java (no language flame
war intended :^) which handles text as Unicode throughout. Its strings
are UTF-16 internally, so surrogate pairs exist there too, but it's
pretty rare you have to think about character encoding at all. But
personally I am leery of writing a commercial application in Java for
a number of reasons, so I am working with Windows.

For applications I am writing for myself or that will be used in-house
though, I would usually use Java instead.

--
Joe Cosby
http://joecosby.com/
I saw the show under unfortunate circumstances:
the curtain was up.
- George S. Kaufman

:: Currently listening to Buena Vista Social Club, 1997, by Buena Vista Social Club, from "Buena Vista Social Club"