Re: _MBCS

Jochen Kalmbach [MVP] wrote:
Hi Eugene!
You need to be careful with the word "use". Linux uses UTF-32 for
wchar_t and so do other platforms I mentioned (on Solaris this is so
only under UTF-8 locales). Linux (and others) also use UTF-8 as the
*MBCS* encoding because Unix history pretty much precludes any other
way to Unicode-ize applications and OS.

I thought wchar_t depends on the compiler, not on the OS...

It very much depends on the OS. You need to go to OS facilities to do
character conversions (unless your compiler bundles all conversion tables
with your code). In general all the locale-related functions that operate on
characters need to use the user's locale which binds you to the OS.

The language where char does not depend on the OS is called Java. The price
you pay for that is a JVM of about 50 MB that you need to bring onto your
machine to run it. A significant part of this goes into conversion tables and
other locale info.

And gcc uses "int"

It has to use 32 bits to store UTF-32. How the type is implemented is up to
the compiler. An "int" works sort of fine as long as it is 32-bit. A
conforming C++ compiler has to have a separate type (not a typedef) called
"wchar_t".

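To illustrate the "separate type" point, here is a minimal C++ sketch. The
helper names is_wide and wchar_bits are made up for illustration and are not
part of any library. The width is deliberately something you query rather
than assume, since it differs between platforms.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// In conforming C++, wchar_t is a built-in type of its own, not an alias
// for int or any other integer type, whatever its size happens to be.
static_assert(!std::is_same<wchar_t, int>::value, "wchar_t must not be int");
static_assert(!std::is_same<wchar_t, unsigned short>::value,
              "wchar_t must not be unsigned short");

// Hypothetical helpers: because the types are distinct, overload
// resolution can tell a wide character from a plain integer.
inline bool is_wide(int)     { return false; }
inline bool is_wide(wchar_t) { return true; }

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }
```

With this, is_wide(L'x') is true and is_wide(42) is false. A compiler mode
that makes wchar_t a typedef would fail the static_assert lines.
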
Have you a real example?

Sure. For example you may want to output a very long string and
truncate it with ... after a given length of characters.

This is exactly the example in which it does not matter. Here you
should separate on glyph boundaries and not on code-value boundaries.

In your GUI app -- certainly. On a server I don't have any reasonable way to
deal with glyphs, fonts and other similar things.

If it does not matter if the last character is wrong, then you can
split inside a character sequence.

And break all comparison and conversion routines out there ;-). You cannot
break inside a codepoint because doing so will produce an invalid string.
Such a string will not be accepted by pretty much anything else.
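
A rough sketch of the server-side truncation being discussed, for UTF-8. The
function name truncate_utf8 is hypothetical, and the code assumes the input
is already valid UTF-8. It counts code points, not glyphs, so it never cuts
in the middle of a multi-byte sequence but may still separate a base
character from a combining mark.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: keep at most max_cp code points of a UTF-8 string and
// append "..." if anything was cut off. Assumes `s` is valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_cp) {
    std::size_t i = 0;      // byte offset into s
    std::size_t count = 0;  // code points consumed so far
    while (i < s.size() && count < max_cp) {
        ++i;  // step over the lead byte
        // Step over continuation bytes, which all look like 10xxxxxx.
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
        ++count;
    }
    if (i == s.size()) return s;  // nothing was cut
    return s.substr(0, i) + "...";
}
```

For example, truncate_utf8("abcdef", 3) gives "abc...", and a string of
two-byte characters is cut only between characters, never between their
bytes.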

But this can also be done with UTF-16, because you can simply see if
you are in a surrogate pair...

Sure. Which brings us back to "smart" iteration, which we wanted to avoid in
the first place ;-) Not a huge improvement over MBCS.
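
For comparison, a sketch of what the surrogate check looks like in UTF-16.
The names truncate_utf16, is_high_surrogate and is_low_surrogate are
hypothetical, and char16_t/std::u16string stand in for whatever 16-bit string
type is actually in use. The extra test before cutting is the "smart" part.

```cpp
#include <cstddef>
#include <string>

// Surrogate ranges as defined by UTF-16.
inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: keep at most max_units UTF-16 code units, backing off
// by one unit if the cut would land between the halves of a surrogate pair.
std::u16string truncate_utf16(const std::u16string& s, std::size_t max_units) {
    if (s.size() <= max_units) return s;
    std::size_t cut = max_units;  // cut < s.size() here, so s[cut] is valid
    if (cut > 0 && is_high_surrogate(s[cut - 1]) && is_low_surrogate(s[cut])) {
        --cut;  // do not leave a dangling high surrogate at the end
    }
    return s.substr(0, cut);
}
```

Given u"a\U0001F600b" (four code units), a limit of 2 yields just u"a",
because cutting at 2 would split the pair.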

By the way, UTF-32 is a huge waste of memory...

So is UTF-16. ;-) Seriously, it all depends on what you are doing. If you
need fast iteration, UTF-32 is ideal. If you need to save on storage, UTF-8 is
the way to go. (And please let's not debate whether it will save space in a
corner case. In the practical cases I have to deal with, it is always smaller
than either UTF-16 or UTF-32. YMMV.)
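
To put numbers on the storage trade-off, here is a small sketch that counts
how many bytes a string of code points would take in each encoding. The names
EncodedSizes and encoded_sizes are hypothetical, and the input is assumed to
contain only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>

// Byte counts for the same text in the three encodings.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Hypothetical helper: sum the per-code-point cost in each encoding.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes n = {0, 0, 0};
    for (char32_t cp : text) {
        n.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        n.utf16 += cp < 0x10000 ? 2 : 4;  // 4 bytes means a surrogate pair
        n.utf32 += 4;                     // always one 32-bit unit
    }
    return n;
}
```

For U"hello" this gives 5, 10 and 20 bytes. The corner case is text above
U+07FF in the BMP: a single euro sign (U+20AC) takes 3 bytes in UTF-8 but 2
in UTF-16.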

--
Eugene
http://www.gershnik.com

