Re: _MBCS
- From: "Eugene Gershnik" <gershnik@xxxxxxxxxxx>
- Date: Thu, 23 Feb 2006 11:04:42 -0800
Jochen Kalmbach [MVP] wrote:
Hi Eugene!
You need to be careful with the word "use". Linux uses UTF-32 for
wchar_t and so do other platforms I mentioned (on Solaris this is so
only under UTF-8 locales). Linux (and others) also use UTF-8 as the
*MBCS* encoding because Unix history pretty much precludes any other
way to Unicode-ize applications and OS.
I thought, wchar_t depends on the compiler not on the OS...
It very much depends on the OS. You need to go to OS facilities to do
character conversions (unless your compiler bundles all conversion tables
with your code). In general all the locale-related functions that operate on
characters need to use the user's locale which binds you to the OS.
The language where char does not depend on the OS is called Java. The price
you pay for that is a JVM of roughly 50 MB that you need to bring onto your
machine to run it. A significant part of this goes into conversion tables and
other locale info.
And gcc uses "int"
It has to use 32 bits to store UTF-32. How the type is implemented is up to
the compiler. An "int" works sort of fine as long as it is 32-bit. A
conforming C++ compiler has to have a separate type (not a typedef) called
"wchar_t".
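
To illustrate the "separate type" point, here is a minimal sketch. The helper names (is_wide, wchar_bits, wchar_holds_utf32) are made up for this example, and it assumes a conforming compiler; a compiler mode that makes wchar_t a typedef would fail the static_assert.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// In standard C++ wchar_t is a built-in type of its own, not an alias for int.
static_assert(!std::is_same<wchar_t, int>::value,
              "wchar_t must be distinct from int");

// Because the types are distinct, overloads can tell them apart.
constexpr bool is_wide(wchar_t) { return true; }
constexpr bool is_wide(int)     { return false; }

// Width of wchar_t in bits on the current platform (platform-dependent).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// A Unicode code point needs 21 bits, so one wchar_t can hold any
// code point only if it is at least that wide.
constexpr bool wchar_holds_utf32() { return wchar_bits() >= 21; }
```

On a platform with a 32-bit wchar_t, wchar_holds_utf32() should be true; with a 16-bit wchar_t it is false.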
Have you a real example?
Sure. For example you may want to output a very long string and
truncate it with ... after a given length of characters.
This is exactly the example in which it does not matter. Here you
should separate on glyph boundaries and not on codevalue-boundaries.
In your GUI app -- certainly. On a server I don't have any reasonable way to
deal with glyphs, fonts and other similar things.
If it does not matter if the last character is wrong, then you can
split inside a character sequence.
And break all comparison and conversion routines out there ;-). You cannot
break inside an encoded code point because doing so will produce an invalid
string. Such a string will not be accepted by pretty much anything else.
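
A rough sketch of the truncation example for UTF-8, cutting only between code points. The function names are hypothetical, the input is assumed to be valid UTF-8, and this deliberately ignores glyphs and combining sequences.

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Keep at most max_code_points code points of s and append "..." if
// anything was cut off. Assumes s is valid UTF-8.
inline std::string truncate_utf8(const std::string& s,
                                 std::size_t max_code_points) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Every byte that is not a continuation byte starts a code point.
        if (!is_utf8_continuation(static_cast<unsigned char>(s[i]))) {
            if (count == max_code_points) {
                // i is the first byte of the first dropped code point,
                // so the prefix [0, i) is still valid UTF-8.
                return s.substr(0, i) + "...";
            }
            ++count;
        }
    }
    return s;  // nothing to cut
}
```

The cut always lands on a lead byte, so the result can be handed to other routines without producing an invalid sequence.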
But this can also be done with UTF-16, because you can simply see if
you are in a surrogate pair...
Sure. Which brings us back to "smart" iteration which we wanted to avoid in
the first place ;-) Not a huge improvement over MBCS.
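
For comparison, the surrogate check in UTF-16 might look like the sketch below. safe_utf16_cut is a name invented for this example, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// High (lead) surrogates are U+D800..U+DBFF, low (trail) are U+DC00..U+DFFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Return a cut position <= cut that does not fall between the two halves
// of a surrogate pair. Assumes s is well-formed UTF-16.
inline std::size_t safe_utf16_cut(const std::u16string& s, std::size_t cut) {
    if (cut >= s.size()) return s.size();
    // If the first dropped unit is a low surrogate and the last kept unit
    // is its high surrogate, move the cut back so the pair stays together.
    if (cut > 0 && is_low_surrogate(s[cut]) && is_high_surrogate(s[cut - 1])) {
        return cut - 1;
    }
    return cut;
}
```

The check is cheap, but every index into the string now has to go through it, which is the "smart" iteration again.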
By the way UTF32 is a huge waste of memory...
So is UTF-16. ;-) Seriously, it all depends on what you are doing. If you
need fast iteration, UTF-32 is ideal. If you need to save on storage, UTF-8 is
the way to go (and please let's not debate whether it saves anything in a
corner case; for the practical cases I have to deal with, it is always
smaller than either UTF-16 or UTF-32. YMMV).
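
If you want to measure this on your own data, a small sketch like the following adds up the encoded size of a text in each form. The struct and function names are invented for the example, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed in UTF-16: one 16-bit unit inside the BMP, two outside.
inline std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total storage in bytes for the same text in each encoding form.
inline EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes r{0, 0, 0};
    for (char32_t cp : text) {
        r.utf8  += utf8_bytes(cp);
        r.utf16 += utf16_bytes(cp);
        r.utf32 += 4;  // UTF-32 is always 4 bytes per code point
    }
    return r;
}
```

For example, encoded_sizes(U"abc") gives 3, 6 and 12 bytes for UTF-8, UTF-16 and UTF-32 respectively. The result depends entirely on the text you feed it.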
--
Eugene
http://www.gershnik.com