Re: wstring to ostream



On Thu, 21 Dec 2006 06:49:20 +0100, Howie Meier <howieh@xxxxxxx>
wrote:

I thought only wstring can store UTF-8. How can I convert the UTF-8?
Have you got an example, a link, etc.?

Thanks Mr.Asm.

Howie

There are different encodings for Unicode characters; UTF-8 and UTF-16
are two examples of Unicode character encodings.

In UTF-8 encoding, a Unicode character is stored in 1, 2, 3 or 4
bytes. The code unit in UTF-8 is 8 bits (1 byte), so a UTF-8 string
is a plain sequence of bytes. It is backward-compatible with ASCII:
the 128 ASCII characters are encoded as the same single bytes, while
everything outside ASCII (e.g. Italian accented vowels like è or é)
takes two or more bytes.
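
To make the 1-to-4-byte rule concrete, here is a minimal sketch of
encoding a single code point as UTF-8. EncodeUtf8 is a hypothetical
helper name chosen for illustration (it is not a standard or Win32
function), and it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the UTF-8 bytes for one code point.
std::string EncodeUtf8(std::uint32_t cp)
{
    // Assumption: the caller passes a valid scalar value
    // (at most U+10FFFF, and not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                 // ASCII: 1 byte, unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes (e.g. U+00E8, e-grave)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, EncodeUtf8(0x41) gives the single byte 'A', while
EncodeUtf8(0xE8) (è) gives the two bytes 0xC3 0xA8.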

In UTF-16 encoding, a Unicode character is stored in one or two
"code units"; a code unit in UTF-16 is always 16 bits (2 bytes). Most
characters in common use need only one UTF-16 code unit (i.e. 2
bytes), but characters above U+FFFF (some less common Chinese
characters, some musical symbols, characters from ancient alphabets,
etc.) are encoded as a so-called "surrogate pair" and need two code
units (2 x 16 bits).
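
As a rough illustration of how a surrogate pair is formed, the sketch
below encodes one code point as UTF-16 code units. EncodeUtf16 is a
hypothetical helper name, and it uses std::uint16_t instead of
wchar_t so that it does not depend on the platform's wchar_t size.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the UTF-16 code units for one code point.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    // Assumption: the caller passes a valid scalar value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Split the remaining 20 bits into a high and a low surrogate.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
    }
    return units;
}
```

For example, U+00E8 (è) stays a single unit 0x00E8, while the musical
symbol U+1D11E becomes the pair 0xD834 0xDD1E.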

So, while a UTF-8 encoded string is a sequence of chars (or CHARs in
Win32 SDK), a UTF-16 encoded string on Windows is a sequence of
wchar_t's (16 bits each with Visual C++, or WCHARs in Win32 SDK).

So, you can store a UTF-8 encoded string in std::string (or even
std::vector< char >), and a UTF-16 string in std::wstring (or even
std::vector< wchar_t >).

To convert from UTF-8 to UTF-16, you can use the
::MultiByteToWideChar() Win32 function; to convert from UTF-16 to
UTF-8, you can use the ::WideCharToMultiByte() Win32 function.
Use CP_UTF8 as the code page (this is one of the parameters required
by these functions).
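
Since you asked for an example, here is a minimal sketch of both
conversions. Only ::MultiByteToWideChar(), ::WideCharToMultiByte()
and CP_UTF8 come from the Win32 SDK; the wrapper names Utf8ToUtf16
and Utf16ToUtf8 and the error handling by exception are my own
assumptions, so adapt them to your code. The #ifdef guard is there
only because the code is Windows-specific.

```cpp
#ifdef _WIN32  // Windows-only sketch: the API calls need <windows.h>
#include <windows.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: UTF-8 (std::string) to UTF-16 (std::wstring).
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());

    // First call: ask how many wchar_t's the result needs.
    const int dstLen = ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen, NULL, 0);
    if (dstLen == 0) throw std::runtime_error("invalid UTF-8 input");

    // Second call: do the actual conversion into the buffer.
    std::wstring utf16(dstLen, L'\0');
    ::MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), srcLen,
        &utf16[0], dstLen);
    return utf16;
}

// Hypothetical wrapper: UTF-16 (std::wstring) to UTF-8 (std::string).
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    const int srcLen = static_cast<int>(utf16.size());

    // With CP_UTF8 the last two parameters must be NULL.
    const int dstLen = ::WideCharToMultiByte(
        CP_UTF8, 0, utf16.data(), srcLen, NULL, 0, NULL, NULL);
    if (dstLen == 0) throw std::runtime_error("conversion failed");

    std::string utf8(dstLen, '\0');
    ::WideCharToMultiByte(
        CP_UTF8, 0, utf16.data(), srcLen, &utf8[0], dstLen, NULL, NULL);
    return utf8;
}
#endif  // _WIN32
```

Each function calls the API twice: once with a null output buffer to
get the required length, and once to fill the buffer. Passing the
explicit input length (rather than -1) keeps the terminating NUL out
of the resulting string.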

If you want more details, you could also search
microsoft.public.vc.mfc, where Unicode has been discussed recently,
too.

I would also suggest the following links:

<http://en.wikipedia.org/wiki/UTF-8>
<http://unicode.org/unicode/faq/utf_bom.html#UTF8>
<http://www.unicode.org/unicode/faq/>

Hope this helps.

Mr.Asm