Re: Want Input boxes to accept unicode strings on Standard Window


"David Wilkinson" <no-reply@xxxxxxxxxxxx> wrote in message
news:eJWXQErzHHA.4712@xxxxxxxxxxxxxxxxxxxxxxx
David Ching wrote:

Ah, UTF-8. I know you discussed this at length several months ago here,
but to be honest, this is my understanding of it: it is an 8-bit
encoding scheme no different than Ansi (that's how it fits in 8 bits).
Since it is 8 bits, it cannot specify everything a LPWSTR can. Yet it is
somehow supposed to be better than Ansi, not reliant on any codepage.
But if it's only 8 bits, how is that?

And UTF-8 begs the question about UTF-16. Is UTF-16 the same as what
Windows Notepad (in the Save As dialog) calls "Unicode"? Or is Windows'
concept of Unicode and LPWSTR different from UTF-16?

David:

Both UTF-8 and UTF-16 are complete encodings of Unicode. UTF-8 uses up to
four 8-bit code units per code point, and UTF-16 uses up to two 16-bit
code units.
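
To make the "up to four" and "up to two" counts concrete, here is a minimal sketch of both encoders. The function names encode_utf8 and encode_utf16 are made up for this illustration, and the code assumes the input is a valid code point (U+0000 to U+10FFFF, not a surrogate); it does no validation.

```cpp
#include <string>

// Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes.
// Assumes cp is a valid Unicode scalar value; no error checking.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: beyond the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as 1 or 2 UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {                    // one unit: the code point itself
        out += static_cast<char16_t>(cp);
    } else {                               // two units: a surrogate pair
        char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

With these, U+0041 is one byte in UTF-8 and one unit in UTF-16, U+20AC (the euro sign) is three bytes and one unit, and U+1F600 is four bytes and two units.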

Yes, thanks. For some reason I had thought UTF-8 was SBCS (since it was 8
bits) and not MBCS. Even an Ansi codepage can be MBCS, so UTF-8 and Ansi
are really different schemes for the same idea. Makes sense now! :-)


When "Windows Unicode" first started out, all code points could be
represented by one 16-bit code unit, but no longer. Modern Windows Unicode
*is* UTF-16. The Windows ANSI code pages are (I think) all SBCS or DBCS, so UTF-8
cannot be used as a code page (at any rate, it is not the ANSI code page
for any language).

Some say, and I agree, that now that there are surrogate pairs in UTF-16,
it holds no advantage over UTF-8.

Not to offend anyone, but I recently developed a small product in 30
languages. The languages were selected to match the ones where Windows had
a native SKU. UTF-16 was fine for this; we never worried about surrogate
pairs. I had understood surrogate pairs were only used for a few Han
dialects in Chinese, and perhaps a couple of other languages, but they weren't
mainstream by any means. How long before UTF-16 *really* does not work for
all practical purposes?
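
One way to see where "one 16-bit unit per character" stops holding is to count code points rather than code units. This is only a sketch: count_code_points and the two predicates are names invented here, and std::u16string is used instead of std::wstring so the code behaves the same on platforms where wchar_t is not 16 bits.

```cpp
#include <cstddef>
#include <string>

// High (lead) surrogates occupy 0xD800..0xDBFF, low (trail) 0xDC00..0xDFFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: count code points in UTF-16 text.
// A well-formed surrogate pair counts once; an unpaired surrogate
// is simply counted as one item rather than rejected.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

For text confined to the Basic Multilingual Plane, count_code_points(s) equals s.size(), which is why code that ignores surrogates appears to work. The two values differ as soon as the text contains a code point above U+FFFF.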


Many Linux systems use UTF-8 as their native encoding, but this will never
happen in Windows.


The way you've explained UTF-8, it has all the disadvantages of MBCS (in
fact it is an MBCS) and is thus very hard to parse. I'm not sure why any
modern OS would want to be built internally on it.
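
On the parsing point, one difference from the older double-byte code pages is worth a sketch: in UTF-8 every continuation byte has the bit pattern 10xxxxxx and no lead byte does, so a parser can tell where a character starts from any byte offset. The helper names below are invented for this illustration, and the code assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Continuation bytes are exactly those of the form 10xxxxxx.
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical helper: number of code points in well-formed UTF-8,
// found by counting every byte that is not a continuation byte.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Hypothetical helper: given any byte offset pos < s.size(), step back
// to the first byte of the character that contains it.
std::size_t utf8_char_start(const std::string& s, std::size_t pos) {
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

Because ASCII bytes never appear inside a multi-byte sequence, byte-oriented searches for characters such as '/' or '<' also keep working on UTF-8 text.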


This does not mean that a Windows program cannot use UTF-8 internally. In
fact the whole back end of my application uses UTF-8. XML serialization is
just one of the things this back end does.


I take it STL string is UTF-8 friendly? ;) Seriously, what library do you
use to represent UTF-8 in memory? I understood STL string (often typedef'd to
be tstring) is just a UTF-16 string like CStringW. I did not see any UTF-8
capable string that is widespread. What are you using?
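
For context on the question: std::string is std::basic_string<char>, a sequence of bytes with no encoding attached, so it can hold UTF-8 as-is. The wide std::wstring is the one whose units are 16 bits on Windows. A back end could keep UTF-8 in std::string and convert only at the API boundary. On Windows that conversion is normally done with MultiByteToWideChar and CP_UTF8. The portable sketch below shows the same idea with a made-up function name, and it assumes well-formed input; it does not validate.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-8 held in a std::string to UTF-16.
// Assumes well-formed UTF-8; malformed input yields garbage, not an error.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // continuation bytes that follow the lead byte
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {  // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

Note that size() on a std::string holding UTF-8 reports bytes, not characters, so indexing and truncation need the same care as with any MBCS string.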

Thanks,
David

