Re: CString and UTF-8




"David Wilkinson" <no-reply@xxxxxxxxxxxx> wrote in message
news:ej7sV%231IHHA.536@xxxxxxxxxxxxxxxxxxxxxxx
Sunray wrote:

It does. CString will try to interpret the multibyte characters, because
it is based on the CRT, which does that. I refer the reader to the
internationalisation section of MSDN, which I spent all day reading
yesterday. If it didn't, why would my code work on the English version of
Windows and not on the Japanese version?

The locale determines how this behaves. I suggest you try setting up a
machine with a non-standard locale and see how CString behaves before
saying what you have said. My code example does *exactly* what you are
saying it doesn't do. CString treats the UTF-8 characters as multibyte
characters; there are two Chinese characters in there. Unfortunately,
because of the code page, the way it does this is wrong, causing it to
miss the quotes. What it's doing is only useful if I am trying to process
a purely multibyte string. What I am processing is an ASCII string with
UTF-8 embedded in it, delimited with quotes. As a pre-condition, the
UTF-8 never contains quotes, so the multibyte handling is irrelevant and
clearly unhelpful in this instance.

Sunray:

Not sure what you are saying here. Does CString::GetLength() in an ANSI
build just return the number of bytes or not? In my understanding, it does,
regardless of whether the string is a valid UTF-8 string, a local MBCS
string, or invalid. [Likewise, in Unicode mode it just returns the number
of wchar_t's, not the number of characters.]

I must say I was always very confused by this, because the CString
documentation somehow implies that CString is MBCS-aware. But apart from
conversion constructors to UTF-8 and esoteric features of
CString::Compare() I don't think it is.

David Wilkinson

It always appears to return the length in bytes of the string. It wouldn't
make sense to do anything else because you'd lose information if you wanted
to process the chars yourself.
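
To make the byte/character distinction concrete, here is a minimal sketch
in standard C++ (std::string rather than CString, so it does not depend on
MFC). The helper name utf8_code_points is made up for illustration, and it
assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes well-formed UTF-8; it performs no validation.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Byte length, which is what a length-in-bytes API reports.
std::size_t byte_length(const std::string& s) {
    return s.size();
}
```

For a quoted string holding two three-byte Chinese characters, e.g. the
bytes 22 E4 B8 AD E6 96 87 22, byte_length gives 8 while
utf8_code_points gives 4.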

What seems to be a problem for me is when I use the Find function on a
code page 932 machine with UTF-8 characters in the string. It appears to
search through the string in an MBCS way, and with the string of characters
I provided (which I know is missing the terminator) it's expecting Japanese
characters, which means it skips the quotes. Perhaps it's expecting another
byte to complete a Japanese character. The character set it is using for
this is not UTF-8. If it were, that string would comprise valid UTF-8
Chinese characters.
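
Here is a rough sketch of the effect I think I'm seeing, written in
standard C++ so it runs anywhere. It is NOT the CString::Find
implementation; find_char_mbcs932 is a made-up stand-in that pairs each
code page 932 lead byte (0x81-0x9F, 0xE0-0xFC) with the byte after it, the
way an MBCS-aware scan would. find_char_bytes is a plain byte-wise search
for comparison.

```cpp
#include <cstddef>
#include <string>

// Lead-byte ranges for code page 932 (Shift-JIS).
bool is_cp932_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical MBCS-style search: a lead byte and the byte that follows it
// are treated as one double-byte character, so the trail byte is never
// compared against the character being searched for.
std::size_t find_char_mbcs932(const std::string& s, char ch,
                              std::size_t start = 0) {
    std::size_t i = start;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (is_cp932_lead_byte(c)) {
            i += 2;  // skip lead byte and its presumed trail byte
            continue;
        }
        if (s[i] == ch) {
            return i;
        }
        ++i;
    }
    return std::string::npos;
}

// Plain byte-wise search. Safe for ASCII delimiters in UTF-8, because
// UTF-8 never uses bytes below 0x80 inside a multi-byte sequence.
std::size_t find_char_bytes(const std::string& s, char ch,
                            std::size_t start = 0) {
    return s.find(ch, start);
}
```

With the bytes 22 E4 B8 AD E6 96 87 22 and a search for the closing quote
starting at index 1, the byte-wise search finds it at index 7. The
MBCS-style scan pairs E4 with B8, steps over AD, pairs E6 with 96, then
treats 87 as a lead byte and swallows the closing quote as its trail byte,
so it reports nothing found.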

Setting the LC_CTYPE for the code page to 1252 and then getting the
multibyte code page stops this behaviour. I'm hoping that this will solve a
lot of problems for me.

Alex

