Dangerous behavior of CString



Hi All:

I recently converted a moderately sized VC6 MFC application to "Unicode". The app had been written (by me!) with some attention to TCHAR, the _T macro, and all that, but not very carefully. The first compilation under Unicode produced several hundred errors, and it took me a couple of days to get rid of them. But I was thankful that C++ is a strongly typed language...

I then started to test my app with strings from different languages (Russian in particular), and was surprised to find that in some places the strings were displayed correctly, but in others they were not. It turned out that the ones that failed were all cases where I had concatenated strings. Consider the following:

CString str(_T("Hello "));
str += "world.";		// 8-bit string!!
AfxMessageBox(str);

In a non-Unicode build this obviously works fine, but in a Unicode build it seems that it should not compile. But it does! The reason: the relevant overloads of operator+= are

const CString& CString::operator+=(LPCTSTR lpsz);
const CString& CString::operator+=(const CString& string);

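In a Unicode build LPCTSTR is a wide-character pointer, so the first overload cannot accept the 8-bit literal. The compiler therefore uses the second one, building a temporary CString by means of the converting constructor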

CString::CString(LPCSTR lpsz);

using the current code page to do the conversion.
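The mechanism can be reproduced without MFC. The following is a minimal sketch, assuming a hypothetical WideStr class that stands in for CString in a Unicode build. It is not the MFC implementation: the byte-by-byte widening is only a placeholder for the code-page conversion, and the narrow_appends counter exists purely to make the hidden conversion visible.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for CString in a Unicode build; NOT the MFC class.
class WideStr {
public:
    WideStr(const wchar_t* s) : data_(s) {}

    // Non-explicit converting constructor, analogous to CString(LPCSTR).
    // The naive widening below is only a placeholder for the code-page
    // conversion that the real class performs.
    WideStr(const char* s) : converted_(true) {
        for (; *s != '\0'; ++s)
            data_ += static_cast<wchar_t>(static_cast<unsigned char>(*s));
    }

    WideStr& operator+=(const wchar_t* s) { data_ += s; return *this; }

    WideStr& operator+=(const WideStr& other) {
        data_ += other.data_;
        if (other.converted_) ++narrow_appends_;  // record the silent conversion
        return *this;
    }

    const std::wstring& str() const { return data_; }
    int narrow_appends() const { return narrow_appends_; }

private:
    std::wstring data_;
    bool converted_ = false;   // true if built from a narrow string
    int narrow_appends_ = 0;   // appends that went through the conversion
};

inline WideStr build_greeting() {
    WideStr s(L"Hello ");
    // Compiles: const char* -> temporary WideStr -> operator+=(const WideStr&)
    s += "world.";
    assert(s.narrow_appends() == 1);
    return s;
}
```

If the const char* constructor in this sketch were declared explicit, the line s += "world."; would no longer compile, which is the diagnostic one would have wanted here.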

Unfortunately, in the Unicode version of my app I do not want to use the current code page. I have 8-bit strings in my business logic that are UTF-8 and 16-bit strings in my Windows code that are UTF-16. I need to write the above code as

CString str(_T("Hello "));
str += CU2T("world.");
AfxMessageBox(str);

where CU2T converts a UTF-8 LPCSTR to LPTSTR (UTF-16 in a Unicode build). Of course in this case it makes no difference, because "world." is plain ASCII and so is encoded the same in UTF-8 as in the current code page (at least on my machine), but a general UTF-8 string would be decoded with the wrong encoding by the original version. This is why my Russian UTF-8 strings were displayed as garbage.
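The implementation of CU2T is not shown in this post. On Windows such a conversion would normally go through MultiByteToWideChar with CP_UTF8. Purely as an illustration of what the conversion has to do, here is a portable sketch with a hypothetical utf8_to_utf16 function; the name, the use of std::u16string, and the policy of replacing malformed input with U+FFFD are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the CU2T from the post): decodes UTF-8 bytes into
// UTF-16 code units. Malformed input yields U+FFFD and skips one byte.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = lead;    // code point being assembled
        std::size_t extra = 0;      // continuation bytes expected
        std::uint32_t min_cp = 0;   // smallest code point allowed for this length
        bool ok = true;

        if (lead < 0x80)                { /* ASCII: cp is already complete */ }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else                            { ok = false; }  // stray continuation or invalid lead

        if (ok && extra >= in.size() - i) ok = false;    // truncated sequence

        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;          // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            out += static_cast<char16_t>(0xFFFD);
            i += 1;
        } else if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
            i += 1 + extra;
        } else {
            const std::uint32_t v = cp - 0x10000;        // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (v >> 10));
            out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
            i += 1 + extra;
        }
    }
    return out;
}

// Usage sketch: ASCII passes through unchanged, Cyrillic needs real decoding.
inline void utf8_demo() {
    assert(utf8_to_utf16("world.") == u"world.");
    // "мир" (Russian) as UTF-8 bytes: D0 BC D0 B8 D1 80
    assert(utf8_to_utf16("\xD0\xBC\xD0\xB8\xD1\x80") == u"\u043C\u0438\u0440");
}
```

The ASCII case shows why the problem stays hidden with English test strings: every byte below 0x80 maps to the same code unit whichever 8-bit encoding is assumed.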

Thus the implicit conversion constructor prevents the compiler from telling me that my code is not as I intended.

Yet another reason not to design classes with implicit conversion features, and yet another reason not to use CString. This would not have happened to me if I had used std::basic_string<TCHAR> typedef'd as tstring.
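The claim about the standard string can be checked at compile time. This is a small sketch, assuming std::wstring as a stand-in for the Unicode-build tstring; can_append is a hypothetical detection trait written for this example, not a library facility.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical trait: true if `s += x` compiles for string type S and
// argument type Arg, false otherwise.
template <class S, class Arg, class = void>
struct can_append : std::false_type {};

template <class S, class Arg>
struct can_append<S, Arg,
    decltype(void(std::declval<S&>() += std::declval<Arg>()))>
    : std::true_type {};

// Appending a wide literal to a wide string is fine.
static_assert(can_append<std::wstring, const wchar_t*>::value,
              "wide += wide should compile");

// Appending a narrow string to a wide string is rejected by the compiler,
// because std::wstring has no converting constructor from const char*.
static_assert(!can_append<std::wstring, const char*>::value,
              "wide += narrow should NOT compile");
```

With such a type, the mixed-width concatenation is flagged during the Unicode port instead of being converted silently.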

Does anybody use these implicit 8-bit <---> 16-bit conversion features in CString? Until recently I did not know they existed, and I certainly would prefer they did not.

David Wilkinson





