Re: Help with UNICODE.



If it was done without any consideration for Unicode, you have to do several things:

Find every occurrence of 'char' in your code. Unless demonstrably required to represent
8-bit characters, you must replace them with 'TCHAR'.

char -> TCHAR
char * -> LPTSTR
const char * -> LPCTSTR

Then you must find all literal strings and character literals and put _T() around them, e.g.,

"cat" -> _T("cat")
'A' -> _T('A')

You must find and fix all occurrences of any str-function in your code, e.g.,
strcpy -> _tcscpy
strcmp -> _tcscmp
etc.

However, this is complicated because the names in tchar.h do not follow a single pattern.
The str functions become _tcsXXX for strXXX, but other functions are renamed differently,
e.g., sprintf -> _stprintf, atoi -> _ttoi, fopen -> _tfopen. So you have to look at tchar.h
to figure out which one to use.
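
A minimal sketch of the substitutions above, assuming the usual tchar.h names (_tcscpy,
_tcscmp, _tcslen). The #else branch is a hypothetical stand-in, added only so the fragment
also compiles where tchar.h does not exist; is_cat and copy_and_measure are made-up names.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

#if defined(_WIN32)
#include <tchar.h>              // the real generic-text mappings
#else
// Hypothetical stand-ins for platforms without tchar.h (illustration only).
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define _T(x)   L##x
#define _tcscpy wcscpy
#define _tcscmp wcscmp
#define _tcslen wcslen
#else
typedef char TCHAR;
#define _T(x)   x
#define _tcscpy strcpy
#define _tcscmp strcmp
#define _tcslen strlen
#endif
#endif

// Before: bool is_cat(const char *s) { return strcmp(s, "cat") == 0; }
bool is_cat(const TCHAR *s)
{
    return _tcscmp(s, _T("cat")) == 0;
}

// Before: char name[16]; strcpy(name, "cat"); return strlen(name);
std::size_t copy_and_measure()
{
    TCHAR name[16];
    _tcscpy(name, _T("cat"));   // 3 characters plus terminator fit in 16
    return _tcslen(name);       // length in characters, not bytes
}
```

The same source then builds as 8-bit or Unicode depending only on whether _UNICODE is
defined.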

Make sure you have not accidentally confused BYTE and char. A BYTE is always an unsigned
8-bit quantity (e.g., a pixel value in a bitmap), while char is signed by default in VC++
(unless you compile with /J) and usually represents a text character. Anything that is
really BYTE data must stay 8-bit and must not be changed to TCHAR.
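
A small sketch of why the distinction matters, assuming plain char is signed (the VC++
default). The function names are hypothetical; signed char and unsigned char are spelled
out so the behavior does not depend on the compiler's default.

```cpp
// Widening a signed 8-bit value sign-extends it.
int widen_signed(signed char c)
{
    return c;
}

// Widening an unsigned 8-bit value (what BYTE is) keeps it in 0..255.
int widen_unsigned(unsigned char b)
{
    return b;
}
```

A pixel value of 0xFF stored in a signed char therefore comes back as a negative number
when widened, while the same bits stored as an unsigned 8-bit value come back as 255.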

Make sure that any time you have used sizeof() on a character array, you now multiply or
divide by sizeof(TCHAR) when needed.

char p[] = "This is a test";
WriteFile(h, p, strlen(p), &bytesWritten, NULL);

has to become

TCHAR p[] = _T("This is a test");
WriteFile(h, p, _tcslen(p)*sizeof(TCHAR), &bytesWritten, NULL);

because _tcslen returns the length in CHARACTERS, and WriteFile wants a length in BYTES. So
to get the correct length, you have to multiply by sizeof(TCHAR).
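
The character-count versus byte-count arithmetic can be sketched without any Windows
headers. This assumes a Unicode build, where TCHAR is wchar_t and _tcslen is wcslen;
byte_length is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cwchar>

// Number of BYTES occupied by the characters of s, excluding the terminator.
// This is the value _tcslen(p) * sizeof(TCHAR) yields in a Unicode build.
std::size_t byte_length(const wchar_t *s)
{
    return std::wcslen(s) * sizeof(wchar_t);
}
```

For "This is a test" the character count is 14, so on Windows, where wchar_t is 2 bytes,
the byte count passed to WriteFile would be 28.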

But since the size of p is actually known at compile time, had you written

WriteFile(h, p, sizeof(p), &bytesWritten, NULL);

it has to remain

TCHAR p[] = _T("This is a test");
WriteFile(h, p, sizeof(p), &bytesWritten, NULL);

because you want to write all the bytes (note that sizeof(p), unlike _tcslen, also counts
the terminating NUL character). Similarly, had you written

char buffer[MAX_PATH];
GetModuleFileName(NULL, buffer, sizeof(buffer));

you now have to write

TCHAR buffer[MAX_PATH];
GetModuleFileName(NULL, buffer, sizeof(buffer)/sizeof(TCHAR));

because in a Unicode build sizeof(buffer) is 2 * MAX_PATH (sizeof is ALWAYS in bytes), but
you only have MAX_PATH characters available, and GetModuleFileName wants the buffer size
in characters, so you have to compensate for it.
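
One way to avoid writing the division by hand is a small template that deduces the element
count at compile time. This is a sketch; element_count and demo_buffer_chars are
hypothetical names, and VC++ 2005 ships a similar _countof macro.

```cpp
#include <cstddef>

// Element count of a true array. Passing a pointer does not compile,
// which catches the classic sizeof(pointer) mistake.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N])
{
    return N;
}

std::size_t demo_buffer_chars()
{
    wchar_t buffer[260];            // 260 stands in for MAX_PATH
    return element_count(buffer);   // characters, not bytes
}
```

The call would then read GetModuleFileName(NULL, buffer, element_count(buffer)) in both
8-bit and Unicode builds.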

Life gets dicey if you have to deal with 8-bit external data, such as network messages,
files, etc. Here you really do have to use char, and convert the 8-bit data to Unicode
data. Exactly what the best strategy for this is depends on your application domain.

For pure Unicode conversion, you can use the A2W macro to convert 8-bit to 16-bit data.
You can also use CStringA to hold 8-bit data and pass it around to other CStringA values.
But if you use a CStringW (or a CString, in Unicode mode) there will be an automatic
conversion if you use a constructor, and a compiler error if you try to use assignment.

If you need more control over what is going on, or need some encoding like UTF-8, you can
call MultiByteToWideChar yourself, and WideCharToMultiByte to convert back. (For the simple
case A2W and W2A will do the job; if the code must also compile as an ANSI build, use A2T
and T2A instead.)
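
To show what an 8-bit to 16-bit conversion does, here is a portable sketch for one specific
case only: ISO-8859-1 (Latin-1) input, where each byte value equals its Unicode code point.
This is NOT MultiByteToWideChar and does not handle code pages or UTF-8; real code pages
such as Windows-1252 differ from Latin-1 in the range 0x80-0x9F. latin1_to_utf16 is a
hypothetical name.

```cpp
#include <string>

// Widen Latin-1 bytes to UTF-16 code units, one code unit per input byte.
std::u16string latin1_to_utf16(const std::string &bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)   // unsigned, so bytes >= 0x80 are not sign-extended
        out.push_back(static_cast<char16_t>(b));
    return out;
}
```

Note the unsigned char in the loop: going through a signed char first would turn 0xE9 into
0xFFE9 instead of 0x00E9.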

A Unicode (UTF-16) file will often, though it is not required to, start with the bytes
FF FE (little-endian, such as x86 machines) or FE FF (big-endian, such as 68K, PowerPC,
Sparc), to indicate the "endianness" of the Unicode characters. This is referred to as the
Byte Order Mark (BOM). If you open a file and find this, you have high confidence it is
Unicode, and you actually know the encoding. You then discard the BOM, since it is not part
of the text. If you don't find it, you have reasonable confidence it is a file of 8-bit
bytes. In general, you might want to extend the file-open dialog with a dropdown with
options such as "Open as Unicode", "Open as 8-bit characters", or "Auto-detect" (which
should be the default) to give the user specific control when the BOM is either absent or
misleading.
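
The auto-detect step can be sketched as a check of the first two bytes of the file. This
covers only the two UTF-16 marks described above; the Bom enum and detect_bom are
hypothetical names, and head is assumed to hold the first bytes read from the file.

```cpp
#include <string>

enum class Bom { None, Utf16LittleEndian, Utf16BigEndian };

// Inspect the first two bytes of a file for a UTF-16 byte order mark.
Bom detect_bom(const std::string &head)
{
    if (head.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(head[0]);
        const unsigned char b1 = static_cast<unsigned char>(head[1]);
        if (b0 == 0xFF && b1 == 0xFE) return Bom::Utf16LittleEndian;
        if (b0 == 0xFE && b1 == 0xFF) return Bom::Utf16BigEndian;
    }
    return Bom::None;   // no mark found: probably 8-bit data, but not guaranteed
}
```

When a mark is found, the caller would skip those two bytes before decoding the rest.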

That may be about 95% of what you need to do to make things Unicode-compliant. I've
usually managed to do these edits in a few days, and it almost always works correctly in
all cases. But I've been nuked occasionally by missing a key 8-to-16, 16-to-8, or
never-translate-from-8-bit situation.

This is why I always now code "Unicode-aware", and in those places where I have to worry
about it but it isn't critical, I make sure to leave comments and have a piece of code
that won't compile if Unicode is enabled (some clients don't want to pay for full Unicode
compliance in the deliverable).
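
One possible form of such a compile-time tripwire is an #error guarded by _UNICODE. This is
only a sketch of the idea; ansi_only_length is a hypothetical function standing in for code
that has not yet been made Unicode-aware.

```cpp
#include <cstddef>
#include <cstring>

#ifdef _UNICODE
#error "ansi_only_length() assumes 8-bit text; make it Unicode-aware before enabling Unicode"
#endif

// Not yet Unicode-aware: takes char, not TCHAR, on purpose.
std::size_t ansi_only_length(const char *name)
{
    return std::strlen(name);
}
```

The 8-bit build is unaffected, and the first Unicode build stops at exactly the spot that
still needs attention.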
joe
On Wed, 19 Jul 2006 20:01:02 -0700, William GS <WilliamGS@xxxxxxxxxxxxxxxxxxxxxxxxx>
wrote:

Hello everybody. I have a project created with VS Wizard (VC6), I have to
compile it with UNICODE compliance in VS2005; what settings have I to change?
is there another change to do?

Thanks in advance,
William GS
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm