Re: MultibyteToWideChar not working properly?



See below...
On Thu, 24 Apr 2008 12:14:10 -0700 (PDT), "ALEKS!" <aleksander.morgado@xxxxxxxxx> wrote:

Hi all,

I am trying to use MultiByteToWideChar function to detect invalid
encoded strings (in US-ASCII 7-bit for example). The input string I
use is UTF-8 encoded E2:82:AC (euro sign), which is not a valid
ASCII-7 string.

The call is as follows:
char *input_data = "\342\202\254"; /* octal */
****
If you have data that you think of in hex, converting it to octal is a bit roundabout; why
not write

"\xE2\x82\xAC"
?
****
long n_out = MultiByteToWideChar((UINT)20127, MB_ERR_INVALID_CHARS,
input_data, 3, NULL, 0);
****
What is 20127? Some random magical number? Perhaps a comment that this is US-ASCII 7-bit
code page, or a #define or static const UINT, would have helped...
****

When passing NULL in lpWideCharStr and 0 in cchWideChar I am querying
the function to get the number of the output wide chars after
conversion. What I expected was a return of 0 in n_out, as the input
string is not a valid ASCII-7 character, but the output I get is 3,
which means that 3 wide chars are obtained.
****
That's probably right.
****

If I pass a valid output place to store the output string, I get this
UTF-16LE string: 62:00:02:00:2C:00, which seems to be the string that
should be obtained when only reading the low 7 bits of each input byte.
But: why is the MSB not read?
****
Probably because you told it not to read it! You DID say that it is 7-bit data, so it
used only the low-order 7 bits. It will NOT treat what you clearly told it as 7-bit data
as UTF-8 encoding. So it is doing precisely the correct thing, using precisely the data
you told it to use.
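
To make the arithmetic concrete, here is a minimal sketch in portable C. It assumes the conversion for this code page does nothing more than drop the high-order bit of each byte, which is what the output quoted above suggests; the helper name mask_to_7bit is hypothetical and is not part of any Windows API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: keep only the low-order 7 bits of each input byte
   and widen the result to a 16-bit unit, one unit per input byte. */
void mask_to_7bit(const unsigned char *in, size_t n, unsigned short *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (unsigned short)(in[i] & 0x7F);  /* clear bit 7 */
    }
}
```

Applied to the bytes E2 82 AC this gives 0x0062, 0x0002, 0x002C, which stored little-endian is exactly the 62:00:02:00:2C:00 reported.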
****
Why don't I get an ERROR_NO_UNICODE_TRANSLATION or
such? Why does it work?
****
It works because it is supposed to. It is doing what you asked.

If your input string is encoded in UTF-8, then the ONLY code page you can use for the
translation is CP_UTF8. You will convert it to Unicode.

Now, you can ask it to convert the Unicode back to 20127 (US-ASCII 7-bit), and if there is
an illegal character, it will indicate that there is a problem, because
WideCharToMultiByte will set the BOOL that its LPBOOL parameter (lpUsedDefaultChar) points
to, indicating that there was a translation error.
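
A minimal sketch of that round trip follows. The names utf8_fits_code_page and CP_US_ASCII are hypothetical, the fixed buffer sizes are an assumption that only suits short test strings, and WC_NO_BEST_FIT_CHARS is an addition not mentioned above, included so that a "close enough" substitution does not hide an unrepresentable character. When not built for Windows the function only reports that it cannot check.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

#define CP_US_ASCII 20127u  /* US-ASCII 7-bit; hypothetical name for the code page */

/* Hypothetical helper. Returns 1 if every character of the UTF-8 input can
   be represented in target_cp, 0 if a default character was substituted,
   and -1 if a conversion failed (or if not built for Windows). */
int utf8_fits_code_page(const char *utf8, int len, unsigned int target_cp)
{
    assert(utf8 != NULL && len > 0);
#ifdef _WIN32
    WCHAR wide[256];   /* assumed large enough; longer input returns -1 */
    char narrow[512];
    BOOL used_default = FALSE;

    /* Step 1: decode the input as what it really is, UTF-8. */
    int n_wide = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                     utf8, len, wide, 256);
    if (n_wide == 0) return -1;

    /* Step 2: encode to the target code page and watch lpUsedDefaultChar. */
    int n_narrow = WideCharToMultiByte(target_cp, WC_NO_BEST_FIT_CHARS,
                                       wide, n_wide,
                                       narrow, (int)sizeof narrow,
                                       NULL, &used_default);
    if (n_narrow == 0) return -1;
    return used_default ? 0 : 1;
#else
    (void)target_cp;   /* no Windows conversion API available here */
    return -1;
#endif
}
```

Calling utf8_fits_code_page("\xE2\x82\xAC", 3, CP_US_ASCII) is the case from the original question; passing a different target code page covers the "any given encoding" case, as long as the input really is UTF-8.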
****

The idea is to detect if a given string is valid in a given encoding,
not only ASCII-7.
****
You can't ask it to treat UTF-8 as ASCII-7 and expect that it will translate correctly. It
will do exactly what you asked, which is to treat the input string as a sequence of 7-bit
ASCII bytes, which it does by ignoring the high-order bit (which is probably the parity
bit).
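
For the ASCII-7 case alone, the check does not need a conversion call at all: any byte with the high-order bit set is outside the 7-bit range. A minimal sketch, with the hypothetical helper name first_non_ascii; it does not address validation against other encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte whose
   high-order bit is set, or -1 if every byte is in 0x00..0x7F. */
long first_non_ascii(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] & 0x80) return (long)i;  /* not 7-bit ASCII */
    }
    return -1;
}
```

A return of -1 means the buffer is plain 7-bit ASCII; anything else is the offset of the first offending byte.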

By the way, I tried the technique I suggest above using my Locale Explorer (which you can
download from my MVP Tips site; just select the MultiByte tab) and it returns ? for the
result. If you set the lpUsedDefault radio button to "variable" it will actually tell you
it set this value to TRUE.
joe
****

Thanks for the help in advance.
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm