Re: Bug in vstudio.NET 2003 codecvt facet


Tom Widmer [VC++ MVP] wrote:
> JH Trauntvein wrote:
> > Consider the following example:
> >
> > #include <sstream>
> > #include <locale>
> >
> >
> > int main()
> > {
> > std::wstringstream test;
> > char const *hello_world =
> > "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x90\xa2\x8a\x45";
> > test.imbue(std::locale("Japanese_Japan.932"));
> > test << hello_world;
>
> The line above performs (or is supposed to perform) a simple encoding
> from single char characters to single wchar_t characters, using
> ctype::widen, as follows:
>
> Effects: Applies the simplest reasonable transformation from a char
> value or sequence of char values to the corresponding charT value or
> values.223) The only characters for which unique transformations are
> required are those in the basic source character set (2.2). For any
> named ctype category with a ctype<charT> facet ctw and valid
> ctype_base::mask value M (is(M, c) || !ctw.is(M, do_widen(c)) ) is
> true.224) The second form transforms each character *p in the range
> [low, high), placing the result in dest[p-low].
>
> If you wanted to do a full conversion from a multibyte string to a wide
> character string, see below.
>
> > return 0;
> > } // main
> >
> > The string used to initialise the "hello_world" variable is the
> > japanese translation of the english, "Hello World" obtained through
> > google and is encoded using shift-jis (code page=932). The wide stream
> > uses the codecvt facet to "widen" the characters in the multi-byte
> > string but the conversion fails. If you trace this in the debugger,
> > you can find function _Mbrtowc() which invokes a macro,
> > _cpp_isleadbyte(), on the first character. This macro always fails so
> > the case that is supposed to handle the conversion does not get
> > executed. I found this while using the codecvt facet to widen
> > multi-byte strings into unicode strings.
>
> Yes, I agree this looks like a bug. The problem seems to be that
> _cpp_isleadbyte uses the global locale rather than the specific locale
> specified for the conversion.
>
> For example, this code works if and only if you comment in the
> std::locale::global line:
>
> #include <sstream>
> #include <locale>
> #include <string>
> #include <fstream>
> #include <vector>
> int main()
> {
> char const hello_world[] =
> "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x90\xa2\x8a\x45";
>
> std::locale jis("Japanese_Japan.932");
> //std::locale::global(jis);
> std::codecvt<wchar_t, char, mbstate_t> const& facet =
> std::use_facet<std::codecvt<wchar_t, char, mbstate_t> >(jis);
>
> wchar_t s[100] = {};
> char const* cpos = 0;
> wchar_t* wpos = 0;
> mbstate_t state = mbstate_t();
> facet.in(state, hello_world, hello_world + sizeof(hello_world),
> cpos, &s[0], &s[100], wpos);
>
> std::wstring ws(&s[0], wpos);
>
> return 0;
> }


The program where we found the problem was actually calling
codecvt::in() directly, as in your example above. As you point out,
that only works if I am willing to set the global locale, which I am
not, since doing so can cause unexpected side effects elsewhere (for
instance, when I am formatting data for a file with rigid syntax
rules). If I were to do that, I might as well use the C mbtowc()
function, which is much simpler than using the facet. My only reason
for using the facets is the promise, broken here, that they will work
with a locale object other than the global one.
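
For reference, here is a minimal sketch of the usage pattern under discussion: converting through the codecvt facet of a specific locale object, without touching the global locale, and checking the result instead of assuming success. The helper name widen_with_locale is hypothetical, and the buffer sizing assumes that one wchar_t per input byte is enough, which should hold for Shift-JIS but is an assumption, not a guarantee. On a library with the bug described above, I would expect this to take the error path for a non-global locale such as "Japanese_Japan.932".

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): converts a multibyte string to a
// wide string using the codecvt facet of `loc`. The global locale is never
// modified.
std::wstring widen_with_locale(std::string const& bytes, std::locale const& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter;
    converter const& facet = std::use_facet<converter>(loc);

    if (bytes.empty())
        return std::wstring();

    // Assumption: the output never needs more wchar_t units than input bytes.
    std::vector<wchar_t> out(bytes.size());

    std::mbstate_t state = std::mbstate_t();
    char const* from = bytes.data();
    char const* from_end = from + bytes.size();
    char const* from_next = from;
    wchar_t* to = &out[0];
    wchar_t* to_end = to + out.size();
    wchar_t* to_next = to;

    std::codecvt_base::result r =
        facet.in(state, from, from_end, from_next, to, to_end, to_next);

    // Treat anything other than a complete, successful conversion as a failure
    // so that a misbehaving facet is reported rather than silently ignored.
    if (r != std::codecvt_base::ok || from_next != from_end)
        throw std::runtime_error("multibyte to wide conversion failed");

    return std::wstring(to, to_next);
}
```

A caller would pass the Shift-JIS bytes together with std::locale("Japanese_Japan.932"); constructing that locale throws std::runtime_error on systems where the name is not recognised.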

Regards,

Jon Trauntvein
