Re: Bug in vstudio.NET 2003 codecvt facet
- From: "JH Trauntvein" <j.trauntvein@xxxxxxxxxxx>
- Date: 30 Nov 2005 10:25:36 -0800
Tom Widmer [VC++ MVP] wrote:
> JH Trauntvein wrote:
> > Consider the following example:
> >
> > #include <sstream>
> > #include <locale>
> >
> >
> > int main()
> > {
> > std::wstringstream test;
> > char const *hello_world =
> > "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x90\xa2\x8a\x45";
> > test.imbue(std::locale("Japanese_Japan.932"));
> > test << hello_world;
>
> The line above performs (or is supposed to perform) a simple encoding
> from single char characters to single wchar_t characters, using
> ctype::widen, as follows:
>
> Effects: Applies the simplest reasonable transformation from a char
> value or sequence of char values to the corresponding charT value or
> values. The only characters for which unique transformations are
> required are those in the basic source character set (2.2). For any
> named ctype category with a ctype<charT> facet ctw and valid
> ctype_base::mask value M (is(M, c) || !ctw.is(M, do_widen(c)) ) is
> true. The second form transforms each character *p in the range
> [low, high), placing the result in dest[p-low].
>
> If you wanted to do a full conversion from a multibyte string to a wide
> character string, see below.
>
> > return 0;
> > } // main
> >
> > The string used to initialise the "hello_world" variable is the
> > japanese translation of the english, "Hello World" obtained through
> > google and is encoded using shift-jis (code page=932). The wide stream
> > uses the codecvt facet to "widen" the characters in the multi-byte
> > string but the conversion fails. If you trace this in the debugger,
> > you can find function _Mbrtowc() which invokes a macro,
> > _cpp_isleadbyte(), on the first character. This macro always fails so
> > the case that is supposed to handle the conversion does not get
> > executed. I found this while using the codecvt facet to widen
> > multi-byte strings into unicode strings.
>
> Yes, I agree this looks like a bug. The problem seems to be that
> _cpp_isleadbyte uses the global locale rather than the specific locale
> specified for the conversion.
>
> For example, this code works if and only if you comment in the
> std::locale::global line:
>
> #include <sstream>
> #include <locale>
> #include <string>
> #include <fstream>
> #include <vector>
> int main()
> {
>     char const hello_world[] =
>         "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x90\xa2\x8a\x45";
>
>     std::locale jis("Japanese_Japan.932");
>     //std::locale::global(jis);
>     std::codecvt<wchar_t, char, mbstate_t> const& facet =
>         std::use_facet<std::codecvt<wchar_t, char, mbstate_t> >(jis);
>
>     wchar_t s[100] = {};
>     char const* cpos = 0;
>     wchar_t* wpos = 0;
>     mbstate_t state = mbstate_t();
>     facet.in(state, hello_world, hello_world + sizeof(hello_world),
>         cpos, &s[0], &s[100], wpos);
>
>     std::wstring ws(&s[0], wpos);
>
>     return 0;
> }
The program where we found the problem was actually calling
codecvt::in() as you describe above. As you point out, this only
works if I am willing to set the global locale, which I am not, since
doing so can cause unexpected side effects elsewhere (for instance,
when I am formatting data for a file with rigid syntax rules). If I
were to do that, I might as well use the C mbtowc() function, which is
much simpler than using the facet. My only reason for using the facets
is the broken promise that they will work with a local locale.
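
For reference, the usage pattern I am after looks roughly like the
sketch below. The helper name widen_with_locale is made up for this
post and does not come from our program. It takes the codecvt facet
from the locale passed in and never touches the global locale. On a
conforming library that should be enough. With the library discussed
in this thread, the same call reportedly gives the right answer for
code page 932 only when the global locale has been set to match.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (name invented for illustration): convert a
// multi-byte string to a wide string using the codecvt facet of the
// locale that is passed in, leaving the global locale alone.
std::wstring widen_with_locale(std::string const& narrow, std::locale const& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    facet_type const& facet = std::use_facet<facet_type>(loc);

    if (narrow.empty())
        return std::wstring();

    // Each wide character consumes at least one input byte, so one
    // wchar_t per byte is enough room for the output.
    std::wstring wide(narrow.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    char const* from_begin = narrow.data();
    char const* from_end = from_begin + narrow.size();
    char const* from_next = from_begin;
    wchar_t* to_begin = &wide[0];
    wchar_t* to_next = to_begin;

    std::codecvt_base::result rc = facet.in(
        state, from_begin, from_end, from_next,
        to_begin, to_begin + wide.size(), to_next);

    // Treat anything other than a complete conversion as a failure.
    if (rc != std::codecvt_base::ok)
        throw std::runtime_error("multi-byte conversion failed or was incomplete");

    // Trim the buffer to the number of wide characters actually written.
    wide.resize(to_next - to_begin);
    return wide;
}
```

A call would look like widen_with_locale(text, std::locale("Japanese_Japan.932")),
where the locale name is the Windows-specific one used earlier in this
thread. Checking the result of in() matters here because a failed
conversion would otherwise leave a silently truncated string.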
Regards,
Jon Trauntvein