Re: Proposal to extend documentation about interop

From: Robert Jordan (robertj_at_gmx.net)
Date: 10/19/04


Date: Wed, 20 Oct 2004 00:19:58 +0200

John Allberg wrote:

> Hi!
>
> I think the MSDN docs about interop don't state clearly enough that a
> character encoding conversion is automatically done from Unicode to the
> character set of the computer during interop.
>
> This got me really puzzled for a few days.
>
> I've got a legacy C application (DLL) that takes in UTF-8 encoded strings in
> an array of structs. I call that C DLL from C#, which works fine as long as
> I use ANSI characters, for example English. When sending in Swedish
> characters (where the UTF-8 encoding becomes two bytes) such as åäö, the
> lowercase works fine, but the uppercase ÅÄÖ simply comes out as an invalid
> UTF-8 encoding of the character FF.
>
> I solved it by encoding the string to UTF-8 bytes and then decoding those
> bytes back into a "Unicode string" with the "Default" encoding. That way,
> when the interop converts the string from Unicode to "Default", the UTF-8
> bytes surface once again.
>
> So my suggestion is to update the MSDN docs to state this conversion
> clearly.

The automatic P/Invoke conversion can be applied only to these
legacy types:

- LPSTR (ANSI encoding, 1 byte)
- LPWSTR (Unicode encoding, 2 bytes)
- LPTSTR (platform specific, one of the above)
- BSTR
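
To make the difference between these encodings concrete, here is a small byte-level sketch. It is written in Python purely for illustration and does not touch P/Invoke; it assumes the ANSI code page is Windows-1252 (`cp1252`), which is only a guess at the original poster's setup.

```python
# Byte-level view of one character under the encodings discussed above.
# Assumption: the machine's ANSI code page is Windows-1252 (cp1252).
text = "Å"

ansi = text.encode("cp1252")     # bytes an LPSTR parameter would carry
wide = text.encode("utf-16-le")  # bytes an LPWSTR parameter would carry
utf8 = text.encode("utf-8")      # bytes the legacy C DLL expects

for name, data in (("ANSI", ansi), ("UTF-16", wide), ("UTF-8", utf8)):
    print(name, data.hex(" "))

# The single ANSI byte is not a complete UTF-8 sequence on its own,
# so a DLL that expects UTF-8 cannot decode it.
try:
    ansi.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False
```

None of the automatic conversions produces the UTF-8 byte sequence, which is why the string has to be encoded by hand.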

You cannot properly import UTF-8 because Win32 doesn't support
UTF-8 for the legacy API either.
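
The workaround described in the quoted message can be reproduced at the byte level. The following is a minimal sketch in Python, not the original C# code: `marshal_as_lpstr` and `smuggle_utf8` are hypothetical helper names, and the ANSI code page is assumed to be Windows-1252.

```python
# Sketch of the "decode UTF-8 bytes with the Default encoding" workaround.
# Assumption: the "Default" (ANSI) code page is Windows-1252.
ANSI = "cp1252"


def marshal_as_lpstr(s: str) -> bytes:
    """Mimic the Unicode -> ANSI conversion applied to an LPSTR parameter."""
    return s.encode(ANSI)


def smuggle_utf8(s: str) -> str:
    """Encode to UTF-8, then reinterpret those bytes as ANSI text.

    The result looks like mojibake, but converting it back to ANSI
    yields the original UTF-8 bytes again.
    """
    return s.encode("utf-8").decode(ANSI)


text = "ÅÄÖ åäö"
wire = marshal_as_lpstr(smuggle_utf8(text))

# The bytes reaching the native side are the UTF-8 encoding of the input.
round_trip_ok = wire == text.encode("utf-8")
print(round_trip_ok)
```

This only works while every UTF-8 byte survives the trip through the ANSI code page, which is not guaranteed for all characters or all code pages. Passing the UTF-8 bytes as a byte array instead of a string avoids the conversion altogether.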

bye
Rob


