Re: get_innerHTML on body returns ? characters for UNICODE text



Craig Swearingen <swearing@xxxxxxxxxx> wrote:
> 1) When I use IHTMLElement::get_innerHTML on a BODY element I can get
> back the HTML and text but if it has UNICODE text within it I see '?'
> characters for it instead of UNICODE character references like
> '&#27599;' for each character.

get_innerHTML returns a Unicode string. There is no need to escape those
characters with character references; they are represented directly as
Unicode codepoints. Whatever you are using to view the content evidently
cannot render those characters (or converts the string to a non-Unicode
code page first), and that is why you see question marks.

> IHTMLTxtRange::get_htmlText has the
> same problem. How can I tell it that I want the UNICODE character
> references too?

You can't. If you want that, you need to post-process the string and
replace all Unicode codepoints you deem "unsafe" with character
references.
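
As a rough illustration of that post-processing step, here is a minimal
sketch. It assumes the string returned by get_innerHTML has already been
copied into a std::u16string, and it treats every codepoint at or above
U+0080 as "unsafe"; both are assumptions made for the example, and
EscapeNonAscii is a hypothetical helper, not part of MSHTML.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an MSHTML API): replaces every codepoint outside
// the ASCII range with a decimal numeric character reference such as &#27599;.
// The "unsafe" threshold (>= 0x80) is an assumption; adjust it as needed.
std::u16string EscapeNonAscii(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high/low surrogate pair into a single codepoint.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        // An unpaired surrogate is not a valid codepoint; substitute U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char16_t>(cp));  // ASCII passes through
        } else {
            out += u"&#";
            for (char digit : std::to_string(cp)) {
                out.push_back(static_cast<char16_t>(digit));
            }
            out.push_back(u';');
        }
    }
    return out;
}
```

Markup and ASCII text pass through unchanged, so the result is still the
same HTML, only with the non-ASCII characters spelled out as references.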

> 2) If I do a save using a IPersistStreamInit I get what I expect for
> UNICODE text too.

This method gives you the data exactly as sent by the server, with no
processing at all. Among other things, this means that it is not
necessarily Unicode text: the server could have sent the content in some
legacy encoding.

> However, this method doesn't notice changes if
> you've been editing. Is there a way to submit the changes such that
> the stream will be the current state before I get a copy of it?

None that I know of.
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925

