Re: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET


From: Chris Mullins (cmullins_at_yahoo.com)
Date: 04/23/04


Date: Fri, 23 Apr 2004 11:03:52 -0700

I'll try to post a clearer explanation.

I've got Unicode code points that are larger than 0xFFFF. These are encoded
(according to the Unicode spec) into UTF-8 using the surrogate pair
algorithm. The .NET UTF8 implementation seems to handle surrogate
pairs quite well: if I encode surrogate pairs into UTF-8 and iterate over them
using the StringInfo methods, I get the proper graphemes.

Point 2, "Surrogate code units are illegal in UTF-8", doesn't make sense to
me. The Unicode spec makes no mention of this that I can find, and I see no
other way of encoding 32-bit code points into UTF-8.

At the end of the day my question is this: I need a full implementation of
RFC 3454 (aka 'StringPrep', ftp://ftp.isi.edu/in-notes/rfc3454.txt) that
works with all codepoints specified in the RFC (many of which are 32 bit
codepoints). Is this RFC possible to implement in .NET?

For example, one of the steps in StringPrep is to compare all the
code points in the string against the various tables of "illegal characters".
If any of the code points in the string matches one of the illegal
characters, the string has failed StringPrep. Many of these code points
require a 32-bit representation, hence in UTF-8 they must be encoded as
surrogate pairs.

So far the only way I've been able to look at the Surrogate Pairs as a
single code point has been the following code:

' GetTextElementEnumerator is a Shared method, so no StringInfo instance is needed.
Dim myTEE As System.Globalization.TextElementEnumerator = _
    System.Globalization.StringInfo.GetTextElementEnumerator(stringToTest)
myTEE.Reset()
While myTEE.MoveNext()
    Dim CodePoint As Integer
    Dim grapheme As String = myTEE.GetTextElement()
    If grapheme.Length > 1 Then
        ' Assumes the first two UTF-16 code units are a high/low surrogate pair.
        Dim uc As Char = grapheme.Chars(0)
        Dim lc As Char = grapheme.Chars(1)
        CodePoint = ((AscW(uc) - &HD800) * &H400) + (AscW(lc) - &HDC00) + &H10000
    Else
        CodePoint = AscW(grapheme)
    End If

    ' Fail as soon as any code point appears in the prohibited table.
    If ResourcePrepTables.ContainsKey(CodePoint) Then Return False
End While

In the code above, the ResourcePrepTables is a hash table with all of the
"illegal" characters (represented as int32) stored in it. The CodePoint
algorithm is taken from Section 3.7 of http://www.unicode.org/book/ch03.pdf
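
As a quick check of that arithmetic, here is a minimal worked example. The module and function names are made up for illustration; only the formula comes from the loop above.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SurrogateMath
    ' Combine a UTF-16 high/low surrogate pair into one code point,
    ' using the same arithmetic as the loop above.
    Function ToCodePoint(ByVal high As Char, ByVal low As Char) As Integer
        Return ((AscW(high) - &HD800) * &H400) + (AscW(low) - &HDC00) + &H10000
    End Function

    Sub Main()
        ' U+1D11E is stored in UTF-16 as the pair D834 DD1E:
        '   (&HD834 - &HD800) * &H400 = &HD000
        '   (&HDD1E - &HDC00)         = &H11E
        '   &HD000 + &H11E + &H10000  = &H1D11E
        Dim cp As Integer = ToCodePoint(ChrW(&HD834), ChrW(&HDD1E))
        Console.WriteLine(Hex(cp))   ' prints 1D11E
    End Sub
End Module
```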

My algorithm is obviously flawed, as it's using graphemes rather than code
points, so things like combining characters are falling through the cracks,
but I am at a loss to find any other way to do it.

Please, please, suggest a viable alternative!
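
One possible direction, offered only as a sketch: walk the UTF-16 code units of the string directly instead of its text elements, so that a base letter and a combining mark are each checked on their own. This assumes .NET Framework 2.0 or later, where Char.IsSurrogatePair and Char.ConvertToUtf32 are available. The module name, function name, and the one-entry sample table are hypothetical; the real table contents would have to come from RFC 3454.

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module CodePointScan
    ' Returns False as soon as any code point of s is a key in prohibited.
    Function PassesProhibitedCheck(ByVal s As String, _
                                   ByVal prohibited As Hashtable) As Boolean
        Dim i As Integer = 0
        While i < s.Length
            Dim cp As Integer
            If Char.IsSurrogatePair(s, i) Then
                ' Two UTF-16 code units form one supplementary code point.
                cp = Char.ConvertToUtf32(s, i)
                i += 2
            Else
                ' A BMP character (or an unpaired surrogate) is taken as-is.
                cp = AscW(s.Chars(i))
                i += 1
            End If
            If prohibited.ContainsKey(cp) Then Return False
        End While
        Return True
    End Function

    Sub Main()
        ' Hypothetical sample table with a single entry (U+E0001 LANGUAGE TAG).
        Dim prohibited As New Hashtable()
        prohibited.Add(&HE0001, True)

        Console.WriteLine(PassesProhibitedCheck("abc", prohibited))                                    ' True
        Console.WriteLine(PassesProhibitedCheck("a" & Char.ConvertFromUtf32(&HE0001), prohibited))     ' False
        ' "e" followed by U+0301 is one text element but two code points;
        ' each is looked up separately here.
        Console.WriteLine(PassesProhibitedCheck("e" & ChrW(&H301), prohibited))                        ' True
    End Sub
End Module
```

Because the loop advances by code unit and only joins genuine surrogate pairs, a multi-character text element such as a letter plus combining accent is never mistaken for a surrogate pair.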

-- 
Chris Mullins
"Yan-Hong Huang[MSFT]" <yhhuang@online.microsoft.com> wrote in message
news:q$6CYPNKEHA.3048@cpmsftngxa10.phx.gbl...
> Hi Chris,
>
> Here is the response that I got from our Windows Globalization Software
> Design Engineer.
>
> ----------------
>
> A few corrections:
>
> 1) Surrogate code units are illegal in UTF-32 (only full code points are
> acceptable).
> 2) Surrogate code units are also illegal in UTF-8 (only the 4-byte form of
> supplementary characters is acceptable).
>
> For the above, it is legal to accept them if a process wants to for
> backcompat reasons, but it is completely illegal for a conformant process
> to emit them.
>
> Note that grapheme clusters (called Text Elements in .NET) are not always
> representable as single UTF-32 code points (there are many composite forms
> that have no precomposed form, since precomposed forms are only
> added to Unicode for backcompat reasons).
>
> So, given the above (which seems to contradict your problem description in
> several places), what is the question, exactly?
> ------------------
>
> Thanks very much.
>
> Best regards,
> Yanhong Huang
> Microsoft Community Support
>
> Get Secure! - www.microsoft.com/security
> This posting is provided "AS IS" with no warranties, and confers no rights.
>
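
To make point 2 of the quoted reply concrete, the sketch below encodes one supplementary code point with the framework's UTF-8 encoder. It assumes .NET Framework 2.0 or later for Char.ConvertFromUtf32; the module name is made up for illustration.

```vbnet
Imports System
Imports System.Text

Module Utf8Demo
    Sub Main()
        ' U+1D11E as a .NET String: two UTF-16 code units (a surrogate pair).
        Dim s As String = Char.ConvertFromUtf32(&H1D11E)
        Console.WriteLine(s.Length)                       ' prints 2

        ' The UTF-8 encoder combines the pair and emits one 4-byte sequence.
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(s)
        Console.WriteLine(BitConverter.ToString(bytes))   ' prints F0-9D-84-9E
    End Sub
End Module
```

The surrogate pair exists only in the UTF-16 string; the UTF-8 output is a single 4-byte sequence for U+1D11E. Encoding each surrogate as its own 3-byte sequence (ED A0 B4 ED B4 9E) would be the form the reply describes as illegal for a conformant process to emit.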

