Re: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET
From: Chris Mullins (cmullins_at_yahoo.com)
Date: 04/23/04
- Next message: Frederik: "context menu folder"
- Previous message: Sunny: "Re: Windows service vs. IIS service"
- In reply to: Yan-Hong Huang[MSFT]: "RE: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET"
- Next in thread: Jon Skeet [C# MVP]: "Re: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET"
- Reply: Jon Skeet [C# MVP]: "Re: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET"
- Reply: Yan-Hong Huang[MSFT]: "Re: UTF32 CodePoints, UTF8 Combining Chars / Surrogate Pairs, and .NET"
Date: Fri, 23 Apr 2004 11:03:52 -0700
I'll try to post a clearer explanation.
I've got Unicode Code points that are larger than 0xFFFF - these are encoded
(according to the Unicode Spec) into UTF-8 using the Surrogate Pair
Algorithm. The .NET UTF8 implementation seems to properly handle Surrogate
Pairs quite well - if I encode surrogate pairs into UTF8, iterate over them
using the StringInfo methods, I get the proper graphemes.
Point 2, "Surrogate code units are illegal in UTF8", doesn't make sense. The
Unicode spec makes no mention of this that I can find, and I see no other
way of encoding 32 bit codepoints into UTF8.
At the end of the day my question is this: I need a full implementation of
RFC 3454 (aka 'StringPrep', ftp://ftp.isi.edu/in-notes/rfc3454.txt) that
works with all codepoints specified in the RFC (many of which are 32 bit
codepoints). Is this RFC possible to implement in .NET?
For example, one of the steps in StringPrep is to compare all the
codepoints in the string against the various tables of "illegal characters".
If any of the codepoints in the string matches one of the illegal
characters, the string has failed StringPrep. Many of these codepoints
require a 32 bit representation, hence in UTF8 they must be encoded as
surrogate pairs.
So far the only way I've been able to look at the Surrogate Pairs as a
single code point has been the following code:
    ' GetTextElementEnumerator is a Shared method, so no StringInfo instance is needed.
    Dim myTEE As System.Globalization.TextElementEnumerator = _
        System.Globalization.StringInfo.GetTextElementEnumerator(stringToTest)
    myTEE.Reset()
    While myTEE.MoveNext()
        Dim CodePoint As Integer
        Dim grapheme As String = myTEE.GetTextElement()
        If grapheme.Length > 1 Then
            ' Assumes any multi-Char text element is a surrogate pair.
            Dim uc As Char = grapheme.Chars(0)   ' upper (high) surrogate
            Dim lc As Char = grapheme.Chars(1)   ' lower (low) surrogate
            CodePoint = ((AscW(uc) - &HD800) * &H400) + AscW(lc) - &HDC00 + _
                &H10000
        Else
            CodePoint = AscW(grapheme)
        End If
        ' Any match against the prohibited table fails the whole string.
        If ResourcePrepTables.ContainsKey(CodePoint) Then Return False
    End While
In the code above, the ResourcePrepTables is a hash table with all of the
"illegal" characters (represented as int32) stored in it. The CodePoint
algorithm is taken from Section 3.7 of http://www.unicode.org/book/ch03.pdf
My algorithm is obviously flawed, as it's using graphemes rather than code
points, so things like combining characters are falling through the cracks -
but I am at a loss as to how else to do it.
Please, please, suggest a viable alternative!
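One possible direction, offered as a minimal sketch and not as a tested StringPrep implementation: a .NET String is a sequence of UTF-16 code units, so the string can be walked one Char at a time. A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF) is combined into a single code point, and every other Char, including each combining character, is treated as its own code point. The function name PassesProhibitedCheck and its parameter names are hypothetical, and the Hashtable is assumed to be keyed by Int32 code points like the tables described above.

```vbnet
Imports System
Imports System.Collections

Module CodePointScan
    ' Hypothetical helper: returns False as soon as any code point in s
    ' is found as a key in the prohibited table, True otherwise.
    Function PassesProhibitedCheck(ByVal s As String, _
                                   ByVal prohibited As Hashtable) As Boolean
        Dim i As Integer = 0
        While i < s.Length
            Dim cu As Integer = AscW(s.Chars(i))
            Dim codePoint As Integer = cu

            ' A high surrogate followed by a low surrogate forms one
            ' supplementary code point (same arithmetic as the code above).
            If cu >= &HD800 AndAlso cu <= &HDBFF AndAlso i + 1 < s.Length Then
                Dim lo As Integer = AscW(s.Chars(i + 1))
                If lo >= &HDC00 AndAlso lo <= &HDFFF Then
                    codePoint = ((cu - &HD800) * &H400) + (lo - &HDC00) + &H10000
                    i += 1   ' skip the low surrogate just consumed
                End If
            End If

            ' An unpaired surrogate falls through with its own value,
            ' so it can still be matched against the table.
            If prohibited.ContainsKey(codePoint) Then Return False
            i += 1
        End While
        Return True
    End Function
End Module
```

As a check on the arithmetic, the pair D834 DD1E works out to ((&HD834 - &HD800) * &H400) + (&HDD1E - &HDC00) + &H10000 = &H1D11E. A base letter followed by a combining accent is looked up as two separate code points here, not as one text element.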
--
Chris Mullins

"Yan-Hong Huang[MSFT]" <yhhuang@online.microsoft.com> wrote in message
news:q$6CYPNKEHA.3048@cpmsftngxa10.phx.gbl...
> Hi Chris,
>
> Here is the response that I got from our Windows Globalization Software
> Design Engineer.
>
> ----------------
>
> A few corrections:
>
> 1) Surrogate code units are illegal in UTF-32 (only full code points are
> acceptable).
> 2) Surrogate code units are also illegal in UTF-8 (only the 4-byte form of
> supplementary characters is acceptable).
>
> For the above, it is legal to accept them if a process wants to for
> backcompat reasons, but it is completely illegal for a conformant process
> to emit them.
>
> Note that grapheme clusters (called Text Elements in .NET) are not always
> representable as single UTF-32 code points (there are many composite forms
> that have no precomposed form in them, since the precomposed form is only
> added to Unicode for backcompat reasons).
>
> So, given the above (which seems to contradict your problem description in
> several places), what is the question, exactly?
> ------------------
>
> Thanks very much.
>
> Best regards,
> Yanhong Huang
> Microsoft Community Support
>
> Get Secure! - www.microsoft.com/security
> This posting is provided "AS IS" with no warranties, and confers no rights.
>