Re: Is this Regular Expression for UTF-8 Correct??




"Mihai N." <nmihai_year_2000@xxxxxxxxx> wrote in message
news:Xns9D78F42C2233MihaiN@xxxxxxxxxxxxxxxx

I can imagine a lot of alternative approaches, including
having a table of 65,536 "character masks" for Unicode
characters.

As we know, 65,536 entries (0000 through FFFF) are not enough;
Unicode code points go up to 10FFFF :-)



What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using
Unicode properties.
Each code point has attributes indicating whether it is a
letter (General Category).

A good starting point is this:
http://unicode.org/reports/tr31/tr31-1.html

But this only shows that basing that on some UTF-8 kind of
thing is not the way. And how are you going to deal with
combining characters? Normalization?
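
To make the combining-character question concrete, here is a minimal sketch. It assumes Python and its standard unicodedata module, neither of which is mentioned in the thread, and the helper name describe() is made up for illustration. It shows the same visible letter as one code point and as two, and that the two forms only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def describe(text):
    """Return (code point, General Category) pairs for each code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

print(describe(composed))      # one letter (category Ll)
print(describe(decomposed))    # a letter (Ll) plus a nonspacing mark (Mn)

# The raw code point sequences differ, so a plain comparison fails.
print(composed == decomposed)

# After normalizing both to NFC the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A scanner that only accepts "letters" would have to decide what to do with the Mn code point in the second form, or normalize its input first.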

I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
character.
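
A rough sketch of that simplistic rule, written in Python purely for illustration (the function name is_alnum_simplistic is hypothetical, and the real implementation would work on code points decoded from UTF-8):

```python
def is_alnum_simplistic(code_point):
    """Simplistic rule: ASCII letters and digits count as alphanumeric,
    and so does every code point above the ASCII range."""
    if code_point > 0x7F:
        return True
    return chr(code_point).isalnum()

# ASCII behaves as expected.
print(is_alnum_simplistic(ord("a")), is_alnum_simplistic(ord("_")))

# Everything above ASCII is accepted, including code points that are
# not letters at all, such as U+00A0 NO-BREAK SPACE and
# U+00D7 MULTIPLICATION SIGN.
print(is_alnum_simplistic(0x00A0), is_alnum_simplistic(0x00D7))
```

The last line shows the known cost of this shortcut: non-ASCII punctuation, symbols, and spaces are all treated as alphanumeric until the finer categories are added.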

Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but I never could find anything as simple
as a table mapping Unicode code points to their category.
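
For what it is worth, such a table can be built from data that already exists. The sketch below assumes Python, whose standard unicodedata module exposes the General Category of each code point; the names build_category_table and is_letter are made up for illustration. (The Unicode Character Database also publishes the same information in UnicodeData.txt, where the third semicolon-separated field is the General Category.)

```python
import unicodedata

def build_category_table(limit=0x110000):
    """Return a list indexed by code point holding its General Category.

    The default limit covers the whole Unicode range, 0 through 10FFFF,
    not just the first 65,536 code points.
    """
    return [unicodedata.category(chr(cp)) for cp in range(limit)]

def is_letter(table, code_point):
    """Letters are the categories starting with 'L' (Lu, Ll, Lt, Lm, Lo)."""
    return table[code_point].startswith("L")

table = build_category_table()
print(len(table))                  # number of entries in the table
print(table[0x41], table[0x30])    # 'A' and '0'
print(is_letter(table, 0x1D400))   # a letter outside the first 65,536
print(is_letter(table, 0x0301))    # a combining mark is not a letter
```

The exact categories depend on the Unicode version bundled with the Python build, so newer code points may show up as unassigned (Cn) on older installations.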


There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
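
One way to picture that rule of thumb, sketched in Python only as an illustration (the function name count_units and the sample string are assumptions, not anything from the thread): decode the UTF-8 bytes once at the boundary, work on code points, and encode again on the way out.

```python
def count_units(text):
    """Compare how one string measures in different encoding forms."""
    return {
        "code_points": len(text),                          # UTF-32 units
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# Bytes as they would arrive from a file or the network.
stored = "\u00dcn\u00efc\u00f6d\u00e9 \U0001d400".encode("utf-8")

# Decode once at the boundary, then process code points, not bytes.
text = stored.decode("utf-8")
print(count_units(text))

# Encode back to UTF-8 only when storing or exchanging the result.
assert text.encode("utf-8") == stored
```

The counts differ for the same text, which is why per-character logic written directly against UTF-8 bytes has to track variable-length sequences everywhere, while logic written against decoded code points does not.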


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email





