Re: Is this Regular Expression for UTF-8 Correct??




"Mihai N." <nmihai_year_2000@xxxxxxxxx> wrote in message
news:Xns9D78F42C2233MihaiN@xxxxxxxxxxxxxxxx

I can imagine a lot of alternative approaches, including
having a table of
65,536 "character masks" for Unicode characters

As we know, 65,536 entries (0000 through FFFF) are not enough;
Unicode code points go up to 10FFFF :-)
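To make the size gap concrete, here is a minimal Python sketch. The names BMP_SIZE, CODE_SPACE_SIZE and fits_in_bmp_table are made up for illustration; it only assumes a 65,536-entry table indexed by code point.

```python
import sys

# A 65,536-entry table covers U+0000..U+FFFF only (the Basic Multilingual Plane).
BMP_SIZE = 0x10000

# The full Unicode code space runs to U+10FFFF, i.e. 0x110000 code points.
CODE_SPACE_SIZE = sys.maxunicode + 1


def fits_in_bmp_table(ch: str) -> bool:
    """Return True if the character's code point could index a 65,536-entry table."""
    return ord(ch) < BMP_SIZE


if __name__ == "__main__":
    print(CODE_SPACE_SIZE)                   # size of the full code space
    print(fits_in_bmp_table("\u00e9"))       # LATIN SMALL LETTER E WITH ACUTE
    print(fits_in_bmp_table("\U0001D11E"))   # MUSICAL SYMBOL G CLEF, outside the BMP
```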



What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using
Unicode properties.
Each code point has attributes indicating whether it is a
letter (General Category).
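As a rough illustration of a General Category lookup, the sketch below uses Python's standard unicodedata module. The helper name is_letter is hypothetical, and treating every category starting with "L" as a letter is one possible criterion, not the only one.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Treat a character as a letter if its General Category is Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u4e2d", "1", "\u20ac", "\u0301"]:
        # Print the code point, its General Category, and the letter verdict.
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter(ch))
```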

A good starting point is this:
http://unicode.org/reports/tr31/tr31-1.html

But this only shows that basing it on some UTF-8 kind of
thing is not the way to go. And how are you going to deal
with combining characters?
Normalization?
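To show why combining characters and normalization matter, here is a small Python sketch using the standard unicodedata module. The variable names are illustrative only; the example is "e" followed by U+0301 COMBINING ACUTE ACCENT.

```python
import unicodedata

# "e" followed by a combining acute accent: two code points, one visible character.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# A non-zero combining class marks U+0301 as a combining character.
accent_class = unicodedata.combining("\u0301")

if __name__ == "__main__":
    print(len(decomposed), len(composed), accent_class)
    print(decomposed == composed)  # the raw strings differ before normalization
```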

I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
character.
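A minimal sketch of that rule in Python, assuming the input is an already-decoded code point (the function name is_alnum_simplistic is hypothetical). Note that under this rule symbols such as the euro sign U+20AC also count as alphanumeric.

```python
def is_alnum_simplistic(code_point: int) -> bool:
    """Simplistic rule: anything above ASCII is alphanumeric; ASCII uses the usual test."""
    if code_point > 0x7F:
        return True
    # Within ASCII, only A-Z, a-z and 0-9 qualify.
    return chr(code_point).isalnum()


if __name__ == "__main__":
    for cp in (ord("a"), ord("7"), ord("_"), 0x00E9, 0x20AC):
        print(f"U+{cp:04X}", is_alnum_simplistic(cp))
```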

Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but I never could find anything as simple
as a table mapping Unicode code points to their
category.
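One possible way to get such a table, assuming the Unicode data bundled with Python's standard unicodedata module is an acceptable source, is to build it once up front. The name category_table is made up for this sketch, and the categories reflect whichever Unicode version the Python build ships.

```python
import unicodedata

# One General Category string per code point, indexed by code point value.
# Surrogates are included; they report the category "Cs".
category_table = [unicodedata.category(chr(cp)) for cp in range(0x110000)]

if __name__ == "__main__":
    for cp in (0x41, 0x31, 0x20AC, 0x4E2D):
        print(f"U+{cp:04X}", category_table[cp])
```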


There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
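A quick Python sketch of the difference behind this rule of thumb: UTF-8 is compact for storage but uses one to four bytes per code point, UTF-16 uses one or two 16-bit units, and UTF-32 always uses one unit. The function name encoded_sizes is illustrative.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    utf32_units = len(ch.encode("utf-32-le")) // 4
    return (utf8_bytes, utf16_units, utf32_units)


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```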


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email


