Re: Is this Regular Expression for UTF-8 Correct??

"Mihai N." <nmihai_year_2000@xxxxxxxxx> wrote in message

I can imagine a lot of alternative approaches, including
having a table of
65,536 "character masks" for Unicode characters

As we know, 65,536 entries (0000 through FFFF) are not enough; Unicode
code points go up to 10FFFF :-)
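
To put numbers on that, here is a small sketch in Python (the language and the
helper name fits_in_bmp_table are my own choices for illustration, not
something from this thread). It compares a 65,536-entry table with the full
code space and checks which sample characters such a table could index.

```python
# Size of the Unicode code space versus a 65,536-entry table.
BMP_SIZE = 0x10000                     # 65,536 code points: U+0000..U+FFFF
MAX_CODE_POINT = 0x10FFFF              # last valid Unicode code point
CODE_SPACE_SIZE = MAX_CODE_POINT + 1   # 1,114,112 code points in total


def fits_in_bmp_table(ch: str) -> bool:
    """Return True if a 65,536-entry table can index this character."""
    return ord(ch) < BMP_SIZE


# A, e with acute, euro sign, musical symbol G clef (U+1D11E)
examples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]
for ch in examples:
    print(f"U+{ord(ch):04X} fits in a 65,536-entry table: {fits_in_bmp_table(ch)}")
```

Only the last sample falls outside the table, but it is a perfectly valid
code point, so a design limited to 65,536 entries cannot classify it.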

What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using
Unicode properties.
Each code point has attributes indicating if it is a
(General Category)
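
As a rough illustration of a property-based test, the sketch below uses
Python's standard unicodedata module, whose category() function returns the
two-letter General Category of a code point. The helper names is_letter and
is_combining_mark are hypothetical, and treating "category starts with L" as
the definition of a letter is one possible criterion, not the only one.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Assumed criterion: General Category is one of the L* (letter) classes."""
    return unicodedata.category(ch).startswith("L")


def is_combining_mark(ch: str) -> bool:
    """General Category is one of the M* (mark) classes."""
    return unicodedata.category(ch).startswith("M")


# Latin A, Cyrillic zhe, digit 5, euro sign, combining acute accent
for ch in ["A", "\u0436", "5", "\u20ac", "\u0301"]:
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
          f"letter={is_letter(ch)} mark={is_combining_mark(ch)}")
```

Note that the euro sign and the combining acute accent are both above the
ASCII range, yet neither is a letter by this criterion.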

A good starting point is this:

But this only shows that basing this on some UTF-8 kind of
thing is not the way. And how are you going to deal with combining

I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
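
A minimal sketch of that simplistic rule, in Python, with a comparison
against the General Category so its blind spots are visible. The function
name is_alnum_simplistic is hypothetical, and the handling of the ASCII
range (letters and digits only) is my assumption about what the rule means
there.

```python
import unicodedata


def is_alnum_simplistic(ch: str) -> bool:
    """Simplistic rule: anything above ASCII counts as alphanumeric."""
    if ord(ch) > 0x7F:
        return True
    # Assumption: within ASCII, only A-Z, a-z and 0-9 qualify.
    return ch.isalnum()


# ASCII letter, underscore, e with acute, euro sign, combining acute accent
for ch in ["a", "_", "\u00e9", "\u20ac", "\u0301"]:
    print(f"U+{ord(ch):04X} simplistic={is_alnum_simplistic(ch)} "
          f"category={unicodedata.category(ch)}")
```

The rule accepts the euro sign (a currency symbol) and a bare combining
accent as alphanumeric, which is the kind of case the finer categories are
meant to separate.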

Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but I never could find anything as simple
as a table of the mapping of Unicode code points to their
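
One way such a table could be built, sketched in Python under the
assumption that the standard unicodedata module (which is generated from the
Unicode Character Database) is an acceptable source. The names
build_category_ranges and category_of are hypothetical. Because neighbouring
code points often share a category, the sketch stores only the start of each
run and looks values up with a binary search instead of keeping one entry
per code point.

```python
import bisect
import unicodedata


def build_category_ranges():
    """Return (starts, cats): sorted run starts and the category of each run."""
    starts, cats = [], []
    prev = None
    for cp in range(0x110000):          # every code point, U+0000..U+10FFFF
        cat = unicodedata.category(chr(cp))
        if cat != prev:                 # a new run begins here
            starts.append(cp)
            cats.append(cat)
            prev = cat
    return starts, cats


STARTS, CATS = build_category_ranges()


def category_of(cp: int) -> str:
    """Look up the General Category of a code point via binary search."""
    return CATS[bisect.bisect_right(STARTS, cp) - 1]


print(len(STARTS), "runs cover", 0x110000, "code points")
print(category_of(0x41), category_of(0x0301), category_of(0x1D11E))
```

The same two arrays could be dumped to a source file for use from another
language; the exact contents depend on the Unicode version the Python build
ships with.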

There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
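
A short Python sketch of what that rule of thumb is reacting to; the sample
string is my own. It encodes the same text three ways and counts code units.
In UTF-8 a code point takes one to four bytes, in UTF-16 one or two 16-bit
units, and in UTF-32 always exactly one 32-bit unit.

```python
# "naive" with a diaeresis on the i, a space, then musical G clef (U+1D11E)
text = "na\u00efve \U0001D11E"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")    # explicit byte order, so no BOM is added
utf32 = text.encode("utf-32-le")

print("code points :", len(text))
print("UTF-8 bytes :", len(utf8))
print("UTF-16 units:", len(utf16) // 2)
print("UTF-32 units:", len(utf32) // 4)

# Storage/exchange round trip: decode on input, process, encode on output.
assert utf8.decode("utf-8") == text
```

With UTF-32 the unit count equals the code point count, so indexing a
property table is direct. UTF-16 needs a surrogate pair for the last
character, and UTF-8 has to be decoded before any per-character
classification.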

Mihai Nita [Microsoft MVP, Visual C++]
Replace _year_ with _ to get the real email