Re: Is this Regular Expression for UTF-8 Correct??
- From: "Peter Olcott" <NoSpam@xxxxxxxxxxxxxx>
- Date: Fri, 14 May 2010 08:36:21 -0500
"Mihai N." <nmihai_year_2000@xxxxxxxxx> wrote in message
news:Xns9D78F42C2233MihaiN@xxxxxxxxxxxxxxxx
I can imagine a lot of alternative approaches, including
having a table of
65,536 "character masks" for Unicode characters
As we know, 65,536 entries (0 through FFFF) are not enough;
Unicode code points go up to 10FFFF :-)
What is your criterion for what constitutes a "letter"?
The best way to attack the identification is by using
Unicode properties.
Each code point has an attribute (General Category)
indicating whether it is a letter.
A good starting point is this:
http://unicode.org/reports/tr31/tr31-1.html
But this only shows that basing that on some UTF-8 kind of
thing is not the way. And how are you going to deal with
combining characters?
Normalization?
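
To make the General Category idea concrete, here is a minimal
sketch in Python using the standard unicodedata module. The
helper names is_letter and is_combining_mark are made up for
illustration; treating every "L*" category as a letter and every
"M*" category as a combining mark is one possible criterion, not
the only one.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if the General Category is a letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def is_combining_mark(ch: str) -> bool:
    """True if the General Category is a mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

# "e" followed by COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
```

Note that the classification works on code points, so the same
code handles characters beyond FFFF (for example U+10400) without
any special casing.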
I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
character.
Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but I could never find anything as simple
as a table mapping Unicode code points to their category.
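
A rough sketch of the simplistic rule described above, in Python.
The function names are hypothetical. The second helper assumes the
semicolon-separated layout of UnicodeData.txt from the Unicode
Character Database, where the first field is the code point in hex
and the third field is the General Category; it ignores the
First/Last range entries that file uses for large blocks, so it is
a starting point rather than a complete loader.

```python
from typing import Tuple

def is_alnum_simplistic(code_point: int) -> bool:
    """Treat every code point above the ASCII range as alphanumeric."""
    if code_point > 0x7F:
        return True
    # Within ASCII, only letters and digits count.
    return chr(code_point).isalnum()

def parse_unicode_data_line(line: str) -> Tuple[int, str]:
    """Return (code point, General Category) from one UnicodeData.txt line."""
    fields = line.split(";")
    return int(fields[0], 16), fields[2]

# The UnicodeData.txt entry for U+0041 LATIN CAPITAL LETTER A.
SAMPLE_LINE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
```

The simplistic rule over-accepts: U+00D7 MULTIPLICATION SIGN, for
instance, is a math symbol but is classified as alphanumeric here.
A table built from lines like SAMPLE_LINE is one way to refine it
later.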
There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
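
As an illustration of that rule of thumb, a small Python sketch:
the UTF-8 bytes are decoded once at the boundary, the processing
works on whole code points, and UTF-8 is only used again for
storage. The function name count_non_ascii and the sample string
are made up for this example.

```python
def count_non_ascii(utf8_bytes: bytes) -> int:
    """Count code points above the ASCII range in UTF-8 encoded input."""
    # Decode at the boundary; malformed UTF-8 raises UnicodeDecodeError here.
    text = utf8_bytes.decode("utf-8")
    # Processing sees one item per code point, never a partial byte sequence.
    return sum(1 for ch in text if ord(ch) > 0x7F)

# Ten code points, two of them outside ASCII (U+00EF and U+00E9).
stored = "na\u00efve caf\u00e9".encode("utf-8")
```

The stored form is 12 bytes long while the decoded text has 10
code points, which is exactly the mismatch that makes byte-level
processing of UTF-8 awkward.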
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email