Re: Is this Regular Expression for UTF-8 Correct??

"Mihai N." <nmihai_year_2000@xxxxxxxxx> wrote in message

I can imagine a lot of alternative approaches, including
having a table of
65,536 "character masks" for Unicode characters

As we know, 65,536 entries (U+0000 through U+FFFF) are not enough; Unicode
code points go up to U+10FFFF :-)

What is your criterion for what constitutes a "letter"?

The best way to attack the identification is by using
Unicode properties.
Each code point has attributes indicating if it is a
(General Category)
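
As an illustration of the General Category approach, here is a minimal sketch in Python, assuming the standard unicodedata module is an acceptable stand-in for the Unicode Character Database. The function names is_letter and is_alnum are hypothetical, and treating "letter" as any category starting with "L" is one possible criterion, not the only one.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if the code point's General Category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_alnum(ch: str) -> bool:
    """True for letters and decimal digits (category Nd); an assumed definition."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


if __name__ == "__main__":
    # U+0301 is a combining acute accent (category Mn), so it is not a letter
    # by this test even though it usually belongs to the letter before it.
    for ch in ("a", "\u00e9", "1", "\u0301", "\U00010400"):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_letter(ch))
```

Note that the last sample, U+10400, lies above U+FFFF, so a 65,536-entry table could not cover it.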

A good starting point is this:

But this only shows that basing that on some UTF-8 kind of thing is not
the way. And how are you going to deal with combining

I am going to handle this simplistically. Every code point
above the ASCII range will be considered an alphanumeric
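
The simplistic rule described here could be sketched roughly as follows. This is only an illustration in Python, not the poster's actual code; the name is_word_char_simplistic is hypothetical, and the handling of the ASCII range (letters and digits only) is an assumption.

```python
def is_word_char_simplistic(code_point: int) -> bool:
    """Treat every code point above ASCII as alphanumeric (the simplistic rule)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point > 0x7F:
        # Everything beyond ASCII is accepted, whatever it really is.
        return True
    # Inside ASCII, assume only A-Z, a-z and 0-9 count.
    return chr(code_point).isalnum()
```

The trade-off shows up quickly: U+2014 (a dash) and U+0301 (a combining accent) are both accepted as alphanumeric under this rule, which is why a later split by Unicode property would be needed.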

Eventually I will augment this to further divide these code
points into smaller categories. Unicode is supposed to have
a way to do this, but, I never could find anything as simple
as a table of the mapping of Unicode code points to their

There are very good reasons why the rule of thumb is:
- UTF-16 or UTF-32 for processing
- UTF-8 for storage/exchange
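
A rough sketch of that rule of thumb, in Python: decode UTF-8 at the boundary, work on code points, and encode again on the way out. Python strings are sequences of code points, so they stand in here for the UTF-32 processing form; the function names are hypothetical.

```python
def utf8_roundtrip_upper(data: bytes) -> bytes:
    """Decode UTF-8 input, process it as code points, re-encode as UTF-8."""
    text = data.decode("utf-8")      # storage/exchange form -> code points
    processed = text.upper()         # processing happens on code points
    return processed.encode("utf-8") # back to the storage/exchange form


def encoded_sizes(ch: str) -> dict:
    """Byte length of one code point in each encoding form (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    print(utf8_roundtrip_upper("h\u00e9llo".encode("utf-8")).decode("utf-8"))
    for ch in ("A", "\u00e9", "\U00010400"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The sizes make the point: UTF-8 is variable-width (1 to 4 bytes), UTF-16 needs a surrogate pair for anything above U+FFFF, and only UTF-32 has one fixed-size unit per code point.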

Mihai Nita [Microsoft MVP, Visual C++]
Replace _year_ with _ to get the real email

