Re: Replace special characters



Actually, it looks like this thing centers around the categories as defined by VBScript's regular expression engine.

I don't know a great deal about Unicode - just some things from mucking around with high-order special characters a few years ago. What you're describing is actually determined by the regular expression engine, not the LCID - although it could amount to the same thing if a particular regex engine uses OS APIs to classify characters.

I did some testing, using the following code in WSH to see what characters are considered "word characters" by VBScript:

' List every character code above 127 that VBScript's \w matches
Set rx = New RegExp
rx.Pattern = "\w"
For i = 128 To 65535
    c = ChrW(i)
    If rx.Test(c) Then WScript.Echo i, c
Next

The answer is pretty wacky - exactly one character above code 127 counts as a word character, the following one (if you can see it):
İ
That's the "Latin Capital Letter I with Dot Above", which has a character code of 304 (0x130).

If you read the documentation, the definition for the \w sequence says
"Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'."
so it looks like it's roughly correct, with the exception of the dotted-capital I.

In any case, what it apparently comes down to is that the VBScript regex engine is NOT going to work for arbitrary high character sets. This is going to require an engine that does support Unicode for a robust solution.
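
For comparison, here is a minimal sketch of the same "[^ \w]" replacement in an engine that does classify characters by Unicode properties. It uses Python 3's standard re module, where \w is Unicode-aware for text patterns; the function name replace_special is just an illustrative choice, not anything from the original script. One caveat I'm assuming matters for some scripts: combining marks may not count as word characters, so test with your real data before relying on it.

```python
import re

# Anything that is neither a word character nor a space.
# In Python 3, \w in a str pattern covers letters and digits from
# all scripts (plus underscore), not just A-Z, a-z and 0-9.
SPECIAL = re.compile(r"[^\w ]")

def replace_special(text, replacement="-"):
    """Replace every non-word, non-space character with `replacement`.

    `replace_special` is a hypothetical helper name used for illustration.
    """
    return SPECIAL.sub(replacement, text)

if __name__ == "__main__":
    print(replace_special("Grüße, 你好!"))  # Grüße- 你好-
```

The pattern is the same one suggested earlier in the thread; only the engine underneath it changes.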

"Paul Randall" <paulr90@xxxxxxx> wrote in message news:uvNGNf2oJHA.1172@xxxxxxxxxxxxxxxxxxxxxxx
Hi, Alex
I'm thinking that too. I don't know Unicode very well, but I'm thinking that a particular Unicode code point might be a special character in one LCID but not in some other LCID. Or maybe it is not LCID dependent. Knowing some code point/LCID combinations would make it easy to look them up in the code charts at Unicode.org.

-Paul Randall

"Alex K. Angelopoulos" <aka(at)mvps.org> wrote in message news:O1JGPBxoJHA.1288@xxxxxxxxxxxxxxxxxxxxxxx
Could you post the unicode character codes for a few sample characters you do want filtered and don't want filtered out so I can check something? I'm guessing that the character set doesn't match high unicode characters properly. It's theoretically possible to do filtering for a range of character codes, but I'm suspicious that you may also have special characters in those high ranges that you want filtered out and using a simple range won't work for that. Some example characters might help me check the possibilities.

"Gabriela" <frohlinger@xxxxxxxxx> wrote in message news:fb3d5f5e-6962-40b7-a8bb-f8a6c822923c@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Mar 8, 9:50 pm, Gabriela <frohlin...@xxxxxxxxx> wrote:
On Mar 8, 6:53 pm, "Alex K. Angelopoulos" <aka(at)mvps.org> wrote:



> You haven't explained precisely what you mean by "special characters", but
> I'm guessing you mean anything that is not a word character and is not a
> space. For that, try this regular expression:

> oreg_exp.Pattern = "[^ \w]"

> "Gabriela" <frohlin...@xxxxxxxxx> wrote in message

>news:c3a12250-0107-45b8-a580-1668ca7c7119@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

> > Hi,
> > I am trying to write a function that receives strings from all unicode
> > characters - and replaces all special characters (!@#$%^&*()><?...)
> > with "-", but literals as is.
> > I've tried to use regular expression with all special chars on ASCII
> > table - but it still doesn't cover everything.
> > I cannot use a "whitelist" (literals that are allowed) instead of a
> > "blacklist" (special chars NOT allowed) - because I don't know the
> > letters of all languages I'd like to support (English, Latin, Chinese,
> > Arabic...).
> > Any ideas what I can do?
> > Thanks,
> > Gabi.

> > This is my "black list" regular expression code - but it does not
> > succeed always...

> > dim oreg_exp
> > set oreg_exp = new RegExp
> > 'oreg_exp.Pattern = "[^a-z0-9]"
> > oreg_exp.Pattern = "([{}\(\)\^$&._%#!@=<>:;,~`'\' \*\?\/\+\|\[\\\\]|\]|\-)"
> > oreg_exp.IgnoreCase = true
> > oreg_exp.global = true
> > title = oreg_exp.replace (title,"-")
> > Set oreg_exp = Nothing

That's all I needed. The little "[^ \w]" - thanks a lot!!

Ohhh, no, this helps only for English chars. When I tried it with
other languages, it didn't work - their letters were removed by the regular
expression. I need to support all literals in all languages...
Gabi.


