Re: Replace special characters
- From: "Alex K. Angelopoulos" <aka(at)mvps.org>
- Date: Fri, 13 Mar 2009 13:44:29 -0400
Actually, it looks like this thing centers around the categories as defined by VBScript's regular expression engine.
I don't know a great deal about Unicode - just some things from mucking around with high-order special characters a few years ago. What you're saying is actually rooted in the regular expression engine, not the LCID - although it could come out to the same thing if a particular regex engine uses arbitrary OS APIs to classify characters.
I did some testing, using the following code in WSH to see what characters are considered "word characters" by VBScript:
' List every character above ASCII that VBScript's \w treats as a word character.
Dim rx, i, c
Set rx = New RegExp
rx.Pattern = "\w"
For i = 128 To 65535
    c = ChrW(i)
    If rx.Test(c) Then WScript.Echo i, c
Next
The answer is pretty wacky: exactly one character in that range counts as a word character, the following one (if you can see it):
İ
That's the "Latin Capital Letter I with Dot Above", which has a character code of 304 (0x130).
If you read the documentation, the definition for the \w sequence says
"Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'."
so it looks like it's roughly correct, with the exception of the dotted-capital I.
In any case, what it apparently comes down to is that the VBScript regex engine is NOT going to work for arbitrary high character sets. This is going to require an engine that does support Unicode for a robust solution.
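For comparison, here is a minimal sketch of the same "[^ \w]" idea in an engine that does classify characters by Unicode properties. It uses Python's re module rather than VBScript, so it is a swapped-in technique and not a fix for the VBScript engine. The helper name replace_special is hypothetical, and the underscore is added to the pattern only because the original blacklist treated "_" as special.

```python
import re

# For str patterns, Python's re treats \w as Unicode-aware, so letters and
# digits from any script count as word characters.
# "_" is listed separately because \w includes the underscore.
SPECIAL = re.compile(r"[^\w ]|_")


def replace_special(title):  # hypothetical helper name
    """Replace each character that is not a letter, digit, or space with '-'."""
    return SPECIAL.sub("-", title)


print(replace_special("Héllo, wörld!"))
print(replace_special("你好!"))
```

One caveat I have not tested thoroughly: combining marks (used in some scripts for vowel signs and diacritics) may not count as word characters here, so check with real sample data before relying on this.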
"Paul Randall" <paulr90@xxxxxxx> wrote in message news:uvNGNf2oJHA.1172@xxxxxxxxxxxxxxxxxxxxxxx
Hi, Alex.
I'm thinking that too. I don't know Unicode very well, but I'm thinking that a particular Unicode code point might be a special character in one LCID but not in some other LCID. Or maybe it is not LCID dependent. Knowing some code point/LCID combinations would make it easy to look them up in the code charts at Unicode.org.
-Paul Randall
"Alex K. Angelopoulos" <aka(at)mvps.org> wrote in message news:O1JGPBxoJHA.1288@xxxxxxxxxxxxxxxxxxxxxxx
Could you post the unicode character codes for a few sample characters you do want filtered and don't want filtered out so I can check something? I'm guessing that the character set doesn't match high unicode characters properly. It's theoretically possible to do filtering for a range of character codes, but I'm suspicious that you may also have special characters in those high ranges that you want filtered out and using a simple range won't work for that. Some example characters might help me check the possibilities.
"Gabriela" <frohlinger@xxxxxxxxx> wrote in message news:fb3d5f5e-6962-40b7-a8bb-f8a6c822923c@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Mar 8, 9:50 pm, Gabriela <frohlin...@xxxxxxxxx> wrote:
On Mar 8, 6:53 pm, "Alex K. Angelopoulos" <aka(at)mvps.org> wrote:
> You haven't explained precisely what you mean by "special characters", but
> I'm guessing you mean anything that is not a word character and is not a
> space. For that, try this regular expression:
> oreg_exp.Pattern = "[^ \w]"
> "Gabriela" <frohlin...@xxxxxxxxx> wrote in message
>news:c3a12250-0107-45b8-a580-1668ca7c7119@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > Hi,
> > I am trying to write a function that receives strings from all unicode
> > characters - and replaces all special characters (!@#$%^&*()><?...)
> > with "-", but literals as is.
> > I've tried to use regular expression with all special chars on ASCII
> > table - but it still doesn't cover everything.
> > I cannot use a "whitelist" (literals that are allowed) instead of a
> > "blacklist" (special chars NOT allowed) - because I don't know the
> > letters of all languages I'd like to support (English, Latin, Chinese,
> > Arabic...).
> > Any ideas what I can do?
> > Thanks,
> > Gabi.
> > This is my "black list" regular expression code - but it does not
> > succeed always...
> > dim oreg_exp
> > set oreg_exp = new RegExp
> > 'oreg_exp.Pattern = "[^a-z0-9]"
> > oreg_exp.Pattern = "([{}\(\)\^$&._%#!@=<>:;,~`'\' \*\?\/\+\|\[\\\\]|\]|\-)"
> > oreg_exp.IgnoreCase = true
> > oreg_exp.global = true
> > title = oreg_exp.replace (title,"-")
> > Set oreg_exp = Nothing
That's all I needed. The little "[^ \w]" - thanks a lot!!
Ohhh, no, this helps only for English chars. When I tried it with
other languages, it didn't work - their letters were replaced by the
regular expression too. I need to support all literals in all languages...
Gabi.