Re: Replace special characters

"Gabriela" <frohlinger@xxxxxxxxx> wrote in message
news:c9eebb78-e173-404e-bb55-731c3b000ff9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Mar 14, 8:28 am, Gabriela <frohlin...@xxxxxxxxx> wrote:
On Mar 13, 7:44 pm, "Alex K. Angelopoulos" <aka(at)mvps.org> wrote:





Actually, it looks like this thing centers around the categories as defined
by VBScript's regular expression engine.

I don't know a great deal about Unicode - just some things from mucking
around with high-order special characters a few years ago. What you're
saying is actually rooted in the regular expression engine, not the LCID -
although it could come out to the same thing if a particular regex engine
uses arbitrary OS APIs to classify characters.

I did some testing, using the following code in WSH to see what characters
are considered "word characters" by VBScript:

' Echo every code point from 128 to 65535 that \w matches.
Set rx = New RegExp: rx.Pattern = "\w"
For i = 128 To 65535
    c = ChrW(i)
    If rx.Test(c) Then WScript.Echo i, c
Next

The answer is pretty whacky - exactly 1 character is a letter, the following
one (if you can see it):
İ
That's the "Latin Capital Letter I with Dot Above", which has a character
code of 304 (0x130).

If you read the documentation, the definition for the \w sequence says
"Matches any word character including underscore. Equivalent to
'[A-Za-z0-9_]'."
so it looks like it's roughly correct, with the exception of the
dotted-capital I.

In any case, what it apparently comes down to is that the VBScript regex
engine is NOT going to work for arbitrary high character sets. This is going
to require an engine that does support Unicode for a robust solution.
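
Since the test above suggests that \w is effectively ASCII-only in this
engine, one possible workaround inside VBScript is to exclude the whole
non-ASCII range from the negated class explicitly. The following is only a
minimal sketch: the function name ReplaceAsciiSpecials is hypothetical, and
it assumes that every code point from U+0080 up should be kept and that the
engine accepts \uXXXX escapes inside a character class.

```vbscript
' Hypothetical helper (the name is illustrative, not from the thread).
' Replaces every character that is not a space, not an ASCII word
' character (\w), and not in the range U+0080..U+FFFF with "-".
Function ReplaceAsciiSpecials(ByVal text)
    Dim rx
    Set rx = New RegExp
    rx.Global = True
    ' Assumption: all non-ASCII code points are treated as "keep".
    rx.Pattern = "[^ \w\u0080-\uFFFF]"
    ReplaceAsciiSpecials = rx.Replace(text, "-")
End Function

WScript.Echo ReplaceAsciiSpecials("a!b?c")
```

The trade-off is that non-ASCII punctuation, such as curly quotes, passes
through unchanged, so this only covers the ASCII special characters.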

"Paul Randall" <paul...@xxxxxxx> wrote in message

news:uvNGNf2oJHA.1172@xxxxxxxxxxxxxxxxxxxxxxx

Hi, Alex
I'm thinking that too. I don't know Unicode very well, but I'm thinking
that a particular Unicode code point might be a special character in one
LCID but not in some other LCID. Or maybe it is not LCID dependent.
Knowing some code point/LCID combinations would make it easy to look them
up in the code charts at Unicode.org.

-Paul Randall

"Alex K. Angelopoulos" <aka(at)mvps.org> wrote in message
news:O1JGPBxoJHA.1288@xxxxxxxxxxxxxxxxxxxxxxx
Could you post the Unicode character codes for a few sample characters
you do want filtered and don't want filtered out so I can check
something? I'm guessing that the character set doesn't match high Unicode
characters properly. It's theoretically possible to do filtering for a
range of character codes, but I'm suspicious that you may also have
special characters in those high ranges that you want filtered out, and
using a simple range won't work for that. Some example characters might
help me check the possibilities.
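
One way to collect the character codes asked for here, without pasting the
characters themselves into a post, is to print each character's code point
from a script. This is a minimal sketch for WSH; the name DumpCodePoints is
hypothetical, and the sample string is made up for illustration.

```vbscript
' Hypothetical helper: prints the position and code point (as U+XXXX)
' of every character in a string.
Sub DumpCodePoints(ByVal text)
    Dim i, ch, code
    For i = 1 To Len(text)
        ch = Mid(text, i, 1)
        ' AscW returns a signed 16-bit value; mask it to 0..65535.
        code = AscW(ch) And &HFFFF&
        WScript.Echo i, "U+" & Right("0000" & Hex(code), 4)
    Next
End Sub

' Sample input built with ChrW so the source file stays pure ASCII.
' ChrW(8217) is the right single quotation mark, a "curly" apostrophe.
DumpCodePoints "it" & ChrW(8217) & "s"
' The third line printed is: 3 U+2019
```

Building the sample with ChrW avoids the copy-and-paste conversion of
unusual characters into their plain ASCII look-alikes.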

"Gabriela" <frohlin...@xxxxxxxxx> wrote in message
news:fb3d5f5e-6962-40b7-a8bb-f8a6c822923c@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Mar 8, 9:50 pm, Gabriela <frohlin...@xxxxxxxxx> wrote:
On Mar 8, 6:53 pm, "Alex K. Angelopoulos" <aka(at)mvps.org> wrote:

You haven't explained precisely what you mean by "special characters",
but I'm guessing you mean anything that is not a word character and is
not a space. For that, try this regular expression:

oreg_exp.Pattern = "[^ \w]"

"Gabriela" <frohlin...@xxxxxxxxx> wrote in message

news:c3a12250-0107-45b8-a580-1668ca7c7119@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Hi,
I am trying to write a function that receives strings containing any
Unicode characters and replaces all special characters (!@#$%^&*()><?...)
with "-", but leaves literals as is.
I've tried to use a regular expression with all special chars in the ASCII
table - but it still doesn't cover everything.
I cannot use a "whitelist" (literals that are allowed) instead of a
"blacklist" (special chars NOT allowed) - because I don't know the
letters of all languages I'd like to support (English, Latin, Chinese,
Arabic...).
Any ideas what I can do?
Thanks,
Gabi.

This is my "black list" regular expression code - but it does not
always succeed...

Dim oreg_exp
Set oreg_exp = New RegExp
'oreg_exp.Pattern = "[^a-z0-9]"
' Blacklist of ASCII special characters to replace.
oreg_exp.Pattern = "([{}\(\)\^$&._%#!@=<>:;,~`'\'\*\?\/\+\|\[\\\\]|\]|\-)"
oreg_exp.IgnoreCase = True
oreg_exp.Global = True
title = oreg_exp.Replace(title, "-")
Set oreg_exp = Nothing

That's all I needed. the little "[^ \w]" - thanks a lot!!

Ohhh, no, this helps only for English chars. When I tried it with other
languages, it didn't work - their letters were removed by the regular
expression. I need to support all literals in all languages...
Gabi.

Hi,

I can't copy&paste the string that is giving me trouble into this post,
because when I do that, the problematic apostrophes are replaced with
regular special chars, and that fixes the problem.
So look at this link: http://www.grazeit.com/test_str.asp - it contains the
problematic special chars from my DB, with the failed conversion, whereas
after copying the string from the DB and pasting it back into the code,
the conversion is OK.
Thanks,
Gabi.


Hi Alex,
By saying "This is going to require an engine that does support
Unicode for a robust solution" - you mean that I'll need another
component, instead of the VB regexp, which does know how to handle all
Unicode characters?
Do you know of one? If not, any idea where I can find one?
Thanks again,
Gabi.

---------------------------------

Hi, Gabi,

After looking at the Unicode charts, I think you will be able to handle this
task with a blacklist which VBScript's regular expression engine should be
able to handle just fine.

Here is an index to the 65K unicode range of character code points hex 0000
to FFFF:
http://unicode.org/charts/nameslist/mainList.html
This page has links to HTML files that are set up to display all of the
characters in a range of code points. Of course, the only characters that
can actually be displayed are the ones for which your computer has a font
that contains the glyph for the code point. These URLs have the following
format:
http://unicode.org/charts/nameslist/c_0000.html
http://unicode.org/charts/nameslist/c_0080.html
http://unicode.org/charts/nameslist/c_0100.html
http://unicode.org/charts/nameslist/c_0180.html
http://unicode.org/charts/nameslist/c_0250.html

There also exists a similar set of URLs to PDF files that display ALL the
defined characters in the same code point ranges (your fonts are not used).
I don't have the URL of the index for this set of PDFs, but the format of
the PDF file names is similar to the HTML charts above. For example:
http://www.unicode.org/charts/PDF/U0000.pdf - Characters 0 to 7F hex
http://www.unicode.org/charts/PDF/U0080.pdf - Characters 80 to FF hex
http://www.unicode.org/charts/PDF/U0100.pdf - Characters 100 to 17F hex
http://www.unicode.org/charts/PDF/U0180.pdf - Characters 180 to 24F hex
http://www.unicode.org/charts/PDF/U0250.pdf - Characters 250 to 2AF hex
So, the HTML index gives you the info to get to the corresponding PDF file.

The Unicode standard gives a name to each of these code ranges. I'm
thinking that each of the characters you want your script to consider to be
one of the 'special' characters and that you want to convert to a space
Chr(32) character will be the same no matter what locale/language is
involved. So you should be able to set up a blacklist of the specific
Unicode code points; locale/language probably won't enter into what you want
to find or how the regular expression will work.

The scripting help file tells you how to represent individual Unicode code
points in a regular expression:
\un
Matches n, where n is a Unicode character expressed as four hexadecimal
digits. For example, \u00A9 matches the copyright symbol (©).
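
A blacklist along these lines could be sketched as follows. This is only an
illustration: the function name ReplaceBlacklisted and the particular code
point ranges are assumptions, not a list taken from this thread, and it
assumes the engine accepts \uXXXX ranges inside a character class. It
replaces with "-", as in the original request.

```vbscript
' Hypothetical helper: replaces blacklisted code points with "-".
Function ReplaceBlacklisted(ByVal text)
    Dim rx
    Set rx = New RegExp
    rx.Global = True
    ' Assumed blacklist, to be extended from the Unicode charts:
    '   U+0021..U+002F, U+003A..U+0040, U+005B..U+0060, U+007B..U+007E
    '     (ASCII punctuation and symbols, including underscore)
    '   U+2010..U+2027
    '     (dashes, curly quotes, bullets, ellipsis in General Punctuation)
    rx.Pattern = "[\u0021-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E" & _
                 "\u2010-\u2027]"
    ReplaceBlacklisted = rx.Replace(text, "-")
End Function

' ChrW(8217) is a curly apostrophe (U+2019), which falls in the last range.
WScript.Echo ReplaceBlacklisted("it" & ChrW(8217) & "s (a test)!")
```

Letters in any script are left alone because only the listed ranges are
matched; further ranges can be appended to the pattern as they are
identified.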

HTH,
-Paul Randall

