Re: regular expression question



On Sat, 25 Mar 2006 09:46:08 -0500, "Kevin Spencer"
<kevin@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Hi Ludwig,

It is not possible to answer your question as you've stated it. Here's why:

I'm using the regular expression \b\w to find the beginning of a word
in my C# application. If the word is 'public', for example, it works.
However, if the word is '<public', it does not work: it seems that <
is not a valid character, so the beginning of the word starts at
the letter 'p' instead of '<'.

You have not defined your terms. You use the word "word," but you have not
defined what that is supposed to mean in your situation. In regular
expressions, there are no words, only characters. The "\w" character class
indicates a word *character*. A word character is defined in regular
expressions as a character that is a letter of the alphabet, a digit, or an
underscore.

So, the character '<' is not defined in regular expressions as a word
character, and therefore is not identified as belonging to the set defined
by your rule.
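
To make the point above concrete, here is a minimal sketch of how \b\w behaves on the two inputs from the question. It uses Python's re module purely for illustration (the original application is C#); the helper name first_word_char is made up for this example.

```python
import re

# \b\w: a word character that sits directly after a word boundary.
FIRST_CHAR = re.compile(r"\b\w")

def first_word_char(text):
    """Return (index, character) of the first \\b\\w match, or None."""
    m = FIRST_CHAR.search(text)
    return (m.start(), m.group()) if m else None

print(first_word_char("public"))   # (0, 'p')
print(first_word_char("<public"))  # (1, 'p'): '<' is skipped, it is not a word character
```

There is no boundary in front of '<' because neither side of that position is a word character, so the first match starts at 'p'.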

However, while you have stated that you *do* want to identify the character
'<' as the "beginning of a word," you have not stated exactly what the rule
is, only a small part of it. For example, by what you've told me, the
following character sequences could all be "words" -

Hello Ludwig ('H', 'L') The first letters of each word are identified.

Hello, <Ludwig> ('H', '<') The first letter of "Hello" and the beginning '<'
are identified.

Hello, !!!!!!! ('H', '!') The first letter of "Hello" and the beginning '!'
are identified. This is possible because you have not stated what characters
you do *not* consider to be the beginnings of words.

And so on. In other words, a regular expression is shorthand for a rule that
defines a pattern. You need to explicitly define what the rule is in order
for me to create a regular expression that satisfies that rule.

Thanks for the explanation, Kevin!

Well, I'm working on an editor control that supports syntax
highlighting. I have a list of words that should be highlighted when
typed in the editor, for example 'public', 'class', etc.

So at a given time, the user types in the word public, and when the
character 'c' is typed, the word 'public' is colored in blue, for
example.

At the moment, I use the pattern '\b\w' to identify the first
character of the 'word' in the editor, and I use '\w\b' to identify
the last character of a word. This works.
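
A rough sketch of that approach, written in Python only as an illustration (the editor itself is C#, and the function names are invented here): \b\w gives the index of each word's first character and \w\b the index of its last, and pairing them up recovers the words.

```python
import re

def word_edges(text):
    """Pair up the first and last index of every run of word characters."""
    starts = [m.start() for m in re.finditer(r"\b\w", text)]
    ends = [m.start() for m in re.finditer(r"\w\b", text)]
    # Every run of word characters has exactly one start and one end,
    # so the two lists line up one to one.
    return list(zip(starts, ends))

def words(text):
    """Return the substrings delimited by word_edges."""
    return [text[s:e + 1] for s, e in word_edges(text)]

print(words("public class Test()"))  # ['public', 'class', 'Test']
print(words("<sometagname>"))        # ['sometagname']
```

Note that the parentheses and the angle brackets never show up in the result, which is exactly the limitation described next.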

However, there are also XML tags that need to be highlighted. For
example, <sometagname>: if the user types the '<', it should be
colored; if he then types the last 'e' of 'sometagname', the word
'sometagname' should be colored; if he then types '>', that too should
be colored.

So in fact, each word or character that I define in the list of words,
should be colored.

This list of words can be (for example): public, class, int, long,
byte, byte[], <, >, sometagname, generic<>, etc....

So, if I try to define the rule:
- spaces always define the beginning and end of a word:
public class Test() -> I need to identify the public, class, Test()
- there are characters that are not separated by spaces but that also
have to be found when typed:
<?xml version="1.0" encoding="utf-8" ?> -> I need to identify the <,
?, xml, version, encoding in order to highlight these in various
colors.
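
One possible way to turn this rule into a pattern, sketched in Python for illustration only (the editor is C#, and this is an assumption about what would fit, not a confirmed solution): build a single alternation from the highlight list itself instead of relying on \b\w. The list below is hypothetical, assembled from the examples in this post, and build_pattern and find_tokens are made-up names.

```python
import re

# Hypothetical highlight list, assembled from the examples in this post.
HIGHLIGHT = ["public", "class", "int", "long", "byte", "byte[]",
             "<", ">", "?", "xml", "version", "encoding",
             "sometagname", "generic<>"]

def build_pattern(entries):
    """Compile one alternation that matches any entry of the list."""
    parts = []
    # Longest entries first, so "byte[]" is tried before "byte".
    for entry in sorted(entries, key=len, reverse=True):
        part = re.escape(entry)
        # Guard word-like edges so "int" does not match inside "print".
        if entry[0].isalnum() or entry[0] == "_":
            part = r"(?<!\w)" + part
        if entry[-1].isalnum() or entry[-1] == "_":
            part = part + r"(?!\w)"
        parts.append(part)
    return re.compile("|".join(parts))

PATTERN = build_pattern(HIGHLIGHT)

def find_tokens(text):
    """Return (start index, token) for every highlightable token in text."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(text)]

line = '<?xml version="1.0" encoding="utf-8" ?>'
print([token for _, token in find_tokens(line)])
# ['<', '?', 'xml', 'version', 'encoding', '?', '>']
```

Symbols such as '<' and '?' get no boundary guard, so they are found wherever they occur, while the word-like entries only match as whole words. Anything that is not in the list, such as Test(), is simply not reported by this sketch.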

I hope that you understand what I'm trying to do here...

Kind regards,
Ludwig