Re: parsing VB code with a regex



Hi Emby,

Thank you, that was very helpful.

We know that the rules of VB dictate that a comment must be on a single
line, and that it begins with a single quote that is not inside a
double-quoted string. That is, if you wish to comment across multiple lines,
you must put a comment marker on each line, essentially creating a separate
comment on each line. Any characters to the right of the comment marker are
commented out of the code. We also know that the comment may begin at any
point in the line, not necessarily at the beginning.

So, now we're down to a single string and a few simple rules:

1. "token" is defined as any character data enclosed by curly brackets.
2. Any token inside a VB comment should be ignored.
3. Any token inside a matching pair of double-quotes should be ignored.
4. All other tokens should be matched.

And I came up with this:

(?m)(?<=^[^']*)'[^\n]+$|(?:"[^"]*"|({[^}]+}))

Let me explain a bit. This regular expression takes advantage of a
characteristic of regular expressions: a regular expression consumes the
string as it matches. That is, the search moves through the string in a
"forward-only" manner: each new match attempt starts where the previous
match ended. (Backtracking happens inside a single match attempt, and
lookarounds test text without consuming it.) So, if a portion of a string is
matched by one alternative, it is not available for further matching.

So, I worked backwards from matching tokens to the 2 exceptions where they
should *not* be matched. The token-matching regular expression is simple:

{[^}]+}

Translated, this says a match is a '{' character, followed by one or more
characters that are *not* '}', followed by a '}' character. Simple enough. It
matches every token in the string. Now we want to weed out the
non-qualifying tokens. Since the comment is the one that always weeds
everything out, I left that for last in this explanation, even though it
comes first in the finished expression. You'll see why in a minute.
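
As a quick check of the token rule on its own, here is a minimal sketch. It is written in Python as an assumption for illustration (the thread itself is about .NET regular expressions), and the name TOKEN and the sample line are made up for the example. The braces are escaped, which does not change the meaning of the pattern.

```python
import re

# Token rule only: '{', one or more characters that are not '}', then '}'.
TOKEN = re.compile(r"\{[^}]+\}")

# Hypothetical sample line with a quoted token and a commented token.
line = 'If {Area} = "{AreaString}" Then Return {Height} \' {Unused}'

# With no other rules in play, every token is found, including the
# quoted and commented ones that the later rules have to weed out.
tokens = TOKEN.findall(line)
print(tokens)
```

Used alone, the rule finds all four tokens, which is why the other 2 rules are needed.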

The rule for quoted tokens is expressed as follows:

"[^"]*"

It is similar to the first: a double-quote, followed by any number of
characters that are *not* a double-quote, followed by a double-quote.

Now, how do we get these 2 working together? We use the OR operator - '|'.
When we OR these together, we get this:

"[^"]*"|{[^}]+}

This seems to expand the number of matches, since matches are now *added*
that are not tokens. Here's where the "consuming" aspect comes in. A match
of the first alternative swallows the whole quoted string, including any
token inside the double-quote pair, so that token can no longer be matched
on its own. So, the only real problem here is telling the 2 kinds of match
apart. So, we use a group (of course!).

"[^"]*"|({[^}]+})

At this point, every match is either a whole quoted string or a token
outside of one. The only ones that we want are the ones that fill "group 1"
(the only capturing group in the regular expression), which is empty when a
quoted string matched. So, by using that group, we eliminate the tokens
inside the double-quote pairs.
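
To see the group doing the separating, here is a small sketch in Python (my choice for illustration, not the .NET code the thread is aimed at). The names PATTERN, matches and wanted are hypothetical; the sample line is adapted from example (2) in the quoted messages.

```python
import re

# Quoted strings are tried first and consumed whole; only a token that is
# matched outside of quotes ends up in capturing group 1.
PATTERN = re.compile(r'"[^"]*"|(\{[^}]+\})')

line = 'If {Area} = "{AreaString}" Then Return "Found it!"'

# Every match, whichever alternative produced it.
matches = [m.group(0) for m in PATTERN.finditer(line)]

# Only the matches where group 1 took part, i.e. the unquoted tokens.
wanted = [m.group(1) for m in PATTERN.finditer(line) if m.group(1) is not None]

print(matches)
print(wanted)
```

The quoted strings still show up as matches, but group 1 is only filled for {Area}.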

We have one last hurdle now. We want to eliminate anything inside a comment.
I left this for last because the comment eliminates *everything* inside it,
including the double-quote pairs, and thus consumes the most of the 3 rules.
This will make the regular expression more efficient, as it has less work to
do with each match.

The rule for comments, again, is a bit more complicated:

(?m)(?<=^[^']*)'[^\n]+$

First, it must limit a comment to a single line. This is done with the '^'
(start of string/line) and '$' (end of string/line) characters. I also used
the "(?m)" directive, which indicates that the '^' and '$' characters match
at the start and end of each line, not only of the whole string.

So, it begins with a positive look-behind: (?<=^[^']*) which means "the
following is *only* a match if preceded by this regular expression". The
look-behind contains the start-of-line anchor '^', and a character class
which indicates 0 or more non-single-quotes. (An unbounded look-behind like
this is supported by the .NET regular expression engine, but not by every
other engine.) The condition applies to the rest of the regular expression
(without the condition matching - lookarounds do not consume) - a
single-quote, followed by 1 or more non-line-break characters, followed by a
line break or the end of the string.
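
Here is a rough sketch of the line-limiting part of the comment rule, in Python for illustration only. Note the assumption: Python's re module only accepts fixed-width look-behind, so the look-behind is left out and this shows just the single-quote, the run to the end of the line, and the effect of "(?m)". The names COMMENT and code are hypothetical.

```python
import re

# Comment rule without the look-behind. (?m) lets '$' match at the end
# of every line, so a comment never runs on into the next line.
COMMENT = re.compile(r"(?m)'[^\n]+$")

# Hypothetical three-line snippet: a trailing comment, a line with no
# comment, and a whole-line comment.
code = "x = {A} ' first {B}\ny = {C}\n' whole line {D}"

comments = COMMENT.findall(code)
print(comments)
```

Each comment stops at its own line end, and the middle line is left alone.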

This covers comments which begin in the middle of a line as well as at the
beginning. The lookbehind prevents the characters preceding the single-quote
from being consumed, thereby making them available for the other 2
conditions. I finished up by grouping the other 2 regular expressions
into a single non-capturing group - (?:"[^"]*"|({[^}]+})), making them a
single alternative to the first, and ORing them all together.

In essence, it says, "Match the first (comment) group first. With what is
left over, match either the quoted strings, or the left-over tokens, and put
the left-over tokens into a group." You can do a regular expression match,
and use the values in Group 1 to do your replacements.
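
The replacement step could look something like the sketch below. It is in Python rather than VB, purely as an assumption for illustration, and it is a variant of the expression rather than the expression itself: Python's re module does not allow the variable-length look-behind, so it is dropped and the sketch relies only on the left-to-right consuming described above. The names PATTERN, TRANSLATIONS and translate are hypothetical; the token values come from Emby's example.

```python
import re

# comment | quoted string | (token). The look-behind from the .NET
# expression is omitted because Python's re needs fixed-width look-behind.
PATTERN = re.compile(r"""(?m)'[^\n]+$|"[^"]*"|(\{[^}]+\})""")

# Translation table, using the values from the example in the thread.
TRANSLATIONS = {
    "{AREA}": "7075",
    "{HEIGHT}": "2512",
    "{WIDTH}": "75",
    "{FOO}": '"Yes"',
}

def translate(snippet, translations):
    """Replace tokens that are outside comments and quoted strings."""
    def replace(match):
        token = match.group(1)
        if token is None:
            # A comment or quoted string matched: put it back unchanged.
            return match.group(0)
        # Unknown tokens are left as they are (an assumption of this sketch).
        return translations.get(token, token)
    return PATTERN.sub(replace, snippet)

snippet = "\n".join([
    'If {AREA}>1000 Then',
    'Return "Large"',
    "' {AREA} token not used",
    'ElseIf {HEIGHT}>2500 Then',
    'Return " \' " & {FOO}',
    'Return " is {FOO} !"',
])

result = translate(snippet, TRANSLATIONS)
print(result)
```

The callback returns comments and quoted strings exactly as they were matched, so only the Group 1 tokens change.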

I tested it fairly thoroughly. Let me know if it works for you.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.



"Emby" <emby@xxxxxxxxxxxxxxxxxx> wrote in message
news:e3wUNbjiGHA.3496@xxxxxxxxxxxxxxxxxxxxxxx
Hi Kevin,

You are indeed correct. The original question was a more general, "how can
this be done with RE's?"

To be specific, I will have a set of strings, which I will call tokens, each
of which will consist of upper case alpha-numeric characters in curly
brackets. I also have another set of strings which is the translated value
of these tokens. So if I have 5 tokens, I will have 5 translated values, one
for each token.

I will also have a code snippet - a potentially multi-line string - which
will contain embedded tokens. My task is to replace the tokens in the code
snippet string with their translated values. The snippet is a VB code
string. But:

1) any token in a code line to the right of a single quote character which
is not itself in a quoted string should not be replaced
2) any token that is within a quoted string should not be translated

Sorry, but I'm giving examples coz I'm not sure I've described it well or
completely :-)

Known Tokens    Translation
{AREA}          7075
{HEIGHT}        2512
{WIDTH}         75
{FOO}           "Yes"

Snippet                         Translated code
If {AREA}>1000 Then             If 7075>1000 Then
Return "Large"                  Return "Large"

' {AREA} token not used         ' {AREA} token not used
ElseIf {HEIGHT}>2500 Then       ElseIf 2512>2500 Then
Return "Tall"                   Return "Tall"

ElseIf {WIDTH} >50 Then         ElseIf 75 >50 Then
Return " ' " & {FOO}            Return " ' " & "Yes"

Else                            Else
Return " is {FOO} !"            Return " is {FOO} !"

End If                          End If


Our system compiles the resulting snippet on the fly and executes it to
provide the app with a scripting capability.

Thanks for any help you can extend.


"Kevin Spencer" <kevin@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:uwTheiYiGHA.3588@xxxxxxxxxxxxxxxxxxxxxxx
You need to define your rules more exactly. Examples are not specific
enough. For example, you wrote:

I must create a routine that finds tokens in small, arbitrary VB code
snippets. For example, it might have to find all occurrences of
{Formula}

This leaves a lot of room for interpretation, something which computers are
extremely poor at, and humans not much better. The word "token" simply means
a series of characters without spaces between them. "{Formula}" as an
example, without any rules, implies nothing. I cannot, for example, assume
that by this example, your tokens will or must always have curly brackets
around them. It does not necessarily imply whether or not spaces may appear
between the curly brackets (if required) and the characters inside them.

Your example shows:

(1) should find {Area} (both occurrences) and {Height} in this string
If {Area} > 100 Then Return {Area} Else Return {Height}

Again, are the characters always supposed to be surrounded by curly
brackets? Should they have curly brackets at all, or are you just using them
to "highlight" what you are talking about?

(2) should find {Area}, but not {AreaString} in this string
If {Area} = "{AreaString}" 100 Then Return "Found it!"

What should it match in the following example?
If {Area} = "{Area String}" 100 Then Return "Found it!"

How about this one?

If {Area} = {"Area" "String") Then Return "Found it!"

(3) should find {Height}, but not {Area} in this multi-line string
'the {Area} token is not used here
If {Height} > 1000 Then
Return "Tall"
Else
Return "Short"
End If

Does this mean that it should ignore commented lines?

The first step to writing a regular expression is to define the rules that
comprise the pattern to match. If you can define these rules without any
examples (that is, if the rules are exactly defined, no examples will be
needed), I can write you a regular expression.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Development Numbskull

Nyuck nyuck nyuck


"Mark" <emby@xxxxxxxxxxxxxxxxxx> wrote in message
news:%23Pyl57QiGHA.4080@xxxxxxxxxxxxxxxxxxxxxxx
I must create a routine that finds tokens in small, arbitrary VB code
snippets. For example, it might have to find all occurrences of
{Formula}

I was thinking that using regular expressions might be a neat way to solve
this, but I am new to them. Can anyone give me a hint here?

The catch is, it must only find tokens that are not quoted and not
commented; examples follow

(1) should find {Area} (both occurrences) and {Height} in this string
If {Area} > 100 Then Return {Area} Else Return {Height}

(2) should find {Area}, but not {AreaString} in this string
If {Area} = "{AreaString}" 100 Then Return "Found it!"

(3) should find {Height}, but not {Area} in this multi-line string
'the {Area} token is not used here
If {Height} > 1000 Then
Return "Tall"
Else
Return "Short"
End If

I've searched many web sites and libraries, but they all seem to be
interested in finding quoted strings, not avoiding them. I'd appreciate
any help.

Emby





