Re: Tough (for me) regex case

From: Rob Perkins (rob_perkins_at_hotmail.com)
Date: 04/13/04


Date: Tue, 13 Apr 2004 18:25:12 GMT


[x-posted to m.p.d.f because it concerns the .NET Framework's regex-er
as well...]

"Matt Garrish" <matthew.garrish@sympatico.ca> wrote:

>Does it make a little more sense now why Microsoft's implementation is
>wrong?

I'm not ready to call it "wrong", but I'm getting close. OK, so we
start with:

/(?<!")"(?!")(.*?)(?<!")"(?!")/

Removing the lookahead and lookbehind stuff (in other words, not
worrying about the paired doublequote case), I get a pattern which reads:

/"(.*?)"/

...which includes the quotes in the match, in the .NET implementation.
In Perl, the quotes get consumed before the match is constructed. But
if I do this:

/".*?"/

Then the regex matches include the quote characters, in either
implementation. So apparently in the .NET implementation there is no
semantic difference between the two smaller cases.
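
A minimal sketch of the distinction between the two smaller patterns,
using Python's re module as a stand-in (an assumption; this thread is
about Perl and .NET). The sample string and variable names are made up
for illustration. Most engines expose both the whole match (Perl's $&,
.NET's Match.Value) and capture group 1 (Perl's $1, .NET's
Match.Groups[1].Value), which may be what is behind the difference
described above:

```python
import re

text = 'say "hello" now'  # illustrative input, not from the thread

grouped = re.search(r'"(.*?)"', text)  # pattern with a capture group
plain = re.search(r'".*?"', text)      # same pattern, no capture group

# The whole match is identical for both patterns and includes the quotes.
print(grouped.group(0))  # "hello"
print(plain.group(0))    # "hello"

# Only the parenthesized pattern has a group 1, and it excludes the quotes.
print(grouped.group(1))  # hello
```

So the parentheses do not change what is matched, only what can be
pulled out of the match afterwards.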

And... now it begins to make a bit more sense. One implementor decided
there was no distinction in that difference. Another did.

It makes me wonder if this .NET implementation approach is shared by
other implementations. IOW, is the desirable (for my problem) behavior
unique to Perl 5, or is the undesirable behavior unique to .NET?

TMTOWTDI. But it represents a case which works desirably for me under
Perl, and generates a bit more work for me under the .NET Framework's
regex engine.

OK, so that leads me then to a case where this particular regex fails,
even in the Perl implementation. Consider the case of:

The "quick" brown "fox jumped ""over""" the lazy dog.

The desirable matches are:

quick
fox jumped ""over""

but this regex returns only

quick

If I stick whitespace between the second and third quote after "over"
then it returns:

quick
fox jumped ""over""<space>
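
The two results above can be reproduced with a short sketch. It uses
Python's re module as a stand-in (an assumption; the lookaround syntax
used here is the same as in Perl), and the variable names are
illustrative only:

```python
import re

# The full pattern from the top of this post.
pattern = re.compile(r'(?<!")"(?!")(.*?)(?<!")"(?!")')

original = 'The "quick" brown "fox jumped ""over""" the lazy dog.'
# Same sentence with a space between the second and third quote after "over".
spaced = 'The "quick" brown "fox jumped ""over"" " the lazy dog.'

print(pattern.findall(original))  # ['quick']
print(pattern.findall(spaced))    # ['quick', 'fox jumped ""over"" ']
```

In the original string every quote after "over" is adjacent to another
quote, so none of them can satisfy the lookarounds on the closing
quote, and the second match never completes.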

Again, the plain-English description is "all text between a pair of
doublequote characters, except that paired doublequotes inside a
quoted string are part of the match."

What do you think the regex will be?
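
One candidate, offered as a sketch rather than a definitive answer:
treat the body of a quoted string as any run of non-quote characters
or doubled quotes. The pattern below is hypothetical (it does not come
from this thread) and is shown only in Python's re module, though it
uses nothing beyond a character class, alternation and a non-capturing
group:

```python
import re

# Hypothetical candidate: an opening quote, then any run of non-quote
# characters or doubled quotes (captured), then a closing quote.
candidate = re.compile(r'"((?:[^"]|"")*)"')

text = 'The "quick" brown "fox jumped ""over""" the lazy dog.'
print(candidate.findall(text))  # ['quick', 'fox jumped ""over""']
```

This drops the lookarounds entirely: a doubled quote is consumed as
part of the body, so the first unpaired quote is the closing one.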

Rob


