Re: Regular Expression, to use or not to use...

Tech-Archive recommends: Repair Windows Errors & Optimize Windows Performance

From: Tom (junkmale48_at_hotmail.com)
Date: 08/22/04


Date: 22 Aug 2004 09:53:45 -0700


"Niki Estner" <niki.estner@cube.net> wrote in message news:<e7jxrrwhEHA.3612@TK2MSFTNGP12.phx.gbl>...
> "Tom" <junkmale48@hotmail.com> wrote in
> news:63d1c9f0.0408201000.6324ef03@posting.google.com...
> > I have struggled with the issue of whether or not to use Regular
> > Expressions for a long time now, and after implementing many text
> > manipulating solutions both ways, I've found that writing specialized
> > code instead of an RE is almost always the better solution. Here is
> > why....
>
> > RE's are complex.
>
> Yesterday somebody asked a string matching question of the C# ng. I've
> copied two of the answers here:
> The C# version:
>
> // The start Position
> int Start = InputName.IndexOf(":")+1;
> // The end position
> int End = InputName.IndexOf(" T");
> // The output
> string OutputName = InputName.Substring(Start,End - Start);
> // Clear off leading / trailing spaces
> OutputName = OutputName.Trim();
>
>
> The Regex version:
> Match m = Regex.Match(inputString, "From: (.*) To:");
> if (m.Success)
> OutputName = m.Groups[1];
>
> I may be a bad C# programmer, but if I read the former piece of source code
> I wouldn't have a clue of what it could be doing. Although every line's
> commented, and the code is rather simple.
> The RE version on the other hand seems pretty self-explaining to me.
>
> > Plus it is not like writing this "one line" of code is saving you any
> time.
>
> It does.
> Look at the above code snippets and think about how much time the poor
> plain-C#-guy will spend debugging, trying to figure out why people like
> "James T. Kirk" don't get their email, and why his program sometimes simply
> crashes...

Okay there are various problems with this example.

First of all this is one of the extremely trival examples that I might
actually use a Regular expression for. Consider a RE like this

^(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26
])|(?:(?:16|[2468][048]|[3579][26])00)))(\/|-|\.)(?:0?2\1(?:
29))$)|(?:(?:1[6-9]|[2-9]\d)?\d{2})(\/|-|\.)(?:(?:(?:0?[1357
8]|1[02])\2(?:31))|(?:(?:0?[1,3-9]|1[0-2])\2(29|30))|(?:(?:0
?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))$

And tell me what it does? Better yet it doesn't work 1% of the time
tell me why. This is an actual RE btw. You have to imagine the person
that made this didn't type it up and have it work in one shot, in a
couple of seconds. Sure you can use a 3rd party tool to decipher this
and costs extra money and that nobody else probably has or is familiar
with but why? So I have something that will search the string 10x
slower?

Next the C# code is incorrect. It should really be this.

   // The start Position
    int Start = InputName.IndexOf("From:")+"From:".Length;
    // The end position
    int End = InputName.IndexOf("To:",Start);
    // The output
    string OutputName = InputName.Substring(Start,End - Start);
    // Clear off leading / trailing spaces
    OutputName = OutputName.Trim();

What's hard to understand about these 3 lines of code? The methods are
clearly named and better yet if you really don't understand it step
through it and see. Which brings me back to my point at least if that
code wasn't working I could easly find out why.

Next your RE is broken too. Consider the following input.
From:John To:Mike
From: JohnTo:Mike
From:John To:Mike
From: To:Bob
From:To:Bob
From: John To:Mike
From:John To:Mike
From: John To:Mike

Your RE will only match 3 of these strings. Which ones? Sounds like
you need to do some debugging also? I was able to easily debug your C#
code.

for the sake of argument lets correct that too.

 The Regex version:
    Match m = Regex.Match(inputString, "From:(\w*)(.*)(\w*)To:");
    if (m.Success)
      OutputName = m.Groups[1];

Even this simple RE gets a little more complex and Cryptic.

Okay but there is another thing over looked here.

For the input string From:John To:Mike
 The C# code will return the string

"John"

and the RE will return
"From:John To:"

Looks like I need to run my C# code anyway?! Now I have done 2 string
operations and the 2 lines has turned into about six. This is the
extra processing that I refer to with RE's. RE's usually require 2
steps a match then some extra processing on the match. Sure you can do
this with the whole prefix postfix syntax, but again this "simple" RE
is getting more and more complex.

Also neither of these implementation take into consideration really
poorly formated input. With the C# since you are deep in the proceess
you could throw errors like "These is no 'To' after 'For' on line 3",
instead of just missing Matches.

Let's Chalk this up to a bad example and move on.

>
> > I have
> > had really complex RE's that took me hours to write that I needed
> > external tools to help me with. Sure it was only one line of code in
> > the end, but I could have written a page of easy to follow code in
> > half the time. Which brings me to my next point.
>
> Interesting. Could you post such a RE, and the plain code?

Initally I ran into this while building an RE to extract all the links
out of an html page. Because of NDAs and Copyrights I don't feel
comfortable posting the code. But this should be an easy one right? I
know my RE code to do it about a Page, mostly because of having the
post process results. Turns out to be about the same amount of code as
the C# string operations actually, except C# code runs about 10x
faster, because it doesn't have to do any auxilary copying or post
processing. It finds exatacally what it needs at extracts it and never
goes over a single byte in the string twice.

>
> > RE's are not debug-able. If I have a page of well written code I (or
> > anyone else) can easily step through it.
>
> Putting it in a RegEx testing program like Expresso and removing parts of it
> usually does a similar job.

Again now I need a 3rd party tool. I have a good debugging tool for C#
why not use that?

>
> > When I send my 150 char RE
> > into the RE engine it is a black box. I am just left to wonder why it
> > didn't work.
>
> On the other hand, you usually can't copy C# code to some other program and
> simply run it there, because it is coupled with other parts of the project.

It's true I can't copy in a program PERL or something but who cares.
As far as moving C# code around it's no problem why would a string
operation be coupled to anything. The RE code is C# code in the end
too. This is kind of a Mute point.
 
> > ...
> > It's difficult to do some things in RE. RE's work great for some
> > search but not so well for others.
>
> That applies to pretty much any tool I can think of...

Yea thats true, like I said I still use them occasionally, as a hack,
when I don't care about performance and it actually saves me some
code, and I am not really scared what happens when my "untested"
silently fails to match.
 
> > ...
> > Finally I don't know what you guy are saying about RE ever being
> > faster ever. In my experience RE's are slow, very slow. Like on the
> > order of 10 times slower then straight forward string parsing code.
>
> Depends on what you're doing, and how you're doing it. If you are searching
> for a long pattern like "Thomas Jefferson" in a long string (> 100
> characters), the RE is about 10 times faster than a culture-invariant
> IndexOf.

Wrong, I tried that exact example. Please show me where you get this
data.

                        string input="ndvskjdfsnkvjfdsdsfjvkdfsnklvjndfsjkvldfsnjvndfsnjvklfdjslvdfsnklvdfskvdfs
Thomas Jefferson mvkdskl;dfskl;vmdfskvlmfdskldfsmkl;vmfdsmkldfskl;vm";
                        
                {
                        DateTime start=DateTime.Now;
                        Regex rex=new Regex("Thomas Jefferson");
                        for(int i=0;i<1000000;i++)
                        {
                                Match m=rex.Match(input);
                        }
                        DateTime end=DateTime.Now;
                
                        TimeSpan ts=end-start;
                        Console.WriteLine(ts.TotalMilliseconds.ToString());
                }

                {
                        DateTime start=DateTime.Now;
                        for(int i=0;i<1000000;i++)
                        {
                                int imp=input.IndexOf("Thomas Jefferson");
                        }
                        DateTime end=DateTime.Now;
                
                        TimeSpan ts=end-start;
                        Console.WriteLine(ts.TotalMilliseconds.ToString());
                }

The RE took over 2,000ms where the indexof branch took only 300 ms.
This is about 5X slower. If you could show me some code where the RE
was actually FASTER I would love to see it. Or better yet how to fix
this example made I am just missing something.

But again the time doesn't ever usualy come from the Match itself but
what you have to do to the Match like in the From:To: example. The
extra time it take to copy an intermediate string and process that
overshadows all of this. A custom taylored algorythim will just never
do this.

Also this is an extremly simple re, no |'s or complex expressions.
Also most of the time in complex examples I don't use indexof but
rather walk the string char by char and mantain a state machine. Yes
sometimes this is more complex, but this complexity usually is
proportional to the complexity of the RE, and again I can debug it.

So again I'm looking at harder to write and debug and about 5x slower.
 
> > For me this is the final kicker. Originally I went through all the
> > trouble off using RE's in all my complex text paring code because I
> > thought it would faster. This could be a result of .NET engine,
> > regardless I was really disappointed by this causing me to rollback a
> > lot of my solutions.
>
> If it's in the 10% of the code that use 90% of the time, you should of
> course choose the fastest code you can get.

Agreed, this is where usings RE's is just not an option.

 
> > With all this said I still use REs sometimes, but only for really
> > simple string operations where I can come up with the expression in a
> > couple of seconds I don't care about performance and don't feel like
> > writing 5-10 lines of code.
>
> So what?
>
> > To me it is kind of like a quick hack.
>
> A quick hack with full error handling, that's usually much more stable and
> robust that those untested "5-10 lines of code"...

It's a Hack because it DOESN'T have full error handleing and is dog
slow. When it fails, sure it doens't crash, (neither does IndexOf)it
just doesn't return a Match. Which leaves the RE "untested" and in
need of debuging also. If I am concerned about my parsing code
crashing I can wrap it in a try catch block, and just let it pass
though, so it works like the RE failing, but it might actully be
helpfull for it to crash in a debugger so I can see what is going
wrong instead of just guessing.

Again maybe this is just Microsoft's implementation of RE's. If the
the RE's were really fast. This would start to make sense to me. I
wonder if anyone has information on this.

Tom

 
> Niki



Relevant Pages

  • Re: Small confusion about negative lookbehind
    ... > My candidate string is "ab". ... > The expressions I'm testing this string against are the following, ... but the position between characters. ... Regular expressions describe not only strings, ...
    (comp.lang.java.programmer)
  • Re: Why not FP for Money?
    ... >> conversion of binary floats to decimal floats, and the string looks ... >> out of place in numeric expressions. ... > that using 'd' is a compromise to having no way to write ... Carlos Ribeiro ...
    (comp.lang.python)
  • Re: Regular Expression, to use or not to use...
    ... > Expressions for a long time now, ... > external tools to help me with. ... > order of 10 times slower then straight forward string parsing code. ... This could be a result of .NET engine, ...
    (microsoft.public.dotnet.general)
  • RE: speed up string matching
    ... > I need to match an expression and its reverse to a very long string. ... you'd have to merge your expressions somehow - the easiest ... So in order to match a very long string with multiple expressions simultaneously and faster than the matching procedure I have described above I need multiple computers? ...
    (perl.beginners)
  • Is there anywhere I can trap string creation & destruction?
    ... long string operations, is this easy or a hack? ... including some extra data below the string block: ... The extra data might be, for example, a CS or local string pool object. ...
    (borland.public.delphi.language.objectpascal)