Re: Regular Expression, to use or not to use...
From: Niki Estner (niki.estner_at_cube.net)
Date: 08/22/04
- In reply to: Tom: "Re: Regular Expression, to use or not to use..."
Date: Sun, 22 Aug 2004 21:30:18 +0200
"Tom" <junkmale48@hotmail.com> wrote >
> ...
> First of all, this is one of the extremely trivial examples that I might
> actually use a regular expression for. Consider an RE like this:
>
> ^(?:(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26
> ])|(?:(?:16|[2468][048]|[3579][26])00)))(\/|-|\.)(?:0?2\1(?:
> 29))$)|(?:(?:1[6-9]|[2-9]\d)?\d{2})(\/|-|\.)(?:(?:(?:0?[1357
> 8]|1[02])\2(?:31))|(?:(?:0?[1,3-9]|1[0-2])\2(29|30))|(?:(?:0
> ?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8]))$
>
> And tell me what it does. Better yet, it doesn't work 1% of the time;
> tell me why. This is an actual RE, btw. ...
I didn't take a closer look at it; at a wild guess, it does some kind of
numeric range check. Using regular expressions for that problem doesn't look
like a smart decision.
If that's the kind of problem you're thinking of, I agree with you.
Conventional string parsing would be the better choice here.
(Books about REs usually contain samples like these just to show what is
*possible* with them.)
> Sure, you can use a 3rd-party tool to decipher this, one that costs
> extra money and that probably nobody else has or is familiar with,
> but why?
I use Expresso and Regulator.
Both are free and quite common.
> Next the C# code is incorrect. It should really be this.
>
> // The start Position
> int Start = InputName.IndexOf("From:")+"From:".Length;
> // The end position
> int End = InputName.IndexOf("To:",Start);
> // The output
> string OutputName = InputName.Substring(Start,End - Start);
> // Clear off leading / trailing spaces
> OutputName = OutputName.Trim();
>
> What's hard to understand about these 3 lines of code?
Counting them, apparently ;-)
> The methods are
> clearly named and better yet if you really don't understand it step
> through it and see. Which brings me back to my point at least if that
> code wasn't working I could easily find out why.
I can only guess you've never spent hours debugging an error that's not
reproducible? The code above will simply crash if the input is malformed.
Suppose the input comes in on some kind of server application: it'll
run fine in all your test environments, while your customer complains
that it keeps crashing twice a day...
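To make the failure mode concrete, here is a minimal sketch of the same IndexOf approach that reports malformed input instead of throwing. The class and method names (FromToParser, TryExtractSender) are made up for illustration, and it assumes the "From:" and "To:" tags are matched exactly as written:

```csharp
using System;

public static class FromToParser
{
    // Hypothetical helper, not part of the code quoted above.
    // Returns false instead of throwing when "From:" or "To:" is missing.
    public static bool TryExtractSender(string input, out string sender)
    {
        sender = null;
        if (input == null) return false;

        const string fromTag = "From:";
        const string toTag = "To:";

        // Without this check, a missing "From:" gives IndexOf == -1 and
        // the start position silently becomes 4.
        int fromIndex = input.IndexOf(fromTag, StringComparison.Ordinal);
        if (fromIndex < 0) return false;

        int start = fromIndex + fromTag.Length;

        // Without this check, a missing "To:" gives a negative length and
        // Substring throws ArgumentOutOfRangeException.
        int end = input.IndexOf(toTag, start, StringComparison.Ordinal);
        if (end < 0) return false;

        sender = input.Substring(start, end - start).Trim();
        return true;
    }
}
```

The two extra checks are exactly the cases where the quoted version would blow up; anything beyond that (line numbers, error messages) would need still more code.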
> Next your RE is broken too. Consider the following input.
> From:John To:Mike
> From: JohnTo:Mike
> From:John To:Mike
> From: To:Bob
> From:To:Bob
> From: John To:Mike
> From:John To:Mike
> From: John To:Mike
>
> Your RE will only match 3 of these strings. Which ones? Sounds like
> you need to do some debugging also? I was able to easily debug your C#
> code.
Now *you* are guessing: You don't know what the strings are supposed to look
like. I wouldn't be surprised if the spaces were actually required.
> ...
> Okay, but there is another thing overlooked here.
>
> For the input string From:John To:Mike
> The C# code will return the string
>
> "John"
>
> and the RE will return
> "From:John To:"
Use grouping parentheses.
That's not an argument; it's ignorance.
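A minimal sketch of what that looks like, assuming a pattern along these lines (the RE from earlier in the thread isn't quoted here, so both the pattern and the FromToRegex / ExtractSender names are only illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

public static class FromToRegex
{
    // Assumed pattern: the parentheses capture only the text between
    // "From:" and "To:", with surrounding whitespace left outside the group.
    private static readonly Regex Pattern = new Regex(@"From:\s*(.*?)\s*To:");

    // Returns the captured name, or null if the input does not match.
    public static string ExtractSender(string input)
    {
        Match m = Pattern.Match(input);
        // m.Value would be the whole match ("From:John To:");
        // m.Groups[1].Value is just the captured part.
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

No intermediate string has to be copied and trimmed afterwards; the group already is the result.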
> ...
> Also neither of these implementations takes really poorly formatted
> input into consideration. With the C#, since you are deep in the process,
> you could throw errors like "There is no 'To' after 'From' on line 3",
> instead of just missing Matches.
You could, which would turn those 4 lines into a few hundred, if you really
considered every possible error.
However, if we're comparing apples with apples, the C# version simply
crashes if the input is malformed.
> Let's Chalk this up to a bad example and move on.
It's a common everyday example...
Just to extend it a little further: suppose that string matching code
had to be localized... Say, to a right-to-left reading language...
> ...[no counterexample]...
> >
> > > RE's are not debug-able. If I have a page of well written code I (or
> > > anyone else) can easily step through it.
> >
> > Putting it in a RegEx testing program like Expresso and removing parts
> > of it usually does a similar job.
>
> Again now I need a 3rd party tool. I have a good debugging tool for C#
> why not use that?
Do you use a bitmap editor?
A dialog editor?
A zip or other compression tool?
A tooth-brush???
Or do you do all these tasks with your debugger?
I don't see your point...
Different tools for different purposes.
> ...
> It's true I can't copy it into a Perl program or something, but who cares?
> As far as moving C# code around, it's no problem; why would a string
> operation be coupled to anything? The RE code is C# code in the end
> too. This is kind of a moot point.
"Reusability"; ever heard the word before?
To give you a clearer example:
private With ParseWithStatement(){
  Context withCtx = this.currentToken.Clone();
  AST obj = null;
  AST block = null;
  this.blockType.Add(BlockType.Block);
  try{
    GetNextToken();
    if (JSToken.LeftParen != this.currentToken.token)
      ReportError(JSError.NoLeftParen);
    GetNextToken();
    this.noSkipTokenSet.Add(NoSkipTokenSet.s_BlockConditionNoSkipTokenSet);
    try{
      obj = ParseExpression();
      ...
Do you think you could simply copy that kind of code somewhere and debug it,
or reuse it? (BTW: This is real string parsing code)
> > > ...
> > > Finally, I don't know what you guys are saying about REs ever being
> > > faster. In my experience REs are slow, very slow. Like on the
> > > order of 10 times slower than straightforward string parsing code.
> >
> > Depends on what you're doing, and how you're doing it. If you are
> > searching for a long pattern like "Thomas Jefferson" in a long string
> > (> 100 characters), the RE is about 10 times faster than a
> > culture-invariant IndexOf.
>
> Wrong, I tried that exact example. Please show me where you get this
> data.
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public class SearchBenchmark
{
    public static void Main()
    {
        string searchString = "Thomas Jefferson";
        string bigString;
        StringBuilder builder = new StringBuilder();
        const int stringLength = 100;
        const int repeatCount = 100000;
        // Fixed seed, so every run searches the same string.
        Random rnd = new Random(1234);
        // Build a string of random digits. The search string never occurs
        // in it, so every search has to scan the whole string.
        for (int i=0; i<stringLength; i++)
            builder.Append((char)(rnd.Next()%10+'0'));
        bigString = builder.ToString();
        CompareInfo info = CultureInfo.InvariantCulture.CompareInfo;
        {
            // Culture-invariant IndexOf
            Console.WriteLine("Testing String.IndexOf:");
            long start_time = DateTime.Now.Ticks;
            for (int i=0; i<repeatCount; i++)
            {
                int index = info.IndexOf(bigString, searchString);
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }
        {
            // Regex built once, outside the timed loop
            Console.WriteLine("Testing Regex.Match:");
            Regex regex = new Regex(searchString);
            long start_time = DateTime.Now.Ticks;
            for (int i=0; i<repeatCount; i++)
            {
                Match m = regex.Match(bigString);
            }
            long end_time = DateTime.Now.Ticks;
            Console.WriteLine(new TimeSpan(end_time - start_time));
        }
        Console.ReadLine();
    }
}
You may play with the constants: REs use the Boyer-Moore algorithm, which is
about O(N/M) in the best case, while an ordinary String.IndexOf is about O(N).
The longer the strings in question are, the faster the regex gets compared
to IndexOf.
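To show where the N/M comes from, here is a rough sketch of the Horspool simplification of Boyer-Moore. It is not the code the Regex class actually runs, just an illustration of the skipping idea, and the HorspoolSearch name is made up:

```csharp
using System;
using System.Collections.Generic;

public static class HorspoolSearch
{
    // Illustrative Boyer-Moore-Horspool search using ordinal comparison.
    // Returns the index of the first match, or -1 if there is none.
    public static int IndexOf(string text, string pattern)
    {
        int n = text.Length, m = pattern.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // For every pattern character except the last one: the distance
        // from its last occurrence to the end of the pattern.
        Dictionary<char, int> shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m)
        {
            // Compare the current window right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Skip ahead based on the text character under the last
            // pattern position; characters not in the pattern allow a
            // jump of the full pattern length.
            int step;
            pos += shift.TryGetValue(text[pos + m - 1], out step) ? step : m;
        }
        return -1;
    }
}
```

In the benchmark string (digits only), no character of "Thomas Jefferson" ever occurs, so every mismatch jumps 16 characters at once: roughly N/M comparisons instead of N.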
> But again, the time usually doesn't come from the Match itself but from
> what you have to do to the Match, like in the From:To: example. The
> extra time it takes to copy an intermediate string and process that
> overshadows all of this. A custom-tailored algorithm will just never
> do this.
We've had that.
Look up "grouping parentheses".
> Also this is an extremely simple RE, no |'s or complex expressions.
Right. Simple ones are fast ones.
What did you expect?
> Also, most of the time in complex examples I don't use IndexOf but
> rather walk the string char by char and maintain a state machine. Yes,
> sometimes this is more complex, but this complexity is usually
> proportional to the complexity of the RE, and again I can debug it.
You're comparing apples with oranges here.
There are many cases where using a state machine or a recursive parser is
far superior to an RE; no one doubts that. But sometimes REs are simply the
better solution.
Niki