Re: Regex optimization



Hello Chuck,

"Chris Shepherd" <chsh@xxxxxxxxxxxxxx> wrote in message
news:%23HGhWfEAIHA.968@xxxxxxxxxxxxxxxxxxxxxxx

Chuck B wrote:

Yes I will - but then that won't tell me much about what went on
internally.

I was hoping that someone with knowledge of the Regex engine could
help me understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)

Trying it for yourself and seeing the results is a perfectly
reasonable answer.

I have to disagree here. Trying it for myself will give me numbers but
not understanding which is the untimate goal.

While it may not allow you to recreate the source code, it certainly
will tell you which way is more efficient. Because your original
regexes were completely unequal, it's hard to say whether you could
actually learn anything about what is going on "internally" since in
one instance you'd get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find
them, so #1 wins out. On 10,000 lines of "The quick brown fox" where
the returned values would be equal, I'd wager on the second regex
being faster (equal time spent returning values, less time spent
searching).

The fault here is mine for not explaining adequately what I wanted.

In the case above I'm looking for the special case where there is 1
match per string for either Regex. For instance; the result of running
both examples above against "09/26/07 The quick brown fox ran away."

The date would probably take just as long for each regex. However, it
seems like it might be more efficient with the static characters to
search for a longer string than a shorter one (assuming that there was
no match embedded inside of another match). The reason for the
increase is that the pointer pointing to the head of the search would
move a greater length after a successful match.

A rule of thumb:
the more variable length parts, the slower *, +, {1,3}, {1,}
reluctant modifier, may be slower .*?, +?
the more look arounds (?<+, (?!, (?=, (?<!, the slower
the more optional parts, the slower
the more choices, the slower
the longer the regex, the slower
the more general parts (eg .*), the slower *if* there is a new part after that.
a short constant part is marginally faster than a large constant part
anchored regexes are faster than ^, $
CultureInvariant regexes are slower
Named captures are slower than unnamed ones.
If you don't use a group for capturing, then it's faster to use (?:...) instead of (...). Unless you specify the ExplicitCapture option.

Why?

Variable parts will try to capture as much as possible. so if you look at the following expresion: A.*B in the following string: "AB1234567890" will first apply the A, then the .* which means it will try to capture AB1234567890. Then it concludes the B won't match and it will remove the last character of the string back to AB123456789, again conclusign it won't match and remove the next character from the capture AB12345678 all the way till it finally returns AB. This is called Backtracking and you usually want to prevent that as much as possible. If you know there should be a relatively short distance between the A and the B use a reluctant modifier .*? or limit the length to search .{0,100} or even make that reluctant .{0,100}?. See my note on reluctant modifiers in the next paragraph.

Reluctant modifiers are generally slower. *unless* you know there is supposed to be a a short distance between this part and the next. Why? Well consider A.*?B with the following input: A1234567890B it will try to match this as follows: it firtst captures A, then sees if there's a B behind it. If there isn't the .*? captures the 1. So now we're ate A1. The engine will again try to see if the next character is a B. It isn't so .*? will capture the next character A12. This is repeated until it captures A1234567890B. Now going back to the previous example. If th einput string there is used this expression is actually faster. Why? Well considering the string AB1234567890. The engine will capture the constant A first. It will then look if there's a B behind it. It is. So the result AB is returned.

The same applies for look arounds. When the engine finds a look around it will start a new expression search from that position. Either in the opposite direction or it will continue from the current position and return afterwards. A look around with a constant or constant length value (eg (?<=AB) or (?<=A.{8}B) will be much faster than a variable length check. Even more so if the variable length check could result in backtracking of itself. (eg woth backtracking option (?<=A.*B) or a normal one (?<=A.*). The more look arounds, the more expressions need to be run. In some cases a look around can make things faster, or are the only way to get the correct result. But try to keep the possible length of the match inside the look around as short as possible.

Optional parts make things slower. Why? Well again has to do with backtracking. There is a chance it needs to revert back if the optional part won't match. This get's worse if the optional part has a variable length.

The more choices teh slower. (A|B|C|D|E|F|G) will try each option one after the other. It will try them in the order yous pecify them. SO if you have a large batch of data you're going to parse in batch. It migth make sense to see which characters occur more often than others and put those up front. Also, if you have multiple parts that begin with the same string it's faster to break doen the problem even faster: (AAA, ABB, ACC, ADD) is slower than (A(AA|BB|CC|DD)). Why? Well because this saves the engine from backtracking back over the A. The same of course applies to [abcdefg] wich is teh equivalent of the regex (?:a|b|c|d|e|f|g).

The longer the regex, the slower. Well that seems logical doesn't it? If you have two almost identical regexes, the longest one will be slower. Just because there are more parts to evaluate.

The more general parts, the slower. This comes back to the first rule. [a-z]* has a more limited amount of characters to match. So teh chances it migth capture the whole input and backtrack all the way back are smaller than when you use .* or even worse [\s\S]*. (why is [\s|S]* worse? you might ask, well that has to do with the fact that by default the .* will stop when it finds a newline.)

A short constant part is faster than a long one. Just because it will take marginally longer to compare 10 characters instead of 5. I'd ignore this. In some cases a larger string can be faster. Just because the chances of findign such a string are more remote, so there are less points in the string a regex migth try to start and find a match. Especially if the large constant is at the front of the string.

Anchored regexes are faster. If you anchor a match either to the beginning ^ or the end of a string $, it'll be much faster. Though adding $ to a regex might in some cases make no difference at all. The ^ just ensures there are a lot fewer places to start searching. Thus it's faster.

CultureInvariant regexes are slower because in that case the engine will try to match each character to it's Unicode equivalents in other language blocks. There might me more representations of the character i in all the unicode characters. So instead of actually searching for i it now searches for [iiiiiiiiii] (each character having another unicode value, but equal representation).

If you use capturing (....) will be faster than (?<a>....). I'm not exactly sure, but I guess is that this is being caused by the fact that the engien will have to do more string comparisons opposed to a simple numeric counter. Do note that for maintainability and readability a named capture is much better, even though it's slower.

Finally, if you're not using a numbered capture, it's faster to let it not capture at all. This can be achieved in two ways. 1. add ?: to the beginning of the group to tell the engine it won't have to store the value it finds. If your regex doesn't use unnamed captures at all, set the ExplicitCapture option to tell it it can ignore any non-named group for capturing. It save you from typing ?: at the start of each group, making the parsing of your regex a tiny bit faster and your regex more readable.

So back to your problem:

From all these rules, only the larger constant part one will apply. And because
there is no very restrictive part after the constant, the shorter pattern will probably win. Be it marginally.

--
Jesse Houwing
jesse.houwing at sogeti.nl


.



Relevant Pages

  • Re: Regex Capture problem
    ... "learned" my regex using a freeware utility that had slightly different ... was trying to capture instead of. ... I have used Regex utilities before, so I understand the concepts of text ... Function RESub(str As String, SrchFor As String, ReplWith As String) As String ...
    (microsoft.public.excel.programming)
  • Re: Regex optimization
    ... I was hoping that someone with knowledge of the Regex engine could ... reluctant modifier, may be slower .*?, +? ... Variable parts will try to capture as much as possible. ... The engine will again try to see if the next character is a B. It ...
    (microsoft.public.dotnet.languages.csharp)
  • regex: capture groups and term binding
    ... I'm building a regex for this string and it's pretty straightforward. ... Only prerequisite is to capture all numbers for later Ruby fun: ... a rule-set from a YAML file to create output. ...
    (comp.lang.ruby)
  • Regular expression captures
    ... String: This action has been executed via Console ... RegEx r = RegEx ... This is where I expected the capture!! ...
    (microsoft.public.dotnet.framework)
  • Re: regular expression help
    ... Basically because if you remove everything that is optional in the regex below you end up with an empty regex: ... So the regex engine will try to match on every character in the string: ... , comma doesn't match, but the nothingness in front of it does. ... A quote followed by any sequence of characters that is not a quote, ...
    (microsoft.public.dotnet.framework)