Re: Tokenizing a large buffer


On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <samuelneff@xxxxxxxxxx> wrote:


If you generate the regexes as compiled (constructor param) and cache
them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text then they can
all fail very fast if they don't match.

In short, I would do some testing to make sure you're not trying to
solve a problem that doesn't exist first.
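
A rough sketch of the loop Sam describes, for anyone who wants something to time. It uses Python's `re` module as a stand-in for .NET's Regex, since the idea itself isn't specific to either. The token names and patterns are hypothetical, made up purely for illustration; in .NET the "compiled" part would be the RegexOptions.Compiled constructor option.

```python
import re

# Hypothetical token set (names and patterns are assumptions for illustration).
# Each pattern is compiled once and cached in this list.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[+\-*/=]")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Walk the text, trying each cached regex at the current position."""
    pos = 0
    tokens = []
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            # match(text, pos) only succeeds if the match starts exactly at
            # pos, so a non-matching regex fails without scanning ahead.
            m = pattern.match(text, pos)
            if m and m.end() > pos:
                if name != "SPACE":  # drop whitespace tokens
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at offset {pos}")
    return tokens
```

Calling `tokenize("x = 42 + y1")` walks the string once and tries at most four anchored matches per position, which is the "fail very fast" behaviour described above.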

In addition, how long can a Regex pattern string be? Assuming there's no practical limit (that is, you can put any arbitrary string there), then since Regex supports alternation (boolean "or", written "|") in the pattern string, you could just have a single pattern string with all of the tokens in it.

So rather than looping over multiple Regex searches, loop over the tokens once to build the pattern string. Then let Regex do all the hard work.
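
A minimal sketch of the single-pattern idea, again in Python's `re` as a stand-in rather than the .NET API. The token list and both function names are hypothetical. Two details are my own assumptions about what you'd want: the tokens are escaped so they are treated as literals, and they are sorted longest first so that "==" is tried before "=".

```python
import re

# Hypothetical list of literal tokens (an assumption for illustration).
TOKENS = ["while", "if", "else", "==", "=", "+"]


def build_pattern(tokens):
    """Join all tokens into one alternation pattern, compiled once."""
    # Alternatives are tried left to right, so put longer tokens first;
    # otherwise "=" would match before "==" gets a chance.
    ordered = sorted(tokens, key=len, reverse=True)
    # re.escape makes characters such as "+" match literally.
    return re.compile("|".join(re.escape(t) for t in ordered))


def find_tokens(text, pattern):
    """Return (offset, token) pairs for every token found in the text."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]
```

With this, `find_tokens("if a == b", build_pattern(TOKENS))` scans the text once with a single regex. Whether that beats a hand-built state graph would still have to be measured.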

Would this be as fast or faster than the state graph? I don't know...it depends on whether the Regex authors put some effort into optimizing that case. I don't know enough about Regex (implementation _or_ API :) ) to have an answer to that. But even if they didn't, obviously Sam and I agree that the simpler code is better as long as there's no direct evidence that performance is actually going to be an issue.

Pete