Re: Tokenizing a large buffer
- From: "Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx>
- Date: Tue, 18 Dec 2007 10:50:15 -0800
On Tue, 18 Dec 2007 10:38:24 -0800, Samuel R. Neff <samuelneff@xxxxxxxxxx> wrote:
If you create the regexes as compiled (a constructor parameter) and cache
them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text, they can
all fail very fast if they don't match.
In short, I would first do some testing to make sure you're not trying
to solve a problem that doesn't exist.
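
A minimal sketch of that loop, written in Python's re module as a stand-in for the .NET Regex class (in .NET the equivalent would be Regex objects built once with RegexOptions.Compiled and patterns anchored at the current position). The token names, the patterns, and the tokenize function are all made up for illustration; they are not from the original question.

```python
import re

# Hypothetical token set, compiled once and cached at module level.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("OP", re.compile(r"[+\-*/=]")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Walk the text, trying each cached regex at the current position."""
    pos = 0
    tokens = []
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            # Pattern.match(text, pos) only succeeds if the match starts
            # exactly at pos, so a non-matching regex fails right away.
            m = pattern.match(text, pos)
            if m:
                if name != "SPACE":  # skip whitespace tokens
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at offset {pos}")
    return tokens
```

Timing something like this against the real input is the kind of test suggested above, before investing in anything more elaborate.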
In addition, how long can a Regex pattern string be? Assuming there's no practical limit (that is, you can put any arbitrary string there), then since Regex supports alternation (a boolean "or", written with |) in the pattern string, you could just have a single pattern string with all of the tokens in it.
So rather than looping with multiple searches using Regex, just loop over the tokens once to build the pattern string. Then let Regex do all the hard work.
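
Here is a rough sketch of the single-pattern idea, again in Python's re module rather than .NET (the .NET counterparts would be Regex.Escape and Regex.Matches). The token list and the function names are hypothetical. One assumption worth calling out: in a backtracking engine the first alternative that matches wins, so the sketch sorts longer tokens first to keep "<" from shadowing "<=".

```python
import re

# Hypothetical literal tokens, for illustration only.
TOKENS = ["<=", "<", "==", "=", "while", "if"]


def build_token_regex(tokens):
    """Join all tokens into one alternation pattern."""
    # Longest tokens first, so a longer token is tried before any
    # shorter token that is a prefix of it.
    ordered = sorted(tokens, key=len, reverse=True)
    # re.escape keeps characters like "<" or "=" literal.
    return re.compile("|".join(re.escape(t) for t in ordered))


TOKEN_RE = build_token_regex(TOKENS)


def find_tokens(text):
    """Return (offset, token) pairs for every token found in text."""
    return [(m.start(), m.group()) for m in TOKEN_RE.finditer(text)]
```

Note that this matches tokens anywhere in the text, including inside longer words; whether that matters depends on the grammar being tokenized.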
Would this be as fast or faster than the state graph? I don't know...it depends on whether the Regex authors put some effort into optimizing that case. I don't know enough about Regex (implementation _or_ API :) ) to have an answer to that. But even if they didn't, obviously Sam and I agree that the simpler code is better as long as there's no direct evidence that performance is actually going to be an issue.
Pete