Re: Fastest way to search a string for the occurance of a word??

From: Niki Estner (niki.estner_at_cube.net)
Date: 08/11/04


Date: Wed, 11 Aug 2004 17:37:53 +0200


"Jon Skeet [C# MVP]" <skeet@pobox.com> wrote in
news:MPG.1b8444cfa741b7d398b12b@msnews.microsoft.com...
> ...
> Well, you're using "whatever case we happen to get" performance -

Performing thousands of tests over random data, adding up the results:
That's average performance in my understanding.
You may have read somewhere that quicksort has O(n*log n) performance in the
average, O(nē) in the worst case. Average case = sort random data; Worst
case = sort data that is already sorted.

> without knowing anything about the real world data, we don't know what
> "average-case" performance is going to be at all.

???
You think you can't say e.g. whether bubblesort on the average performs
better/equally/worse than quicksort without knowing what data will be
sorted?
They do teach that topic differently where I come from...

> For instance, where
> did you get "1234" from? The results change very dramatically if you
> change it to "123" for instance, presumably because "123" is found
> fairly early in the string, which is where String.IndexOf does better.
> (Searching for "123" gives results of 2.89s for String.IndexOf, and
> 4.52s for regular expressions, for a million iterations (one test
> only.))

Again, on my PC, Regex is 5 times faster.
BTW: I've tested that on my laptop, too. Same results.

> ...
> I'm not sure that "corrected" is the right word for what I did -
> certainly "commented on".

Then I must have misunderstood the sentence "I usually find the exact
opposite". To me that sounded a lot like a correction...

> ...
> If I'd had the same results you'd had, I would probably have
> recommended REs, yes - but definitely with a caveat that it's worth
> checking that this really is the performance bottleneck, as IndexOf is
> more readable and in most cases has "good enough" performance. When
> advocating any "more complex" code for performance reasons, I tend to
> put such a caveat in. (I consider regular expressions more complex as
> you need to be careful about the search string - if it contains any
> characters that are "special" for regexes, you may need to escape them,
> for instance.)

That *was* my advice. I *did* tell him to find out what's the bottleneck of
his app, any do a few performance tests if string matching turns out to be
that bottleneck.

> Having looked at my reply to you, btw, I noticed I'd used the phrase
> "hard-coded", and so I thought I'd apply that to our particular test. I
> assume this is pretty much what Boyes-Moore does:
> ...
>
> (I think that's correct, isn't it?)

I think it should read:

  static int Find1234 (string haystack)
  {
    int index=3;
    int max = haystack.Length-3;
    while (index < max)
    {
      switch (haystack[index])
      {
        case '1':
          if ((haystack[index+1] == '2') &&
              (haystack[index+2] == '3') &&
              (haystack[index+3] == '4'))
            return index;
          break;
        case '2':
          if ((haystack[index-1] == '1') &&
            (haystack[index+2] == '3') &&
            (haystack[index+3] == '4'))
            return index;
          break;
        case '3':
          if ((haystack[index-2] == '1') &&
            (haystack[index-1] == '2') &&
            (haystack[index+3] == '4'))
            return index;
          break;
        case '4':
          if ((haystack[index-3] == '1') &&
            (haystack[index-2] == '2') &&
            (haystack[index-1] == '3'))
            return index;
          break;
      }
      index+=4;
    }
    return -1;
  }

On my PC, that's about 20-30% faster than a compiled Regex.
But I guess if the string was variable, and the function had to use
table-lookups instead of constants, it would probably be slower.

> I haven't tried to optimise it any further, but it already just about
> out-performs the other ways in a few test cases. (That's not using
> compiled reg-exes, admittedly - just your original test code.)

Yep, that was why I suggested that in my original reply.

Niki



Relevant Pages

  • Re: New form of recursive compression tested and proven. Now how do I market it?
    ... >> Tom, my claim is that I can recursively reduce random data. ... > bit can only represent half as many values as a 2-bit string. ... > So now that I've disproven that ANY FORM of random data compressor is ...
    (comp.compression)
  • Re: New form of recursive compression tested and proven. Now how do I market it?
    ... bit can only represent half as many values as a 2-bit string. ... I'm beginning to believe that there is no such thing as Random Data. ... that simply emits all 1024-bit integers will need to have at least ... determine the output which means you either can't compress all N-bit ...
    (comp.compression)
  • Split passwords
    ... (so it looks like one long string of letters and numbers). ... secure as long as the file of random data remains secure. ... User must protect the random data. ... An OS with bad security might permit another ...
    (sci.crypt)
  • Re: hash and __eq__
    ... chaining (like Python dicts) described as Orather than O, ... Quicksort as Oinstead of O, ... default it usually refers to the average case, using random data. ...
    (comp.lang.python)
  • Re: C Sharp sorting considered superior to C by an order of magnitude
    ... Just for the sake of it, why did you only implement Oalgorithms ... so a fast quicksort is O. ... For random data, a comparison based ...
    (comp.programming)