Re: [newbie] Best way to search for binary data

From: Jerry Coffin (jcoffin_at_taeus.us)
Date: 01/16/05


Date: Sun, 16 Jan 2005 14:19:06 -0700

In article <63CF5D3C-1C93-4F27-9F1B-670F8323DC59@microsoft.com>,
MSalters@discussions.microsoft.com says...

[ string searching ... ]

> Actually, Knuth wrote quite a bit about this. E.g. if pattern starts
> with 5 bytes with value 1, and buffer[4] is a 2, you know you don't
> have to check buffer[0] to buffer[3], and the next element you
> should compare is buffer[9]. Highly non-trivial if you're not
> searching random bytes.

Except in rather specialized circumstances, the Boyer-Moore based
algorithms are usually preferred to the Knuth-Morris-Pratt algorithm.
The primary advantage of the KMP algorithm is that it never requires
any "backwards" processing -- I.e. once you've looked at a particular
byte, you're done with it. If you're processing a stream of data and
can't store much (or any) of the data during the string match, it
works extremely well.

In a situation like this where the data is all in memory, the Boyer-
Moore algorithms tend to work better. The one that's usually
preferred I've seen referred to as Sunday's variant of Boyer-Moore-
Horspool, though I'm not sure Sunday was really the first to invent
it -- quite a few people (myself included) seem to have come up with
it at close enough to the same times that being sure who did it first
would be extremely difficult at best.

Regardless of who originated it, however, the algorithm looks
something like this:

#include <stddef.h>
#include <string.h>
#include <limits.h>

static size_t table[UCHAR_MAX + 1];
static size_t len;
static char *findme;
static size_t secondary;

/*
** Call this first with the string to search FOR.
*/
void init_search(const char *string)
{
      size_t i;

      len = strlen(string);
      for (i = 0; i <= UCHAR_MAX; i++)
            table[i] = len;
      for (i = 0; i < len-1; i++)
            table[(unsigned char)string[i]] = len - i - 1;
      secondary = table[(unsigned char)string[len-1]];
      table[(unsigned char)string[len-1]] = 0;
      findme = (char *)string;
}

/*
** Then call this with a buffer to search IN.
*/
char *strsearch(const char *string)
{
      register size_t shift;
      register size_t pos = len - 1;
      char *here;
      size_t limit=strlen(string);

      while (pos < limit)
      {
            while( pos < limit &&
                  (shift = table[(unsigned char)string[pos]]) > 0)
            {
                  pos += shift;
            }
            if (0 == shift)
            {
                  if (0 == strncmp(findme,
                        here = (char *)&string[pos-len+1], len))
                  {
                        return(here);
                  }
                  else pos+=secondary;
            }
      }
      return NULL;
}

Thanks to Jeff Dunlop, Ray Gardner and Thad Smith, this has fewer
bugs than when I originally wrote it. Any bugs that remain are my
fault. Thanks to Bob Stout, it's also now easier to use than when I
originally wrote it.

Unfortunately, since the last time any of them worked on it, I've
done still more rewriting myself, so I may have undone some of the
good they did, and re-introduced some bugs. OTOH, this code has been
used, studied, and tested for close to 15 years now, so with a little
luck most of the problems may be gone...

-- 
    Later,
    Jerry.
The universe is a figment of its own imagination.


Relevant Pages

  • Re: Finding substring in character array
    ... It works with a character array instead. ... is that there's no way to make one without already having all the characters in an array which then gets copied into the string. ... whereas you could fill all 1 MB with a char[]. ... If he insists on searching a char, i would suggest he reads up on the Boyer-Moore string searching algorithm. ...
    (comp.lang.java.programmer)
  • Re: Searching for byte string in a binary file.
    ... > string and data in a binary file. ... > location and length of the longest match from the left side of bstring. ... would be better to make the first argument "const char buf[255]". ... kind of algorithm to use. ...
    (comp.lang.c)
  • Re: Input file *.txt to dialog box
    ... Here's how you search for a complete string. ... are searching for in a char array. ... current character in the string. ...
    (microsoft.public.vc.mfc)
  • Re: Mimic AES_ENCRYPT and AES_DECRYPT functions in Ruby
    ... def mysql_key ... # The algorithm just creates a 16 byte buffer set to all zero, ... # Runs bitwise XOR for each char on string ...
    (comp.lang.ruby)
  • Re: iterate chars in a string
    ... and I'm searching for a way to iterate every char in ... > char in a string and match it with some known letter. ...
    (comp.lang.ruby)