Re: WildCard Key



Ben <benm5678@xxxxxxxxx> wrote:
I have a std::list that holds a list of disallowed filenames. So when
the client creates a file, the service checks if it's in this list and
if so, it deletes the file. Currently, it's fast and efficient, in a
second, I can check if the key (the filename) exists in the list and
act accordingly if so.

But I have a new need to support filenames based on wildcards. So the
list may contain notepad*.exe and is supposed to block anything that
matches.

Any ideas of an efficient way to do this ?

I'd use a prefix tree aka trie (http://en.wikipedia.org/wiki/Trie) both
for the original problem and for the one with wildcards. It's a little
bit trickier for the latter, as you need to simulate a non-deterministic
automaton, but not much.

So you build your graph as shown in the wiki article mentioned above. In
wildcard case, some edges will be labeled ? or *, in addition to edges
labeled with normal letters. No two edges coming out of the same vertex
have the same label (this is true for ? and *, too). Each pattern traces
a path through the graph (follow edges marked with successive letters of
the pattern): a vertex where such a path ends is called "terminal"
(marked with numbers in wiki article).

You now need to determine whether incoming string "matches" a root of
the tree, following these rules:

1. An empty string matches a terminal vertex

2. A non-empty string xA (that is, a string whose first character is x
and the rest is A) matches a vertex V if there is an edge labeled 'x'
going from V to W, and A matches W

3. A non-empty string xA matches vertex V if there is an edge labeled
'?' going from V to W and A matches W.

4. A string A matches vertex V if there is an edge labeled '*' going
from V to W and some suffix of A matches W.


The beauty of the prefix tree appoach is that the search time is
independent of the number of patterns. In the absence of wildcards, the
algorithm is trivial - you just use the incoming string to trace the
path through the tree by following labeled edges, until you arrive at a
terminal vertex (or not, in which case there's no match). In the
presence of wildcards, you would need to backtrack: first, at any given
vertex more than one rule may apply, and you may need to try them all;
second, labels with * are non-deterministic - just like in your WildCmp
function, you would need to start by assuming * consumes zero
characters, then retry with one, with two and so on. Just as with
WildCmp, you only need to retry the last * you encounter.

An easy optimization: if a pattern has a trailing *, mark the vertex
which the corresponding '*' edge comes out of as "superterminal". Any
string (empty or otherwise) matches a superterminal vertex. Another
optimization would be to collapse any sequence of question marks and
stars that contains at least one star, to a single star.

Igor Tandetnik


.



Relevant Pages

  • Re: Help configuring Symantec NIS ad blocking - tech support is worthless
    ... > Wildcards do not work when adding strings to the NIS ad blocking list. ... finally issued a specific response. ... will match for the length of the supplied rule string. ...
    (comp.security.firewalls)
  • Re: Find and Replace
    ... wildcards but will just have to sit down and get to grips with them through ... Greg Maxey/Word MVP ... MarkN wrote: ... Again <s anchors the start of a find string to a word starting with ...
    (microsoft.public.word.docmanagement)
  • Re: [PHP] Slow searches in large database
    ... LIKE putting in wildcards for and aft e.g %string% slows down queries ... > We have a mysql database that has approximately 20,000 records and has a ...
    (php.general)
  • ICE + Javascript validation script != success
    ... I have been struggling for days to solve this problem and every ... I have lots of pages setup that query correctly and work really ... the form and add the wildcards for the user. ... send_string before the submit, it has wildcards in the string as I ...
    (comp.databases.ingres)
  • Re: Performance of string deserialisation
    ... I like this idea of calculating a hash value from the byte sequence and ... using that as a key in some sort of dictionary. ... concept as the prefix tree you mention -- a structure where common subsets ... reach the string value at a leaf. ...
    (microsoft.public.dotnet.framework.performance)