Re: Efficient regular expression pattern ?

Tech-Archive recommends: Speed Up your PC by fixing your registry



"Steve B." <steve_beauge@xxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:u6mFuuFkGHA.3620@xxxxxxxxxxxxxxxxxxxxxxx
Your solution seems to be quite effective, but there a scenario where this
cannot apply :

The solution will probably works well for domain filtering, but if I want
to exlude a specific page or sub pages, it won't works.

Let's imagine for example a web site hosting site :
http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm

In this case, the domain is ok, but not the page.


yes, but you get a hit on the domain, so you proceed to the page. If you
make the domain matching efficient, you make 95% of your matching efficient.


I'm thinkg about a "IUrlFilter" :

interface IUrlFilter {
bool TestUrl(string urlToTest);
}

This interface will be implemented twice :

class HostFilter : IUrlFilter

class PageFilter : IUrlFilter


I disagree. I wouldn't use your mechanism for either method.

I firstly test agains hostname (which will probably be 95% of the tests),
and if host is ok, I also test against page (with a very lower number of
test to execute) using the regex method.

Regex can be used not only to match, but also to parse. Use it to parse,
but then match using a data structure.

Use regex to return a list of terms seperated by slashes. Create a tree
structure of all of the offending pages by URL. I'm willing to bet that 80%
of the hits against an offending page will be consumed by less than 5% of
the root elements and less than 30% of the first level elements. In other
words, virus sites will have many virus pages, and hosting providers that
don't care about virus hosting will have many many virus sites.

allow for aliases by making your tree so that you can have references from
multiple roots point to a single tree. This covers the fact that DNS
systems routinely allow for nearly infinite aliases.

hostingprovider.com
subdomain: abc, jbx, sand, zx*
dir: foo
dir: bar
page: viruspage1.html
page: viruspage2.html
page: viruspage3.html
dir: fudge
page: viruspage4.html
hostalias.com -- reference to hostingprovider.com
127.198.41.77 -- reference to hostingprovider.com

One regex can parse your entire URL into parts. Domain, subdomain, a list
of directories, and a page reference, followed by a list of parameters.
So you can match once against every URL, take the parts list and run it
through the tree you create. Simple matching will catch every hit. And it
scales well.

Let's say that you have 10,000 URLs in your list. Let's say that 80% of
them have the same 100 domains. That leaves 20% with a smattering of other
domains. I'll swag that down to about 1000 domains. Let's say that you
create a tree structure like above, but the root is a sorted list of these
1000 domains. Let's say you use my method and a binary search against the
domain name. You can find the domain hit in 10 hits or less... you'll
average five hits. Then, if your directory and page lists follow the same
pattern, you are likely to find your match or free the URL in less than 10
more matches. That's 20 matches. Add another 10,000 URLs to the list. The
number of matches goes up... to 21.

With your method, the first scenario requires 10,000 matches. The second
requires 20,000 matches. It scales EXCEEDINGLY badly.

Please, Steve, use a data structure for this mechanism. The fundamental
concept is a Suffix tree, which is a specialized form of a Trie. Note that
you could use a pure Trie structure, especially a Patricia Trie structure,
if you are clever and want REALLY fast performance, but that may require
some programming chops. You can find a good writeup on Trie structures on
wikipedia. http://en.wikipedia.org/wiki/Trie You could also use a Directed
Acyclic Word Graph, which will give you some very fast performance as well.
To be honest, I may have described a compact DAWG above.

Some other links:
http://www.nist.gov/dads/HTML/suffixtree.html
http://www.nist.gov/dads/HTML/directedAcyclicWordGraph.html

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--


.



Relevant Pages

  • Re: eval in bash vs macro in lisp
    ... The first element is the symbol "defun", ... The last two elements are nil, ... is used to end lists. ... The whole thing is a tree formed from pairs. ...
    (comp.lang.lisp)
  • Re: Lisp at sexps length
    ... A tree doesn't look. ... tokens or characters but it's not *the* stream. ... >> parens for grouping and something else for lists. ... it means to construct a syntax object which means addition. ...
    (comp.lang.lisp)
  • Trees
    ... Trees are another way to organize data for rapid lookups. ... a sort in the usual sense, a tree can be mechanically transformed into ... The advantage of dynamic lists over ... entry 'tree-insert' using root. ...
    (comp.lang.cobol)
  • www.CeBeans.com - new program listings - Feb 4 2008
    ... This program is database of 25 categories of recipes on the cmu.edu website. ... This program allows you build custom play lists of Mp3 and WMA files on your ... hitting a tree. ... Free man every round after five. ...
    (microsoft.public.pocketpc)
  • Re: Do people dislike parentheses or the conceptual mismatch with trees?
    ... the real life cases where you actually /have/ tree-like structures, ... difference between a list and a tree a is that the tree is ... Lists are more primitive. ... SCRAWL> h ...
    (comp.lang.lisp)