Re: Efficient regular expression pattern ?



Hi Steve,

Forgive me if I repeat anything you already know.

A regular expression is an expression of a set of rules that define a
"pattern" (or a pattern that defines a set of rules!).

It looks to me like you're trying to create SPAM detection rules, a most
admirable quest, and one which I am chipping away at during my own spare
time.

SPAM detection rules act as a filter, and if I'm not mistaken, you want to
filter out suspected SPAM emails.

For the greatest efficiency, you want to use the filter that removes the
most from the list in order to eliminate those from the remaining items that
need to be filtered. The fewer the items to filter, the faster the process.

Some filters will remove more than others. For example, let's say that
you're filtering on domain names. In your list, you identify the following:

^http://www.spywarerulez.com
^http://a.spywarerulez.com
^http://b.spywarerulez.com
^http://spywarerulez.com
^http://spywarerulez.net
^http://spywarerulez.org

Now, you could filter on all of them individually using several filters, but
if, for example, you know that any domain that ends with "spywarerulez."
plus a root domain, you could filter them all out with:

\b(?:http://)*\w*\.?spywarerulez.\w+\b

(Note that this assumes that the URL begins and ends at a word boundary. You
would probably want to change that part)

So, what I would do is to identify which regular expressions filter the most
items out first. Apply these filters first, and then create a succession of
more and more specific filters.

One good technique to prevent multiple filters is to use an OR expression,
as in:

\b(?:http://)*\w*\.?(spywarerulez|myspyware|spammersdelight|getrichquick|suckeraminute).\w+\b

This would match any domain with any of the domain names that are ORed
together.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.

"Steve B." <steve_beauge@xxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:OqYAqb7jGHA.3816@xxxxxxxxxxxxxxxxxxxxxxx
The goal is to detect some pages loaded on specific "malicious" pages.
According that, most of the regexp will looks like this :

^http://www.spywarerulez.com/

This is supposed to detect any ressource for this host.

However most of this shitting companies use several subdomain names, and
regexp also look like this :

^http://[A-Za-z0-9_\.].spywarerulez.com/

I'm quite new in using Regex, so I'm not aware of the behaviour of all
options.

For the moment I've not tested the actual speed of this process, since I
do not have right now all urls. The app is self learning and the url will
grow when it will go into production and when the administrators will add
new addresses.

Thanks,
Steve

"Barry Kelly" <barry.j.kelly@xxxxxxxxx> a écrit dans le message de news:
h92092tjnrd3cmpt5qlqodbss0bpv39vjt@xxxxxxxxxx
"Steve B." <steve_beauge@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

I'm building an application that analyse a flow of url in order to
detect
some pages.
I've a very huge list of regular expressions (up to several thousands)
that
I have to check on all urls.

Can you show us some of your regular expressions? Are many of them
simple text matches, or do they all involve wildcards? Are they using ()
which end up capturing when they don't need to (consider the
ExplicitCapture option)? Do you use the Compiled option?

It can be quite easy to write regular expressions with exponential
matching behaviour. For example, an RE which is supposed to match the
whole string and looks like ".*/foo/" will be much slower than
"^.*/foo/". Also, the current .NET regular expression matcher doesn't
perform some optimizations that many other matchers do.

Each regular expression will be evaluated agains all urls.

How can I write the code in order to be the most efficient ?

Profile it to find out what bits are slow. You might consider trying
each regular expression on a test set of URL data, and dumping out which
ones are matching slowest. That'll tell you which ones you need to
optimize.

But I'm afraid this process is quite slow since urls can input up to
10/seconds.

How slow is it currently?

I imagine you should be able to check many tens of thousands of regular
expressions per second.

-- Barry

--
http://barrkel.blogspot.com/




.