Re: Efficient regular expression pattern ?

Tech-Archive recommends: Repair Windows Errors & Optimize Windows Performance



The suspects url will be stored and be used by multiples systems.
The first "client app" will be a proxy system that will kick this http
request and replace by a warning message (so even if the user try to display
a picutre in outlook, he won't actually send a request to the server and if
a "link" is clicked from an email, or IM, or anywhere else, the page won't
load if is forbidden).

Email filter is not planned now, but it should be a great idea to capitalize
malicious url.

Thanks,

Since

"Kevin Spencer" <kevin@xxxxxxxxxxxxxxxxxxxxxxxxxx> a écrit dans le message
de news: eimJaB8jGHA.984@xxxxxxxxxxxxxxxxxxxxxxx
Hi Steve,

Forgive me if I repeat anything you already know.

A regular expression is an expression of a set of rules that define a
"pattern" (or a pattern that defines a set of rules!).

It looks to me like you're trying to create SPAM detection rules, a most
admirable quest, and one which I am chipping away at during my own spare
time.

SPAM detection rules act as a filter, and if I'm not mistaken, you want to
filter out suspected SPAM emails.

For the greatest efficiency, you want to use the filter that removes the
most from the list in order to eliminate those from the remaining items
that need to be filtered. The fewer the items to filter, the faster the
process.

Some filters will remove more than others. For example, let's say that
you're filtering on domain names. In your list, you identify the
following:

^http://www.spywarerulez.com
^http://a.spywarerulez.com
^http://b.spywarerulez.com
^http://spywarerulez.com
^http://spywarerulez.net
^http://spywarerulez.org

Now, you could filter on all of them individually using several filters,
but if, for example, you know that any domain that ends with
"spywarerulez." plus a root domain, you could filter them all out with:

\b(?:http://)*\w*\.?spywarerulez.\w+\b

(Note that this assumes that the URL begins and ends at a word boundary.
You would probably want to change that part)

So, what I would do is to identify which regular expressions filter the
most items out first. Apply these filters first, and then create a
succession of more and more specific filters.

One good technique to prevent multiple filters is to use an OR expression,
as in:

\b(?:http://)*\w*\.?(spywarerulez|myspyware|spammersdelight|getrichquick|suckeraminute).\w+\b

This would match any domain with any of the domain names that are ORed
together.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.

"Steve B." <steve_beauge@xxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:OqYAqb7jGHA.3816@xxxxxxxxxxxxxxxxxxxxxxx
The goal is to detect some pages loaded on specific "malicious" pages.
According that, most of the regexp will looks like this :

^http://www.spywarerulez.com/

This is supposed to detect any ressource for this host.

However most of this shitting companies use several subdomain names, and
regexp also look like this :

^http://[A-Za-z0-9_\.].spywarerulez.com/

I'm quite new in using Regex, so I'm not aware of the behaviour of all
options.

For the moment I've not tested the actual speed of this process, since I
do not have right now all urls. The app is self learning and the url will
grow when it will go into production and when the administrators will add
new addresses.

Thanks,
Steve

"Barry Kelly" <barry.j.kelly@xxxxxxxxx> a écrit dans le message de news:
h92092tjnrd3cmpt5qlqodbss0bpv39vjt@xxxxxxxxxx
"Steve B." <steve_beauge@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:

I'm building an application that analyse a flow of url in order to
detect
some pages.
I've a very huge list of regular expressions (up to several thousands)
that
I have to check on all urls.

Can you show us some of your regular expressions? Are many of them
simple text matches, or do they all involve wildcards? Are they using ()
which end up capturing when they don't need to (consider the
ExplicitCapture option)? Do you use the Compiled option?

It can be quite easy to write regular expressions with exponential
matching behaviour. For example, an RE which is supposed to match the
whole string and looks like ".*/foo/" will be much slower than
"^.*/foo/". Also, the current .NET regular expression matcher doesn't
perform some optimizations that many other matchers do.

Each regular expression will be evaluated agains all urls.

How can I write the code in order to be the most efficient ?

Profile it to find out what bits are slow. You might consider trying
each regular expression on a test set of URL data, and dumping out which
ones are matching slowest. That'll tell you which ones you need to
optimize.

But I'm afraid this process is quite slow since urls can input up to
10/seconds.

How slow is it currently?

I imagine you should be able to check many tens of thousands of regular
expressions per second.

-- Barry

--
http://barrkel.blogspot.com/






.



Relevant Pages

  • Re: Efficient regular expression pattern ?
    ... "pattern". ... SPAM detection rules act as a filter, and if I'm not mistaken, you want to ... what I would do is to identify which regular expressions filter the most ...
    (microsoft.public.dotnet.framework)
  • RE: Filter Regular Expressions
    ... Gomez, Gonzalo wrote: ... Hi, I want to filter text using regular expressions, but i ... don't know how to find in a file a string like this, ...
    (perl.beginners)
  • Re: Persecution nuisance
    ... Check to see if your news reader software allows filters based on ... Regular expressions are strings of text which ... I just set up a regular expression filter for all those MI5 ...
    (rec.sport.curling)
  • Re: [kde] Kmail spam filtering - Help
    ... I noticed that the filter supports regular expressions but unless I can ... in getting round content learning spam filters. ... One of the filter options in KMails filter setup ...
    (KDE)
  • Re: IIS Extensions in URL causes filter to break.
    ... If your filter operates the way you say, ... IIS can only say that one or more of the events contain ... every request, and maybe more than once, depending on whether the URL causes ... As for the order of handling between ISAPI Extensions and Filters - Filters ...
    (microsoft.public.inetserver.iis.security)