Re: Hub transport regex is broken, a horrible implementation, or I'm an idiot.
- From: "Rich Matheisen [MVP]" <richnews@xxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 27 Apr 2007 21:47:16 -0400
drsmith <drsmithhm@xxxxxxxxxxx> wrote:
Ok - let me start by saying that I've been using regex for the last 10+
years on UNIX-like platforms. I have experience with awk, sed, ed,
grep, egrep, and perl - all of which use a standard regex
implementation (or mostly standard, anyway).
I created a rule that should examine the received header and perform a
specific action. The problem is that Microsoft's implementation
doesn't follow their documentation from what I can tell. Look at the
following pattern:
from\s\S*.mydomain.com\s\S*\sby\smyserver.mydomain.com\s
         ^
         |
         +-- Do you really want "any character" here?
             Or did you mean "\." or "\s+"?
The documentation says that \s matches any whitespace character, \S
matches any non-whitespace character, and * matches multiple
occurrences of the preceding character.
I don't know which regex engine they use, but "*" usually matches
"zero or more" of the preceding character. Using "\S*" would entail an
awful lot of backtracking in the regex engine and probably not produce
the results you want.
If that were true, the above
pattern would match:
Received: from unixserver.mydomain.com ([192.168.1.4]) by
myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC...
This might work:
\breceived:\s+\S+\s+mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b
^^^
|
+- your regex looked for just one character here.
   you really want to match whitespace followed by
   your domain
If the regex engine is using PCRE, I'd add a "?" after those "+" signs
to make the expression "non-greedy" and reduce the backtracking.
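As a rough illustration of why "\s+" is safer than a bare "\s", here is a small sketch using Python's re module (not the Exchange engine, whose behavior may differ). It assumes the Received header is folded after "by" with a CRLF plus a space, which the thread does not confirm; the names FOLDED, UNFOLDED, SINGLE_WS, MULTI_WS and matches are made up for this example.

```python
import re

# Sample header from the post. The CRLF + space fold after "by" is an
# assumption about how the header looks on the wire.
FOLDED = (
    "Received: from unixserver.mydomain.com ([192.168.1.4]) by\r\n"
    " myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC"
)
# The same header with the fold collapsed to a single space.
UNFOLDED = FOLDED.replace("\r\n ", " ")

# The original pattern: exactly one whitespace character at each "\s".
SINGLE_WS = re.compile(
    r"from\s\S*.mydomain.com\s\S*\sby\smyserver.mydomain.com\s"
)
# Same shape, but with "\s+" between tokens and the dots escaped.
MULTI_WS = re.compile(
    r"from\s+\S*\.mydomain\.com\s+\S*\s+by\s+myserver\.mydomain\.com\s"
)


def matches(pattern, header):
    """Return True if the pattern is found anywhere in the header."""
    return pattern.search(header) is not None


print(matches(SINGLE_WS, UNFOLDED))  # single spaces everywhere
print(matches(SINGLE_WS, FOLDED))    # "by" is followed by three whitespace chars
print(matches(MULTI_WS, FOLDED))
```

If the real header is folded this way, the single "\s" after "by" consumes only the carriage return and the match fails, while the "\s+" version tolerates any run of whitespace.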
.. but it doesn't. I'm very much used to using the combination, .* to
match any combination of characters,
The ".*" is usually misused. It's a "greedy" expression and it'll
match the rest of the string and then have to backtrack to match the
rest of the regex.
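The greedy behavior is easy to see in a short Python sketch (again Python's re module, used here only as a stand-in for whatever engine Exchange uses). The string TEXT is an invented example, not a real header.

```python
import re

# Invented test string with two " by " tokens.
TEXT = "from a.example by b.example by c.example"

# Greedy: ".*" first runs to the end of the string, then backtracks
# until the trailing " by" can match, so it stops at the LAST " by".
greedy = re.search(r"from (.*) by", TEXT).group(1)

# Non-greedy: ".*?" grows one character at a time and stops at the
# FIRST place where " by" can match.
lazy = re.search(r"from (.*?) by", TEXT).group(1)

print(greedy)  # a.example by b.example
print(lazy)    # a.example
```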
but Microsoft eliminated that for
these backslash combinations that don't truly match *any* character -
including whitespace characters.
I can get this match to work if I change it as follows:
from\s(a-zA-Z0-9)*.mydomain.com\s\S*\sby\smyserver.mydomain.com\s
Character classes are usually surrounded by square brackets, not
parentheses. That parenthesized expression "(a-zA-Z0-9)" represents a
"memorized" match of zero or more strings of "a-zA-Z0-9". I think you
meant "[a-zA-Z0-9]" and you should follow that with a "+", not a "*".
but then it won't match if the server name doesn't contain a number.
In my environment, some do have a number and some don't.
Use a character class (i.e. "[...]"), not a memorized expression (e.g.
"(...)").
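The difference between the two forms can be checked with a few lines of Python (a sketch with Python's re module; the host names are placeholders chosen for this example):

```python
import re

# Parentheses make a capturing group around the LITERAL text "a-zA-Z0-9".
GROUP = re.compile(r"(a-zA-Z0-9)*")
# Square brackets make a character class: any one letter or digit.
CLASS = re.compile(r"[a-zA-Z0-9]+")

# The group only matches repetitions of the literal string (or nothing).
group_literal = GROUP.fullmatch("a-zA-Z0-9a-zA-Z0-9") is not None
group_empty = GROUP.fullmatch("") is not None
group_host = GROUP.fullmatch("unixserver1") is not None

# The class matches host names with or without digits, but not "".
class_with_digit = CLASS.fullmatch("unixserver1") is not None
class_no_digit = CLASS.fullmatch("unixserver") is not None
class_empty = CLASS.fullmatch("") is not None

print(group_literal, group_empty, group_host)
print(class_with_digit, class_no_digit, class_empty)
```

Using "+" instead of "*" after the class is what rules out the empty host name.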
So - can anyone out there tell me if there's a robust way to write a
rule that will detect when a message has been issued by a server in my
domain via the SMTP protocol? I need this so I can act on messages
that are from the internet in general versus the messages that are
generated by my internal servers.
\breceived:\s+\S+\s+(server1|server2|server3|...)\.mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b
or
\breceived:\s+\S+\s+\S+\.mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b
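Both patterns can be tried out against sample headers before putting them into a rule. The sketch below uses Python's re module, so it only approximates what the Exchange engine will do. Assumptions: case-insensitive matching (the header says "Received:" while the pattern says "received:"), a header folded after "by", and the invented names unixserver, webserver2, appserver and mail.example.net, which stand in for the elided server list.

```python
import re

# Second pattern from above: any host under mydomain.com.
ANY_HOST = re.compile(
    r"\breceived:\s+\S+\s+\S+\.mydomain\.com\s+\S+\s+by\s+"
    r"myserver\.mydomain\.com\b",
    re.IGNORECASE,  # assumption: the rule engine ignores case
)
# First pattern, with hypothetical server names in the alternation.
NAMED_HOSTS = re.compile(
    r"\breceived:\s+\S+\s+(unixserver|webserver2)\.mydomain\.com\s+\S+\s+by\s+"
    r"myserver\.mydomain\.com\b",
    re.IGNORECASE,
)

TAIL = " by\r\n myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC"
INTERNAL = "Received: from unixserver.mydomain.com ([192.168.1.4])" + TAIL
OTHER_INTERNAL = "Received: from appserver.mydomain.com ([192.168.1.9])" + TAIL
EXTERNAL = "Received: from mail.example.net ([203.0.113.7])" + TAIL


def is_internal(pattern, header):
    """Return True if the header looks like a hop from an internal server."""
    return pattern.search(header) is not None


for name, header in [("internal", INTERNAL),
                     ("other internal", OTHER_INTERNAL),
                     ("external", EXTERNAL)]:
    print(name, is_internal(ANY_HOST, header), is_internal(NAMED_HOSTS, header))
```

The alternation is stricter (only listed servers match) but has to be maintained as servers come and go; the "\S+" version accepts any host name that ends in ".mydomain.com".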
And before anyone suggests that I use the 'from users outside the
organization' predicate, I can already tell you it won't work thanks
to the fact that a lot of spam these days has falsified from/to
addresses *and* the fact that none of the available predicates allow
me to filter based on the envelope addresses.
Thanks to anyone who can help bring some sanity to this mess.
--
Rich Matheisen
MCSE+I, Exchange MVP
MS Exchange FAQ at http://www.swinc.com/resource/exch_faq.htm
Don't send mail to this address mailto:h.pott@xxxxxxxxxxxxx
Or to these, either: mailto:h.pott@xxxxxxxxxxxxxxx mailto:melvin.mcphucknuckle@xxxxxxxxxxxxx mailto:melvin.mcphucknuckle@xxxxxxxxxxxxxxx