Re: Hub transport regex is broken, a horrible implementation, or I'm an idiot.



drsmith <drsmithhm@xxxxxxxxxxx> wrote:

Ok - let me start by saying that I've been using regex for the last 10
years on UNIX-like platforms. I have experience with awk, sed, ed,
grep, egrep, and perl - all of which use a standard regex
implementation (or mostly standard, anyway).

I created a rule that should examine the received header and perform a
specific action. The problem is that Microsoft's implementation
doesn't follow their documentation from what I can tell. Look at the
following pattern:

from\s\S*.mydomain.com\s\S*\sby\smyserver.mydomain.com\s
^
|
+-- Do you really want "any character" here?
Or did you mean "\." or "\s+"?

The documentation says that \s matches any whitespace character, \S
matches any non-whitespace character, and * matches multiple
occurrences of the preceding character.

I don't know which regex engine they use, but "*" usually matches
"zero or more" of the preceding character. Using "\S*" would entail an
awful lot of backtracking in the regex engine and probably not produce
the results you want.
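
To make those two points concrete, here is a minimal sketch using Python's re module. That is an assumption made purely for illustration: the engine behind the Hub Transport rules is not identified in this thread and may behave differently. The host name "mail.mydomain.com" and the look-alike string are hypothetical.

```python
import re

# Hypothetical host name, used only for illustration.
unescaped = re.compile(r"mail.mydomain.com")
escaped = re.compile(r"mail\.mydomain\.com")

# A string with no dots at all.
lookalike = "mailXmydomainYcom"

# An unescaped "." matches any character, so this look-alike slips through.
print(bool(unescaped.search(lookalike)))  # True
# "\." matches only a literal dot.
print(bool(escaped.search(lookalike)))    # False

# "*" means zero or more, so "\S*" is satisfied by an empty string.
print(re.fullmatch(r"\S*", "") is not None)  # True
# "+" means one or more, so "\S+" needs at least one character.
print(re.fullmatch(r"\S+", "") is not None)  # False
```

If the rule engine follows the same conventions, escaping the dots and using "+" where at least one character is required removes both sources of surprise.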

If that were true, the above
pattern would match:

Received: from unixserver.mydomain.com ([192.168.1.4]) by
myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC...


This might work:
\breceived:\s+\S+\s+mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b
^^^
|
+- your regex looked for just one character here.
you really want to match whitespace followed by
your domain

If the regex engine is using PCRE I'd add a "?" after those "+" signs
to make the expression "non-greedy" and reduce the backtracking.

.. but it doesn't. I'm very much used to using the combination, .* to
match any combination of characters,

The ".*" is usually misused. It's a "greedy" expression and it'll
match the rest of the string and then have to backtrack to match the
rest of the regex.
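
A small sketch of the greedy versus non-greedy difference, again using Python's re module as a stand-in (an assumption; whether the transport rule engine supports the "?" modifier is not confirmed here). The header is the sample from the post, joined onto one line and cut off where the post cuts it off.

```python
import re

# Sample header from the post, on one line.
header = ("Received: from unixserver.mydomain.com ([192.168.1.4]) by "
          "myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC")

# Greedy: ".*" runs to the end of the string, then backs up to the LAST ".com".
greedy = re.search(r"from\s.*\.com", header)
# Non-greedy: ".*?" grows one character at a time and stops at the FIRST ".com".
lazy = re.search(r"from\s.*?\.com", header)

print(greedy.group(0))
# from unixserver.mydomain.com ([192.168.1.4]) by myserver.mydomain.com
print(lazy.group(0))
# from unixserver.mydomain.com
```

Both searches succeed, but they capture different text, which is why a stray ".*" in the middle of a pattern so often matches more than intended.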

but Microsoft eliminated that for
these backslash combinations that don't truly match *any* character -
including whitespace characters.

I can get this match to work if I change it as follows:

from\s(a-zA-Z0-9)*.mydomain.com\s\S*\sby\smyserver.mydomain.com\s

Character classes are usually surrounded by square brackets, not
parentheses. That parenthesized expression "(a-zA-Z0-9)" represents a
"memorized" match of zero or more strings of "a-zA-Z0-9". I think you
meant "[a-zA-Z0-9]" and you should follow that with a "+", not a "*".
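
The difference is easy to check in any conventional engine. The sketch below uses Python's re module (an assumption for illustration, not a statement about the Exchange engine), with the dots escaped so that only the group-versus-class difference is being tested.

```python
import re

# Parentheses: a group that matches the literal text "a-zA-Z0-9".
group_pattern = re.compile(r"from\s(a-zA-Z0-9)*\.mydomain\.com")
# Square brackets: a class that matches any one letter or digit.
class_pattern = re.compile(r"from\s[a-zA-Z0-9]+\.mydomain\.com")

text = "from unixserver.mydomain.com"

print(bool(group_pattern.search(text)))  # False
print(bool(class_pattern.search(text)))  # True

# The group version only matches when the literal text itself is present.
print(bool(group_pattern.search("from a-zA-Z0-9.mydomain.com")))  # True
```

Note that "[a-zA-Z0-9]+" does not match host names containing a hyphen; add "-" to the class if your servers use one.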

but then it won't match if the server name doesn't contain a number.
In my environment, some do have a number and some don't.

Use a character class (i.e. "[...]"), not a memorized expression (e.g.
"(...)").

So - can anyone out there tell me if there's a robust way to write a
rule that will detect when a message has been issued by a server in my
domain via the SMTP protocol? I need this so I can act on messages
that are from the internet in general versus the messages that are
generated by my internal servers.

\breceived:\s+\S+\s+(server1|server2|server3|...)\.mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b

or

\breceived:\s+\S+\s+\S+\.mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b
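
Before putting either pattern into a rule, it is worth trying it against a few real headers. A minimal test harness, assuming Python's re module with case-insensitive matching (the rule engine may differ on both counts). The host names "unixserver" and "server2" in the alternation, and the external header from "mail.example.net", are hypothetical stand-ins.

```python
import re

# Sample header from the post, folded across two lines as it appeared there.
internal = ("Received: from unixserver.mydomain.com ([192.168.1.4]) by\n"
            " myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC")
# Hypothetical header for a message relayed from outside the domain.
external = ("Received: from mail.example.net ([203.0.113.9]) by\n"
            " myserver.mydomain.com ([192.168.1.6]) with Microsoft SMTPSVC")

# Second pattern from the post: any host under mydomain.com.
any_host = re.compile(
    r"\breceived:\s+\S+\s+\S+\.mydomain\.com\s+\S+\s+by\s+myserver\.mydomain\.com\b",
    re.IGNORECASE,
)
# First pattern from the post, with a hypothetical list of server names.
listed_hosts = re.compile(
    r"\breceived:\s+\S+\s+(unixserver|server2)\.mydomain\.com"
    r"\s+\S+\s+by\s+myserver\.mydomain\.com\b",
    re.IGNORECASE,
)

print(bool(any_host.search(internal)))      # True
print(bool(any_host.search(external)))      # False
print(bool(listed_hosts.search(internal)))  # True
print(bool(listed_hosts.search(external)))  # False
```

Because "\s+" also matches a line break, the folded header is handled without any special treatment.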

And before anyone suggests that I use the 'from users outside the
organization' predicate, I can already tell you it won't work thanks
to the fact that a lot of spam these days has falsified from/to
addresses *and* the fact that none of the available predicates allow
me to filter based on the envelope addresses.

Thanks to anyone who can help bring some sanity to this mess.


--
Rich Matheisen
MCSE+I, Exchange MVP
MS Exchange FAQ at http://www.swinc.com/resource/exch_faq.htm
Don't send mail to this address mailto:h.pott@xxxxxxxxxxxxx
Or to these, either: mailto:h.pott@xxxxxxxxxxxxxxx mailto:melvin.mcphucknuckle@xxxxxxxxxxxxx mailto:melvin.mcphucknuckle@xxxxxxxxxxxxxxx