Re: Extract domain names out of URLs


On Wed, 23 Apr 2008 10:23:45 -0400, "Rick Rothstein (MVP - VB)"
<rick.newsNO.SPAM@xxxxxxxxxxxxxxxxxx> wrote:

re.Pattern =
"\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"

Now that is what I miss about Regular Expressions from my days many years
ago working with them in the UNIX world... their clarity and readability.<g>

Rick

<ggg>

And even when you write out the explanation:

===============================
URL capturing

\b((https?|ftp)://)?([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Options: case insensitive

Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference
number 1 «((https?|ftp)://)?»
   Between zero and one times, as many times as possible, giving back as needed
   (greedy) «?»
   Match the regular expression below and capture its match into backreference
   number 2 «(https?|ftp)»
      Match either the regular expression below (attempting the next
      alternative only if this one fails) «https?»
         Match the characters "http" literally «http»
         Match the character "s" literally «s?»
            Between zero and one times, as many times as possible, giving back
            as needed (greedy) «?»
      Or match regular expression number 2 below (the entire group fails if
      this one fails to match) «ftp»
         Match the characters "ftp" literally «ftp»
   Match the characters "://" literally «://»
Match the regular expression below and capture its match into backreference
number 3 «([-A-Z0-9.]+)»
   Match a single character present in the list below «[-A-Z0-9.]+»
      Between one and unlimited times, as many times as possible, giving back
      as needed (greedy) «+»
      The character "-" «-»
      A character in the range between "A" and "Z" «A-Z»
      A character in the range between "0" and "9" «0-9»
      The character "." «.»
Match the regular expression below and capture its match into backreference
number 4 «(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
   Between zero and one times, as many times as possible, giving back as needed
   (greedy) «?»
   Match the character "/" literally «/»
   Match a single character present in the list below
   «[-A-Z0-9+&@#/%=~_|!:,.;]*»
      Between zero and unlimited times, as many times as possible, giving back
      as needed (greedy) «*»
      The character "-" «-»
      A character in the range between "A" and "Z" «A-Z»
      A character in the range between "0" and "9" «0-9»
      One of the characters "+&@#/%=~_|!:,.;" «+&@#/%=~_|!:,.;»
Match the regular expression below and capture its match into backreference
number 5 «(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
   Between zero and one times, as many times as possible, giving back as needed
   (greedy) «?»
   Match the character "?" literally «\?»
   Match a single character present in the list below
   «[-A-Z0-9+&@#/%=~_|!:,.;]*»
      Between zero and unlimited times, as many times as possible, giving back
      as needed (greedy) «*»
      The character "-" «-»
      A character in the range between "A" and "Z" «A-Z»
      A character in the range between "0" and "9" «0-9»
      One of the characters "+&@#/%=~_|!:,.;" «+&@#/%=~_|!:,.;»


Created with RegexBuddy
======================================
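
For anyone who wants to try the pattern outside VBA, here is a minimal Python sketch of how the five backreferences described above map onto the parts of a URL. It is a port for illustration only, not the code from this thread: the helper names extract_domain and split_url are made up, the example.com addresses are placeholders, and the case-insensitive option is assumed to correspond to Python's re.IGNORECASE.

```python
import re

# The pattern from the post, compiled case-insensitively to mirror the
# "Options: case insensitive" note in the RegexBuddy output.
URL_RE = re.compile(
    r"\b((https?|ftp)://)?"              # groups 1 and 2: optional scheme
    r"([-A-Z0-9.]+)"                     # group 3: host (the domain name)
    r"(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"     # group 4: optional path
    r"(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?",   # group 5: optional query string
    re.IGNORECASE,
)


def extract_domain(text):
    """Return group 3 (the host) of the first match, or None if no match.

    Hypothetical helper name, not part of the original thread.
    """
    m = URL_RE.search(text)
    return m.group(3) if m else None


def split_url(text):
    """Return scheme, host, path and query of the first match as a dict."""
    m = URL_RE.search(text)
    if m is None:
        return None
    return {
        "scheme": m.group(2),
        "host": m.group(3),
        "path": m.group(4),
        "query": m.group(5),
    }


if __name__ == "__main__":
    print(extract_domain("http://www.example.com/path/page.html?x=1&y=2"))
    print(split_url("ftp://ftp.example.org"))
```

One caveat when using it this way: because the scheme is optional, the pattern will happily match any run of letters, digits, dots and hyphens. Searching a sentence such as "See http://example.com" returns "See" as the first match, so the sketch is best applied to strings already known to hold a single URL.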
--ron