Re: Regex doubt


From: Ben Lucas (ben_at_nospam.solien.nospam.com)
Date: 10/26/04


Date: Tue, 26 Oct 2004 09:56:14 -0700

A good friend of mine recently posted an article on his blog regarding using
regular expressions to match HTML. His article can be found at:

http://haacked.com/archive/2004/10/25/1471.aspx

Hope this helps.

-- 
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com

"Sriram Krishnan" <ksriram@NOSPAMgmx.net> wrote in message 
news:OXjLTu3uEHA.1616@TK2MSFTNGP10.phx.gbl...
> I'm doing some search-engine related work and want to match the actual 
> content of an HTML page (i.e. any character that is not between a < and 
> a >). I first wrote
>
> (?:\<.*?>) (?<content>.*?) (?:\<.*?>)
>
> which basically says: match any text between an opening and a closing 
> tag. The problem with this is that you almost always have nested tags. 
> This expression is braindead, as it chokes on nested tags. So this would 
> match something like / (this would match the '<img/>' part).
>
> So I came up with
>
> (?![\<|>].*?>)
>
> But the problem with this negative lookahead is that it doesn't advance 
> beyond the first negation; it just stops there. I have a feeling that 
> saying what I *don't* want is the way to go.
>
> I'm a bit of a newbie to regex, and I'm trying to write a regex which 
> says something like 'match any text that doesn't match this expression'. 
> Or is there any way to do recursive regex matching, that is, within a 
> pattern, match the pattern itself? In that case, the first pattern could 
> be made to work, as I could have a recursive call inside the (?<content>) 
> pattern which keeps going down until you don't have any more nested tags.
>
> Thanks in advance
>
> -- 
> Sriram Krishnan
>
> http://www.dotnetjunkies.com/weblog/sriram
>
>
> 
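
The "say what I don't want" idea in the question above can be sketched
without lookahead or recursion. The following is a rough illustration in
Python, not the .NET regex flavour used in the question, and not the
approach from the linked article. All names in it (TEXT_RUN, text_runs,
TextCollector, visible_text) are made up for this example. The regex
version assumes every '<' opens a tag and every '>' closes one, so it can
go wrong on comments, scripts, or a '>' inside an attribute value. The
parser version uses the standard library's html.parser and is likely the
safer choice for real pages.

```python
import re
from html.parser import HTMLParser

# A text run starts at the beginning of the input or right after a '>',
# contains no '<' or '>', and ends at the next '<' or at the end of input.
# Assumption: '<' and '>' only ever appear as tag delimiters.
TEXT_RUN = re.compile(r"(?:^|>)([^<>]+)(?=<|$)")


def text_runs(html):
    """Return the pieces of text found outside tags, in document order."""
    return TEXT_RUN.findall(html)


class TextCollector(HTMLParser):
    """Collect character data, ignoring the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return all text outside tags as one string, using a real parser."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    sample = "<p>Hello <b>big</b> world<img/>!</p>"
    print(text_runs(sample))     # ['Hello ', 'big', ' world', '!']
    print(visible_text(sample))  # Hello big world!
```

Nesting is not a problem for either version, because neither tries to pair
an opening tag with its closing tag; they only look at what lies outside
the angle brackets.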

