Re: Regex doubt

From: Ben Lucas (ben_at_nospam.solien.nospam.com)
Date: 10/26/04


Date: Tue, 26 Oct 2004 10:38:39 -0700

Sriram,

I am not a Regular Expressions expert myself, but I ran this by Phil, the
author of the article I sent you. This was his response:

"Simplest option is to do a Regex.Replace with my expression and replace
with empty string. Then what you have left is non-tag content. Sometimes
the best use of Regexp is to match what you don't want and get rid of it."
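
A minimal sketch of Phil's suggestion, written in Python rather than .NET so it can be run standalone; `re.sub` plays the role of `Regex.Replace` here. The pattern `TAG` is a simplified stand-in assumed for illustration, not the expression from Phil's article, and it will mishandle cases such as a `>` inside a quoted attribute value. `strip_tags` is a hypothetical helper name.

```python
import re

# Hypothetical, simplified tag pattern (an assumption, not the article's regex):
# a '<', then any run of characters that are not '>', then a closing '>'.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Replace every tag match with the empty string, leaving non-tag content."""
    return TAG.sub("", html)

print(strip_tags("<p>Hello <b>nested <img/>world</b></p>"))
# prints: Hello nested world
```

Because each tag is removed on its own, without pairing an opening tag with its closing tag, nesting should not matter for this approach.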

-- 
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com
"Sriram Krishnan" <ksriram@NOSPAMgmx.net> wrote in message 
news:eYxno$3uEHA.1308@TK2MSFTNGP09.phx.gbl...
> Nice article - but it doesn't do what I want. His article is on how to 
> match tags - my doubt is how to match all the other non-tag content. And I 
> really don't want to use HtmlAgilityPack - I'm learning RegEx and want to 
> figure out how to do this. So what I'm looking for is something like "match 
> all the text that doesn't match that expression".
>
> -- 
> Sriram Krishnan
>
> http://www.dotnetjunkies.com/weblog/sriram
>
>
> "Ben Lucas" <ben@nospam.solien.nospam.com> wrote in message 
> news:HZOdnRG16skvHOPcRVn-tg@comcast.com...
>>A good friend of mine recently posted an article on his blog regarding 
>>using regular expressions to match HTML.  His article can be found at:
>>
>> http://haacked.com/archive/2004/10/25/1471.aspx
>>
>> Hope this helps.
>>
>> -- 
>> Ben Lucas
>> Lead Developer
>> Solien Technology, Inc.
>> www.solien.com
>>
>>
>> "Sriram Krishnan" <ksriram@NOSPAMgmx.net> wrote in message 
>> news:OXjLTu3uEHA.1616@TK2MSFTNGP10.phx.gbl...
>>> I'm doing some search-engine related work and want to match the actual 
>>> content of an HTML page (i.e., any character which is not between a < and a 
>>>  >). I first wrote
>>>
>>> (?:\<.*?>) (?<content>.*?) (?:\<.*?>)
>>>
>>> which basically says match any text between an opening and a closing tag. 
>>> The problem with this is that you almost always have nested tags. This 
>>> exp is braindead as it chokes on nested tags. So this would match 
>>> something like / (this would match the '<img/>' 
>>> part).
>>>
>>> So I came up with
>>>
>>> (?![\<|>].*?>)
>>>
>>> But the problem with this negative lookahead is that it doesn't advance 
>>> beyond the first negation - it just stops there. I have a feeling that 
>>> saying what I *don't* want is the way to go.
>>>
>>> I'm a bit of a newbie to RegEx - and I'm trying to write a RegEx which 
>>> says something like 'match any text that doesn't match this 
>>> expression'. Or is there any way to do recursive regex matching - that is, 
>>> within a pattern, match the pattern itself? In that case, the first 
>>> pattern could be made to work as I could have a recursive call inside 
>>> the (?<content>) pattern which keeps going down until you don't have any 
>>> more nested tags.
>>>
>>> Thanks in advance
>>>
>>> -- 
>>> Sriram Krishnan
>>>
>>> http://www.dotnetjunkies.com/weblog/sriram
>>>
>>>
>>>
>>
>>
>
> 
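
One way to get at "all the text that doesn't match that expression" directly, rather than deleting the tags, is to split the input on the tag pattern and keep the pieces in between. The sketch below uses Python's `re.split` so it can be run standalone (in .NET, `Regex.Split` is the analogous call). `TAG` is a simplified pattern assumed for illustration, not the expression from the linked article, and `non_tag_chunks` is a hypothetical helper name.

```python
import re
from typing import List

# Hypothetical, simplified tag pattern (an assumption for illustration only).
TAG = re.compile(r"<[^>]*>")

def non_tag_chunks(html: str) -> List[str]:
    """Split on tags and return the non-empty text runs between them."""
    # TAG has no capturing groups, so split() returns only the text between
    # matches; pieces that are empty or whitespace-only are dropped.
    return [piece for piece in TAG.split(html) if piece.strip()]

print(non_tag_chunks("<p>Hello <b>nested <img/>world</b></p>"))
# prints: ['Hello ', 'nested ', 'world']
```

This avoids both the negative lookahead and recursion: the pattern only describes what is not wanted, and the splitting step yields the complement.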

