Re: Regex doubt
From: Ben Lucas (ben_at_nospam.solien.nospam.com)
Date: 10/26/04
- Next message: Joshua Frank: "Re: Windows message -> .NET method"
- Previous message: Sriram Krishnan: "Re: Regex doubt"
- In reply to: Sriram Krishnan: "Re: Regex doubt"
- Next in thread: Sriram Krishnan: "Re: Regex doubt"
- Reply: Sriram Krishnan: "Re: Regex doubt"
- Reply: Haacked: "Re: Regex doubt"
Date: Tue, 26 Oct 2004 10:38:39 -0700
Sriram,
I am not a Regular Expressions expert myself, but I ran this by Phil, the
author of the article I sent you. This was his response:
"Simplest option is to do a Regex.Replace with my expression and replace
with empty string. Then what you have left is non-tag content. Sometimes
the best use of Regexp is to match what you don't want and get rid of it."
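
A minimal sketch of that suggestion, written in Python rather than .NET (re.sub plays the role of Regex.Replace here). The tag pattern is a simplified stand-in and not the expression from Phil's article; strip_tags is a hypothetical helper name used only for illustration.

```python
import re

# Assumption: a deliberately simple tag pattern ("<", anything up to the next ">").
# It is NOT the expression from the article and does not handle ">" inside
# attribute values, comments, or script blocks.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Replace every tag match with the empty string; what remains is non-tag content."""
    return TAG.sub("", html)

print(strip_tags('<p>Hello <b>world</b><img src="a.png"/>!</p>'))  # prints: Hello world!
```

Because each tag is matched and removed on its own, nesting does not matter: the pattern never has to pair an opening tag with its closing tag.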
--
Ben Lucas
Lead Developer
Solien Technology, Inc.
www.solien.com

"Sriram Krishnan" <ksriram@NOSPAMgmx.net> wrote in message
news:eYxno$3uEHA.1308@TK2MSFTNGP09.phx.gbl...
> Nice article - but it doesn't do what I want. His article is on how to
> match tags - my doubt is how to match all the other non-tag content. And I
> really don't want to use HtmlAgilitypack - I'm learning RegEx and want to
> figure out how to do this. So what I'm looking for is something like "match
> all the text that doesn't match that expression"
>
> --
> Sriram Krishnan
>
> http://www.dotnetjunkies.com/weblog/sriram
>
>
> "Ben Lucas" <ben@nospam.solien.nospam.com> wrote in message
> news:HZOdnRG16skvHOPcRVn-tg@comcast.com...
>> A good friend of mine recently posted an article on his blog regarding
>> using regular expressions to match HTML. His article can be found at:
>>
>> http://haacked.com/archive/2004/10/25/1471.aspx
>>
>> Hope this helps.
>>
>> --
>> Ben Lucas
>> Lead Developer
>> Solien Technology, Inc.
>> www.solien.com
>>
>>
>> "Sriram Krishnan" <ksriram@NOSPAMgmx.net> wrote in message
>> news:OXjLTu3uEHA.1616@TK2MSFTNGP10.phx.gbl...
>>> I'm doing some search-engine related work and want to match the actual
>>> content of an HTML page (i.e. any character which is not between a < and a
>>> >). I first wrote
>>>
>>> (?:\<.*?>) (?<content>.*?) (?:\<.*?>)
>>>
>>> which basically says match any text between an opening and a closing tag.
>>> The problem with this is that you almost always have nested tags. This
>>> exp is braindead as it chokes on nested tags. So this would match
>>> something like/ (this would match the '<img/>'
>>> part).
>>>
>>> So I came up with
>>>
>>> (?![\<|>].*?>)
>>>
>>> But the problem with this negative lookahead is that it doesn't advance
>>> beyond the first negation - it just stops there. I have a feeling that
>>> saying what I *don't* want is the way to go.
>>>
>>> I'm a bit of a newbie to RegEx - and I'm trying to write a RegEx which
>>> says something like - 'match any text that doesn't match this
>>> expression'. Or is there any way to do recursive regex matching - that is,
>>> within a pattern, match the pattern itself? In that case, the first
>>> pattern could be made to work as I could have a recursive call inside
>>> the (?<content>) pattern which keeps going down until you don't have any
>>> more nested tags
>>>
>>> Thanks in advance
>>>
>>> --
>>> Sriram Krishnan
>>>
>>> http://www.dotnetjunkies.com/weblog/sriram
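
For the "match all the text that doesn't match that expression" part of the question, one possible approach is to split the input on the tag pattern and keep whatever falls between the matches. The sketch below is Python, not .NET (Regex.Split would be the analogous .NET call). The pattern is a simplified assumption rather than the article's expression, and text_segments is a hypothetical helper name.

```python
import re

# Assumption: simplified tag pattern; it has no capture groups, so split()
# returns only the text between tag matches, never the tags themselves.
TAG = re.compile(r"<[^>]*>")

def text_segments(html):
    """Return the runs of non-tag text, dropping empty or whitespace-only pieces."""
    return [piece for piece in TAG.split(html) if piece.strip()]

print(text_segments("<p>Hello <b>world</b>!</p>"))  # prints: ['Hello ', 'world', '!']
```

This keeps each run of content as its own item, which may suit indexing work better than one stripped string.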