Re: Regular Expression Matches


Well, there are 3 groups. <caturl> is a group name. The other 2 are unnamed.
Why do they need to be named if <caturl> is named? I'm not interested in the
other groups. I'm simply using them as "delimiters" for lack of a better
word.
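
A small sketch of the point above: in .NET a pattern can mix named and unnamed groups, and the named one is read by name while the others are simply ignored. The sample HTML string and the helper name ExtractCatUrl are made up for illustration; the pattern is the modified one from this post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // The modified pattern from this post, used unchanged.
    public const string Pattern =
        @"(href=)(?<caturl>.*)(class=title.*\[&nbsp;&nbsp;&nbsp;)";

    // Returns the text captured by the named group, or null when nothing matches.
    public static string ExtractCatUrl(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["caturl"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical one-line sample; the real page source is not shown in the thread.
        string html = "<a href=/cat/books class=title>[&nbsp;&nbsp;&nbsp;Books]</a>";

        // The two unnamed groups still capture, but nothing forces us to read them.
        Console.WriteLine(ExtractCatUrl(html));
    }
}
```

Note that the captured value keeps whatever sits between "href=" and "class=title", including trailing whitespace, so a Trim() may be wanted in real use.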

I've modified the expression to look like this:

(href=)(?<caturl>.*)(class=title.*\[&nbsp;&nbsp;&nbsp;)

This gives the exact same results.

The escaping stuff gets a little confusing because the regular expressions
are actually stored in an XML file, so they get escaped for that.

In the XML file that looks like:

(href=)(?&lt;caturl&gt;.*)(class=title.*\[&amp;nbsp;&amp;nbsp;&amp;nbsp;)
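
To check that the XML escaping is not changing the pattern, it may help to print what the parser actually hands back. The sketch below assumes the expression is stored as the text of an element; the element name pattern and the helper ReadPattern are hypothetical, since the real file layout is not shown here.

```csharp
using System;
using System.Xml;

public static class XmlPatternDemo
{
    // Returns the text of the root element; the XML parser decodes entities
    // such as &lt; and &amp; before the string reaches the Regex constructor.
    public static string ReadPattern(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // <pattern> is an assumed element name used only for this example.
        string xml =
            @"<pattern>(href=)(?&lt;caturl&gt;.*)(class=title.*\[&amp;nbsp;&amp;nbsp;&amp;nbsp;)</pattern>";

        // Prints the unescaped expression, which should equal the one typed by hand.
        Console.WriteLine(ReadPattern(xml));
    }
}
```

If the printed string matches the hand-written expression character for character, the XML layer can be ruled out as the cause.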

This still isn't returning multiple results. Just the last match. I don't
think the < was the problem.

Pete

"Kevin Spencer" <kevin@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:%23$t0JWxNGHA.1032@xxxxxxxxxxxxxxxxxxxxxxx
Hi Pete,

You need to escape the '<' and '>' characters in your Regular Expression.
These are used in some flavors of Regular Expression language to indicate
a named group. If the first (<caturl>) is a group name, name both groups
or neither.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
To a tea you esteem
a hurting back as a wallet.


"Pete Davis" <pdavis68@[nospam]hotmail.com> wrote in message
news:1eSdncq4reeQvWbenZ2dnUVZ_tWdnZ2d@xxxxxxxxxxxxxxx
I'm using regular expressions to extract some data and some links from
some web pages. I download the page and then I want to get a list of
certain links.

For building regular expressions, I use an app called The Regulator, which
makes it pretty easy to build and test regular expressions.

As a warning, I'm real weak with regular expressions. Let's say my
regular expression is:

(href=)(?<caturl>.*)(class=title>\[&nbsp;&nbsp;&nbsp;)

Now, using The Regulator and giving it the source for a particular web
page, I get 8 matches.

According to The Regulator, the options it's using are:

Multiline, ignore case, ignore whitespace

In my own code, I'm doing:

    Regex indexRegex = new Regex(categoryListRegex,
        RegexOptions.Multiline |
        RegexOptions.IgnorePatternWhitespace |
        RegexOptions.IgnoreCase);
    MatchCollection indexMatches = indexRegex.Matches(pageText);

This only returns one match in indexMatches with the same page that I'm
giving The Regulator. It seems that no matter what combination of regex
options I use, I'm only getting one match.

Why is that? How do I get all 8 matches?
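
One possible cause, which the thread does not confirm, is the greedy .* in the named group. RegexOptions.Multiline only changes how ^ and $ behave; the dot still stops at a newline unless RegexOptions.Singleline is set. So if the downloaded pageText has several links on one line, a greedy .* can run from the first href= to the last class=title and produce a single large match. The sketch below uses a made-up two-link string and a hypothetical helper, CountMatches, to compare the original pattern with a lazy .*? variant.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Same option set as in the post.
    const RegexOptions Options = RegexOptions.Multiline
                               | RegexOptions.IgnorePatternWhitespace
                               | RegexOptions.IgnoreCase;

    // Pattern from the post (greedy) and a lazy variant that differs only in ".*?".
    public const string Greedy = @"(href=)(?<caturl>.*)(class=title>\[&nbsp;&nbsp;&nbsp;)";
    public const string Lazy = @"(href=)(?<caturl>.*?)(class=title>\[&nbsp;&nbsp;&nbsp;)";

    public static int CountMatches(string pattern, string pageText)
    {
        return new Regex(pattern, Options).Matches(pageText).Count;
    }

    public static void Main()
    {
        // Hypothetical page fragment with two links on a single line.
        string pageText =
            "<a href=/a class=title>[&nbsp;&nbsp;&nbsp;A]</a>" +
            "<a href=/b class=title>[&nbsp;&nbsp;&nbsp;B]</a>";

        Console.WriteLine(CountMatches(Greedy, pageText)); // 1: one match spanning both links
        Console.WriteLine(CountMatches(Lazy, pageText));   // 2: one match per link

        // Each Match in the collection exposes the named group separately.
        foreach (Match m in new Regex(Lazy, Options).Matches(pageText))
        {
            Console.WriteLine(m.Groups["caturl"].Value);
        }
    }
}
```

If the lazy variant returns all 8 matches on the real page, the difference from The Regulator may come down to how line breaks appear in the text each tool receives.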

Thanks.

pete





