Re: Splitting a string with Regex and keep the separator



* nagar@xxxxxxxxxxxxxxxx wrote, On 5-6-2007 22:06:
One more thing Jesse.
I noticed that the Key(val) is not interpreted correctly if I have two
expressions attached to one another.

For example

test#Key(F1) is interpreted correctly
test#Key(F1)#Key(F2) is not

How should I change the expression?
Thanks again.
Andrea

After some careful testing I got the other issues fixed as well. The regex is quite big already. I'll try to explain what's going on where.

First, the regex:

\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))

To extract the fields you can use this:

// ms is assumed to be the MatchCollection returned by Regex.Matches for the regex above
foreach (Match m in ms)
{
    if (m.Groups["keyval"].Success)
    {
        string key = m.Groups["key"].Value;
        foreach (Capture c in m.Groups["token"].Captures)
        {
            string token = c.Value;
        }
    }
    else if (m.Groups["other"].Success)
    {
        ManipulateOther(m.Groups["other"].Value);
    }
}
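To try the loop end to end, here is a minimal sketch of a complete program built around the regex above. The class name KeyValDemo, the method Describe and its output format are made up for illustration; only the pattern and the group names come from this post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KeyValDemo
{
    // The regex from this post, unchanged.
    public const string Pattern =
        @"\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))";

    // Hypothetical helper: returns one line per match, either
    // "other:<text>" or "key:<name>[<token>,<token>,...]".
    public static List<string> Describe(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            if (m.Groups["keyval"].Success)
            {
                var tokens = new List<string>();
                foreach (Capture c in m.Groups["token"].Captures)
                    tokens.Add(c.Value);
                result.Add("key:" + m.Groups["key"].Value
                           + "[" + string.Join(",", tokens) + "]");
            }
            else if (m.Groups["other"].Success)
            {
                result.Add("other:" + m.Groups["other"].Value);
            }
        }
        return result;
    }

    static void Main()
    {
        // The input from the question: two key/val pairs attached to one another.
        foreach (string line in Describe("test#Key(F1)#Key(F2)"))
            Console.WriteLine(line);
    }
}
```

For the input from the question this should give three matches: the text "test", then Key with token F1, then Key with token F2.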

And now for the explanation:

\G -> Make sure every new match is directly adjacent to the previous one, so we're not skipping invalid input
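A small sketch of what \G buys you, assuming you want to detect input the regex cannot handle: add up the lengths of all matches and compare the total with the length of the input. The class AnchorDemo, the method CoveredLength and the test string are invented for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // The regex from this post without the leading \G.
    public const string Body =
        @"((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))";

    // Hypothetical helper: total number of characters covered by all matches.
    public static int CoveredLength(string input, string pattern)
    {
        int covered = 0;
        foreach (Match m in Regex.Matches(input, pattern))
            covered += m.Length;
        return covered;
    }

    static void Main()
    {
        // "#b c(d)" looks like a key/val pair to the lookahead,
        // but the key contains a space, so it is not a valid one.
        string input = "a#b c(d)e";

        // With \G matching stops at the '#', so only "a" is covered.
        Console.WriteLine(CoveredLength(input, @"\G" + Body));

        // Without \G the engine silently skips the '#' and carries on.
        Console.WriteLine(CoveredLength(input, Body));

        Console.WriteLine(input.Length);
    }
}
```

With \G the covered length falls well short of the input length, which tells you where parsing stopped. Without \G the regex skips the offending character and continues, so the problem is much easier to miss.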

(?<other>) -> Match the 'other' text into a group named "other"
((?!#[^\(]+\([^\)]+\)).)+ -> Match every character that isn't the start of a key/val pair. I'm doing this by looking ahead to see if a keyval structure is found, and if it isn't I add one character to the match (.).

If we're at the end of an "other" section there are two options: either we've reached the end of the string, in which case the regex just stops matching, or we're at the start of a key/val pair.

(?<keyval>) -> match the whole key/val structure into a group named "keyval"

#(?<key>\w+)\( -> match the key and put it in a group named "key". The key comes directly after a "#" and consists of one or more word characters (\w+, so letters, digits and the underscore) followed by "("

((?<token>(?>\w+))\s*)+ -> Match every token into a group called "token". If this group captures multiple tokens they're added to the group's Captures collection in the order in which they're found. A token is made up of one or more word characters (\w+). It can be followed by zero or more whitespace characters. The (?>...) construction is used to prevent too much backtracking. The whole token-followed-by-whitespace can occur multiple times. As the final token will not have a space behind it, I used \s* rather than \s+.
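To make the difference between a group's Value and its Captures collection concrete, here is a short sketch using only the key/val part of the regex. The class TokenDemo, its two methods and the sample input are assumptions for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // The key/val part of the regex from this post.
    static readonly Regex KeyVal =
        new Regex(@"#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)");

    // Hypothetical helper: every token of the first key/val pair, in order.
    public static List<string> Tokens(string input)
    {
        var tokens = new List<string>();
        Match m = KeyVal.Match(input);
        if (m.Success)
        {
            foreach (Capture c in m.Groups["token"].Captures)
                tokens.Add(c.Value);
        }
        return tokens;
    }

    // Group.Value only reports the last capture of a repeated group.
    public static string LastToken(string input)
    {
        return KeyVal.Match(input).Groups["token"].Value;
    }

    static void Main()
    {
        string input = "#Key(F1 F2  F3)";
        Console.WriteLine(string.Join(",", Tokens(input))); // F1,F2,F3
        Console.WriteLine(LastToken(input));                 // F3
    }
}
```

So if you only read m.Groups["token"].Value you get the last token; the earlier ones are only available through Captures.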

\) -> and finally the closing parenthesis.

Keep in mind that if you use RegexOptions.IgnorePatternWhitespace, you can reflow the regex to make it easier to read. It's also easier to add comments that way. Note that in this mode an unescaped # starts a comment that runs to the end of the line, so the literal # has to be written as \#.

@"
(?# Continue where the previous match ended)
\G
(
    (?#
    Match any character until you find the start of
    a key/val pair.
    )
    (?<other>((?!\#[^\(]+\([^\)]+\)).)+)
    |
    (?#
    Match a key/val pair. Put the key name in a group
    and every token in another.
    )
    (?<keyval>\#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\))
)
";

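As a sketch of how a reflowed pattern might be used in practice, the program below passes RegexOptions.IgnorePatternWhitespace to Regex.Matches. It uses end-of-line # comments instead of (?# ...) blocks, which is why every literal # in the pattern is escaped as \#. The class name FreeSpacingDemo and the method KeysOf are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FreeSpacingDemo
{
    // Same regex as in this post, reflowed. Whitespace is ignored and
    // an unescaped # starts a comment, so the literal is written as \#.
    public const string Pattern = @"
        \G                                  # continue where the previous match ended
        (
            (?<other>                       # text that is not a key/val pair
                ( (?! \# [^\(]+ \( [^\)]+ \) ) . )+
            )
          |
            (?<keyval>                      # one key/val pair
                \# (?<key> \w+ ) \(
                ( (?<token> (?>\w+) ) \s* )+
                \)
            )
        )";

    // Hypothetical helper: the key names found in the input, in order.
    public static List<string> KeysOf(string input)
    {
        var keys = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern,
                                          RegexOptions.IgnorePatternWhitespace))
        {
            if (m.Groups["keyval"].Success)
                keys.Add(m.Groups["key"].Value);
        }
        return keys;
    }

    static void Main()
    {
        foreach (string key in KeysOf("test#Key(F1)#Other(F2 F3)"))
            Console.WriteLine(key);
    }
}
```

Forgetting the option, or forgetting to escape a #, changes the meaning of the pattern completely, so it is worth running a known input through it after every reflow.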
One alternative to this whole approach, which I haven't tested yet but which should work nonetheless, is to look only for the special key/val structures, using just the right-hand subexpression:

#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)

You then query the start and end position of each match to determine whether there were any other characters since the previous match, and extract those characters with a substring function. I'm not sure which option is faster, but I would not be surprised if the substring option performed even better, though it would require more code.
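A rough sketch of that substring variant, under the assumption that you want the pieces back in their original order with the key/val pairs kept intact. The class SubstringSplit and the method Split are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SubstringSplit
{
    // Only the key/val subexpression; everything in between is "other" text.
    static readonly Regex KeyVal =
        new Regex(@"#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)");

    // Hypothetical helper: splits the input into other text and key/val pairs.
    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        int last = 0; // end position of the previous match
        foreach (Match m in KeyVal.Matches(input))
        {
            // Anything between the previous match and this one is other text.
            if (m.Index > last)
                parts.Add(input.Substring(last, m.Index - last));
            parts.Add(m.Value);
            last = m.Index + m.Length;
        }
        // Trailing text after the final key/val pair.
        if (last < input.Length)
            parts.Add(input.Substring(last));
        return parts;
    }

    static void Main()
    {
        foreach (string part in Split("test#Key(F1)#Key(F2)tail"))
            Console.WriteLine(part);
    }
}
```

One difference in behaviour: text that merely resembles a key/val pair ends up in the other text here, whereas the \G version stops matching at that point.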

Jesse Houwing