Re: Regex greedy/lazy problem
- From: sbparsons <sbparsons@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Tue, 19 Jun 2007 06:30:08 -0700
Hi Kevin, and thanks for the response.
Yes - I was concerned about overcomplicating the message, so I omitted a few
rules.
Basically I have a series of files transferred through sockets and the
receiving socket is parsing the data as it arrives - and is not waiting for
the whole stream to arrive.
The files may be either ASCII or binary, but all are transferred as binary. The
receiving socket uses the GetString method on the byte array and parses that
when it determines that the start/end of a file is in the current chunk. So
there may well be angle brackets inside the string in addition to those
introduced by the sending socket.
Yes, the tags may be split, but I can handle the case when there is no match
easily enough.
I've taken a look at your solution and it doesn't appear to handle newline
characters in the content. From my reading it appears that the dot can treat
carriage returns as characters, but I am unsure what other constructs are
available for this.
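For reference, one way to cover both points (angle brackets inside the content, and line breaks) is a lazy content group followed by "end tag OR end of input". The sketch below is in Python rather than .NET, so the named groups are written (?P<name>...) instead of (?<name>...), and re.DOTALL plays the role that RegexOptions.Singleline plays in .NET (it lets the dot match "\n" as well). The function name split_chunk is made up for illustration, and the sketch assumes a tag is never split across chunks.

```python
import re

# Lazy content, terminated by either the end tag or the end of the chunk.
# re.DOTALL lets "." match newlines (the .NET counterpart is
# RegexOptions.Singleline). Assumes tags are never split across chunks.
CHUNK = re.compile(
    r"(?P<startTag><DALFile>)?"
    r"(?P<content>.*?)"
    r"(?:(?P<endTag></DALFile>)|\Z)",
    re.DOTALL,
)

def split_chunk(chunk):
    """Return (startTag, content, endTag); a missing tag is None."""
    m = CHUNK.match(chunk)
    return m.group("startTag"), m.group("content"), m.group("endTag")

whole = split_chunk("<DALFile>a<b>\r\nc</DALFile>")
head = split_chunk("<DALFile>a<b>\r\nc")
tail = split_chunk("rest of file</DALFile>")

print(whole)  # ('<DALFile>', 'a<b>\r\nc', '</DALFile>')
print(head)   # ('<DALFile>', 'a<b>\r\nc', None)
print(tail)   # (None, 'rest of file', '</DALFile>')
```

If a chunk ends partway through a tag (for example "...</DAL"), the fragment ends up in the content group, so that case still needs separate handling.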
Thanks again for the reply.
Sean
"Kevin Spencer" wrote:
You really haven't clarified your rules. Several things are not clear.
Will the "chunks" ever split the tags themselves?
What sort of characters may be in the "content" between the tags?
Making a couple of assumptions, I came up with the following:
(?:(?<startTag><DALFile>))?(?<content>[^<]*)(?:(?<endTag></DALFile>))?
This can be broken up into 3 sections:
(?:(?<startTag><DALFile>))?
0 or 1 sequence of "<DALFile>" - assumption that it is never broken.
(?<content>[^<]*)
0 or more characters that are NOT '<' - assumption that the '<' character
may not appear between the tags.
(?:(?<endTag></DALFile>))?
0 or 1 sequences of "</DALFile>" - assumption that it is never broken.
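The three sections above can be tried out with a small sketch. This is a Python translation, not the original .NET code: the named groups use (?P<name>...) and the helper name parse_chunk is hypothetical. It shows the pattern on a complete file, on a chunk with no end tag, and on content that breaks the "no '<' between the tags" assumption.

```python
import re

# The pattern from above, with .NET's (?<name>...) rewritten as
# Python's (?P<name>...). Every part is optional, so match() always succeeds.
PATTERN = re.compile(
    r"(?:(?P<startTag><DALFile>))?"
    r"(?P<content>[^<]*)"
    r"(?:(?P<endTag></DALFile>))?"
)

def parse_chunk(chunk):
    """Return (startTag, content, endTag); a missing tag is None."""
    m = PATTERN.match(chunk)
    return m.group("startTag"), m.group("content"), m.group("endTag")

complete = parse_chunk("<DALFile>line1\r\nline2</DALFile>")
partial = parse_chunk("<DALFile>line1\r\nli")
with_bracket = parse_chunk("<DALFile>a<b</DALFile>")

print(complete)      # ('<DALFile>', 'line1\r\nline2', '</DALFile>')
print(partial)       # ('<DALFile>', 'line1\r\nli', None)
print(with_bracket)  # ('<DALFile>', 'a', None)
```

In this translation the negated class [^<] matches carriage returns and line feeds, so line breaks in the content are captured; the content is cut short only when a '<' appears inside it.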
Why are you not waiting until you get all of the string to parse it, rather
than attempting to parse "chunks?"
--
HTH,
Kevin Spencer
Microsoft MVP
Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
"sbparsons" <sbparsons@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:E36CA675-E24A-4320-8618-A829631EA81D@xxxxxxxxxxxxxxxx
I have a scenario where a string is sent in chunks to my app. I need to be
able to identify certain tags in this partial string as it arrives.
eg
<DALFile>xxxxxxxxx</DALFile>
I need to be able to have a regex that will capture the start, middle and
end of this file based on the tags. The problem is that the end tag may not
always be present, and the content (xxxxxxx in this case) may contain
carriage returns.
My attempt so far would be along the lines:
(?<startTag><DALFile>)(?<content>.*)(?<endTag></DALFile>)?
This will work but will return the </DALFile> in the <content> group.
Making the .* lazy (i.e. .*?) will work but only if the end tag is present,
which may not always be the case as the string is chunked.
The following also works but if there's a carriage return in xxxxxx it does
not return all the content:
(?<startTag><DALFile>)(?<content>[^</DALFile>]*)(?<endTag></DALFile>)?
Would someone be able to point out a way that would suit all scenarios?
Thanks in advance.
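A note on the last pattern: [^</DALFile>] is a character class, so it excludes each of the individual characters <, /, D, A, L, F, i, l, e and >, not the sequence "</DALFile>". The sketch below is a Python translation (named groups written as (?P<name>...)) with made-up sample content; it suggests that, at least in this translation, the content is cut short by any of those characters rather than by the carriage return itself.

```python
import re

# Same shape as the last pattern above, in Python named-group syntax.
# [^</DALFile>] means "any one character that is not in the set
# {<, /, D, A, L, F, i, l, e, >}".
CLASS_PATTERN = re.compile(
    r"(?P<startTag><DALFile>)"
    r"(?P<content>[^</DALFile>]*)"
    r"(?P<endTag></DALFile>)?"
)

# The content stops at the 'F' of "File", so the end tag is never reached.
cut_short = CLASS_PATTERN.match("<DALFile>xxxx File 1</DALFile>")
# A line break is not in the excluded set, so it is captured normally.
with_newline = CLASS_PATTERN.match("<DALFile>xxxx\r\nyyyy</DALFile>")

print(repr(cut_short.group("content")))     # 'xxxx '
print(cut_short.group("endTag"))            # None
print(repr(with_newline.group("content")))  # 'xxxx\r\nyyyy'
```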