Re: Get regular expression
- From: "Kevin Spencer" <uce@xxxxxxx>
- Date: Wed, 21 Jun 2006 13:14:12 -0400
Hi Mike,
As far as Top-Posting is concerned, AFAIK it's still a matter of debate, and
as we're talking about Netiquette, not ISO or W3C standards, my personal
feeling is that anyone who scolds one about top- or bottom-posting has a poor
sense of priority. After all, the purpose of groups such as this is
communication. I find it far more difficult to deal with poor communication
than with the format of a post, but that's just me! ;-)
In your case, you have done a pretty darned good job of communication, and I
appreciate that, so I will certainly do all I can to help out! I did have to
do a little research into ICD9, but that wasn't hard with Google.
It took me a few minutes of study to figure out (for the most part) what
your requirements are. Let me see if I can repeat them back to you in my own
words, and ask a couple of questions:
1. You have a set of data that is pure text, and is either stored in an
actual database, or in the text equivalent of a database as a multi-line
text document. I can't be exactly sure.
2. In any case, this data consists of multiple single-line entries of text.
3. The data is stored in such a way that the text represents a hierarchical
structure of nodes.
4. This is achieved by a top-level classification that is repeated in each
"record" (line) for every record that falls under it.
5. Sub-nodes are indicated in the same way by the first text that follows
the top-level node text.
6. The node identifier text in the sub-nodes can be identified by comparing
it with other records that are under the top-level node. There is no other
way to distinguish this text from any other text in the record, other than
by comparing it with other records.
7. Therefore, the structure of the hierarchy can be inferred by using a
recursive procedure that identifies increasingly "deep" sub-nodes within the
set of records.
8. (Now here's where I'm a bit fuzzy). Your task is to put all of this into
some form of data structure that can be used as an index, probably a
hierarchical structure such as a tree.
Question: Will these records be ordered in any way? IOW, for example, will
they be ordered alphabetically? If they are ordered alphabetically, the
structure is already present, by virtue of the rules as stated above.
Otherwise, it will be necessary to do some form of re-scanning of the data.
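A minimal sketch of why the ordering question matters, written in Python purely for illustration (the thread itself concerns .NET). The helper name shared_leading_words and the three sample records are illustrative choices, not part of Mike's system. Once the records are sorted, entries that share a leading phrase sit next to each other, so comparing each record with its predecessor is enough to see where a branch continues:

```python
def shared_leading_words(a, b):
    """Count how many leading whitespace-separated words two records share."""
    count = 0
    for x, y in zip(a.split(), b.split()):
        if x != y:
            break
        count += 1
    return count

# Three records from the thread, deliberately out of order.
records = [
    "ABLATION PITUITARY BY COBALT-60 92.32",
    "ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23",
    "ABLATION PITUITARY 7.69",
]

ordered = sorted(records)

# Overlap between each record and the one before it.
overlaps = [shared_leading_words(p, c) for p, c in zip(ordered, ordered[1:])]
print(ordered)
print(overlaps)  # [1, 2]
```

An overlap of 1 means only the top-level word is shared; an overlap of 2 means the second record continues the branch opened by the first. With unsorted input the same comparison would have to be made against every other record rather than just the neighbour.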
Question: Can you tell me what sort of format the end result is supposed to
be in? Is it simply a data structure in memory? Or what?
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist
I recycle.
I send everything back to the planet it came from.
"Mike" <msgrinnell@xxxxxxxxxxx> wrote in message
news:1150899561.715476.17480@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Must say I get burned in six different ways. Some groups I top post
and get scolded. On other groups other people top post and nobody
appears to have a problem. I'll top post here.
Given I've been asked for details I'll provide them, but typically
nobody wants to wade through them.
In the dark ages I had 24,000 lines of ICD9 index entries which got
appended with ICD9 codes and were processed one time per year into a
big paper report with a tree-like structure by an assembler program on
an OS390. An abbreviated example of the report is below for the
Ablation entry.
Ablation
  Endometrial (Hysteroscopic) 68.23
  Heart (Conduction Defect) 37.33/2
    With Catheter 37.34/2
  Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
    By Injection 20.72
  Lesion Heart
    By Peripherally Inserted Catheter 37.34
Across my institution in the past there have been multiple "master"
copies of ICD9 codes and index entries. The order came down that
long-term we will work towards a single copy of ICD9 codes with index
entries that will be accessed via webservices. The structure of the
data in our old database was as follows (no line breaks -- each entry
was one line):
ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC) APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION (TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91
The new webservices still have this same index structure except now,
for example, "Ablation Vesicle Neck (ANAT = 60.02)" is just a property
of code 57.91. The surgical coders still want to view the index
entries in a tree structure on demand. Without getting into
mind-numbing details, I can jump through some hoops and get back a set
of index entries that look like above for ABLATION but they are not
formatted in the way the surgical coders desire. I believe I have a
recursive algorithm that will work to format these into a tree
structure but this algorithm is predicated on being able to find the
nodes.
If you look carefully, the root node for the entire set of index entries
above is "ABLATION" (as that is what begins each entry and repeats
across all of them). Subsequently, Endometrial (Hysteroscopic) + code
is a child of ABLATION with no children of its own because it is not
repeated. Next, Heart (Conduction Defect) + code is a node with "With
Catheter + code" as a child of that node because "Heart (Conduction
Defect)" repeats across both those lines.
I have begged the group that now owns the webservice to allow me to
restructure the data but no go (they say that would be bastardizing the
concept of everything being 'code-centric'). I am stuck with this and
also with the demand by the coders that they get the formatted tree
structure to look at when they code.
In general, I think if I do the following I can figure out the nodes
and children:
1. Read index entries until the first word changes.
2. Get the substring that begins the string and is repeated elsewhere
in the string (this is the node).
3. Remove that node and keep processing until the base case is hit etc.
If anyone has any better ideas of how to deal with this I would be
thrilled to no end to hear them.
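The three steps above can be sketched roughly as follows. This is one possible reading of the approach, in Python for illustration only, and it rests on assumptions that are not stated in the thread: the last whitespace-separated token of every entry is the code, node boundaries fall on word boundaries, and the entries arrive ordered so that shared prefixes are adjacent. The names split_entry, common_prefix and build_tree are hypothetical.

```python
from itertools import groupby

def split_entry(line):
    """Split 'ABLATION PITUITARY 7.69' into (['ABLATION', 'PITUITARY'], '7.69').
    Assumes the final token is always the code."""
    *words, code = line.split()
    return words, code

def common_prefix(word_lists):
    """Longest run of leading words shared by every list in word_lists."""
    prefix = []
    for column in zip(*word_lists):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def build_tree(entries):
    """entries: list of (words, code) with shared prefixes adjacent.
    Returns a list of (label, code_or_None, children) tuples."""
    nodes = []
    # Step 1: read entries until the first word changes.
    for _, group in groupby(entries, key=lambda e: e[0][0]):
        group = list(group)
        # Step 2: the text repeated across the group is the node.
        prefix = common_prefix([words for words, _ in group])
        n = len(prefix)
        code, rest = None, []
        for words, c in group:
            if len(words) == n:
                code = c                      # this entry *is* the node
            else:
                rest.append((words[n:], c))   # Step 3: remove the node text
        # Recurse on the remainders; an empty list is the base case.
        nodes.append((" ".join(prefix), code, build_tree(rest)))
    return nodes

lines = [
    "ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23",
    "ABLATION HEART (CONDUCTION DEFECT) 37.33/2",
    "ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2",
    "ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4",
    "ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72",
]
tree = build_tree([split_entry(line) for line in lines])
print(tree)
```

For these five lines the result is a single ABLATION root with no code of its own and three children, two of which have one child each. The sketch does not handle two entries with identical text but different codes (the later code would overwrite the earlier one), and a node with only one descendant line is merged into that line rather than shown as a separate level.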
Thanks,
Mike
Kevin Spencer wrote:
I want to access the expression "HEART (CONDUCTION DEFECT)" I'll try
your suggestion first off in the morning.
First, "HEART (CONDUCTION DEFECT)" is not an expression. That is a
substring of the original string. The regular expression is the string
"^(.+)(?=\s*).*\1" that you are using to get your match. Assuming that
"HEART (CONDUCTION DEFECT)" is your match (which it is not), you could
call it a match for the regular expression (which may match more than
once in a string). But it is a substring of the original string. It may
seem picky, but in order to communicate effectively, one must use the
right terms. As an example, if I told you that I ate a car for breakfast,
would you know that I ate an apple?
Second, the string you posted contains 2 instances of the substring
"HEART (CONDUCTION DEFECT)". Do you want to get both of them? If so, what
exactly are your pattern-matching rules? A regular expression matches a
pattern. Obviously, not all of the strings you will be working with will
be:
" HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 "
In fact, probably due to this being a newsgroup, and my using a
newsreader, I would doubt that the line breaks in the string are where
they are, if they are. And I have to wonder whether the string actually
begins and ends with a space.
In other words, you're going to be using a regular expression to isolate
substrings of various strings (most probably). A regular expression is
shorthand for a set of rules that defines a pattern you're looking for.
Whether the strings contain line breaks, for example, is important. Your
regular expression begins with the caret '^' character. This character
can indicate the beginning of a string, or the beginning of a line *or* a
string, depending upon what options you use. You didn't specify the
option(s) you're using, so we have no way to know.
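To make the point about the caret concrete, here is a small sketch in Python (used only for illustration; in .NET the corresponding switch is RegexOptions.Multiline). The two-line sample text and the pattern ^\w+ are made up for the demonstration and are not Mike's actual input or pattern:

```python
import re

# A string containing one line break, loosely modelled on the wrapped example.
text = "HEART (CONDUCTION\nDEFECT) 37.33/2"

# Default: '^' anchors only at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# re.MULTILINE: '^' also anchors immediately after each line break.
multiline_hits = re.findall(r"^\w+", text, re.MULTILINE)

print(default_hits)    # ['HEART']
print(multiline_hits)  # ['HEART', 'DEFECT']
```

The same pattern therefore finds one starting point or two depending solely on the option, which is why the option in use has to be known before the behaviour of the original pattern can be predicted.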
In addition, your pattern is not likely to work in the way you expect.
For example, the following would match:
THIS IS NOT WHAT THIS IS SUPPOSED TO BE. (Matches the phrase "THIS IS ")
And in addition, if there are line breaks, like your example (as split by
the newsreader), the matching substring would be:
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
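The "THIS IS" case above can be checked with a short sketch. It is written in Python for illustration; the pattern is the one from the thread, and Python's handling of greedy quantifiers and numbered backreferences is assumed to behave the same way as .NET's for this simple, single-line input:

```python
import re

# The pattern from the thread, unchanged.
pattern = re.compile(r"^(.+)(?=\s*).*\1")

sentence = "THIS IS NOT WHAT THIS IS SUPPOSED TO BE."
m = pattern.search(sentence)

captured = m.group(1)  # longest leading text that occurs again later
matched = m.group(0)   # everything from the start through the repeat

print(repr(captured))  # 'THIS IS '
print(repr(matched))   # 'THIS IS NOT WHAT THIS IS '
```

Note that the captured group keeps its trailing space, and that the lookahead (?=\s*) can never fail because \s* is allowed to match nothing, so it has no effect on what the pattern accepts.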
So, can you explain what your rules are, and what you are trying to match
here? I'm just guessing that you're parsing medical transcriptions, but
beyond that, I'm stumped.
--
HTH,
Kevin Spencer
"Mike" <msgrinnell@xxxxxxxxxxx> wrote in message
news:1150843139.613020.123120@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Xicheng Jia wrote:
Mike wrote:
I have a regular expression (^(.+)(?=\s*).*\1 ) that results in
matches. I would like to get what the actual regular expression is.
In other words, when I apply ^(.+)(?=\s*).*\1 to " HEART (CONDUCTION
DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2 " the expression is "HEART (CONDUCTION DEFECT)". How
do I gain access to the expression (not the matches) at runtime?
you want to access the expression "HEART (CONDUCTION DEFECT)" or the
regex "^(.+)(?=\s*).*\1" at run-time?? don't think you can get exactly
the latter one though, for the previous one, you can use named capture,
like
^(?<expr>.+)(?=\s*).*\k<expr>
and access the variable "expr" at run time?
Xicheng
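A rough illustration of the named-capture idea, in Python rather than .NET. Python spells the named group (?P<expr>...) and the backreference (?P=expr), whereas the .NET form shown above is (?<expr>...) and \k<expr>. The input is assumed here to be a single line with no leading or trailing space, which may not be what Mike's real data looks like:

```python
import re

# Python spelling of the named-capture pattern suggested above.
pattern = re.compile(r"^(?P<expr>.+)(?=\s*).*(?P=expr)")

# Assumption: the two index entries arrive joined on one line.
line = ("HEART (CONDUCTION DEFECT) 37.33/2 "
        "HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2")

m = pattern.search(line)

captured = m.group("expr")  # the repeated leading text, by name
node = captured.strip()     # drop the trailing space the group picks up

print(repr(captured))  # 'HEART (CONDUCTION DEFECT) '
print(repr(node))      # 'HEART (CONDUCTION DEFECT)'
```

The group value is read from the match object at run time; in .NET the equivalent lookup would go through the match's Groups collection using the same name.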
I want to access the expression "HEART (CONDUCTION DEFECT)" I'll try
your suggestion first off in the morning.
Thanks,