Re: How to efficiently determine if a string contains any one of many strings
- From: <kenfine@xxxxxxxxxxxxx>
- Date: Wed, 20 Jun 2007 14:06:23 -0700
Thanks Peter. Your response is most generous and appreciated, and is exactly
the kind of insight I was looking for. As a Fine Arts major, "computer
science" was not something I had the opportunity to study rigorously.
Learning about it now is pretty darned interesting.
Perhaps I could press you or others to speculate on one other CS-ish thing.
I am interested in machine classification of unknown texts. Some kinds of
classifications lend themselves to searches for string literals: "Department
of Oncology", or "Peter L. Ward" or "Lewis Hall." Other kinds of
classifications are less clear via exact string matching: we have things
like "Physical Science" or "Campus" or "Sports."
Spam filters somehow analyze texts and come up with a numeric score that
assesses an unknown piece of content's likelihood of being spam -- let's
call it "Spamminess". I am interested in analyzing unknown text and being
able to assess their "Campus-ness" or "Physical Science-ness" or
"Sports-ness".
I could imagine this working at least two ways:
1) Develop a series of rules for something like the category "Science" --
e.g. if the unknown article contains X and Y and Z metadata terms, give one
point for each instance of the term and divide by the number of words. In
this scenario you spend a lot of time thinking about and programming good
rules.
2) I have large collections of known texts which are already classified with
metadata pertaining to topics. Turn the machine on those texts and have it
derive rules and observations from that pre-existing base of content. Use
those rules for scans of unknown texts.
#2 sounds a lot more interesting but I have no idea how to proceed, what
classics of CS literature I should be looking at, or whether it is even
viable. If you have a starting point, I'd be very interested.
Thanks again for your help, much appreciated.
-KF
"Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx> wrote in message
news:op.tt8ltqh38jd0ej@xxxxxxxxxxxxxxxxxxxxxxx
On Wed, 20 Jun 2007 11:48:55 -0700, <kenfine@xxxxxxxxxxxxx> wrote:
I am interested in scanning web pages for content of interest, and then
auto-classifying that content. I have tables of metadata that I can use
for
the classification, e.g. : "John P. Jones" "Jane T. Smith" "Fred
Barzowsky"
"Department of Oncology" "Office of Student Affairs" "Lewis Hall" etc.
I am wondering what the efficient way to do this in code might be. The
dumb
and brute-force way would be to loop through the content with each of
thousands of these metadata-items-turned-search terms. This sounds very
slow. I am wondering if there is a way to do a simultaneous comparison
all
at once.
Since we've already covered the "don't bother optimizing until you know
you need to" part of the question... :)
I would expect that this is a standard algorithm, akin to spell-checking.
I admit, I tend not to be as up on the "science" side of "computer
science" as I ought to be, but I'd guess you'd find the solution in a book
like "Programming Pearls", etc.
Ignoring the likelihood that someone's already figured out a good,
efficient algorithm to do this... :)
To me it seems like a good opportunity to use a state diagram sort of
thing. Based on your input strings, you'd build a state diagram in which
the next state from a given state is determined by the next input
character. For example, starting from the base state, reading a "J" would
send you to the next state that is shared by the possible input strings
"John P. Jones" and "Jane T. Smith". From that state, reading the next
character would send you to one of two states, one used by "John P. Jones"
and the other used by "Jane T. Smith".
Of course, assuming more input strings to match, you'd have a lot more
shared states. But the general idea is the same. At some point, you'd
reach a terminal state that would identify a matched string. If you want
to allow matches where some of your search strings contain other search
strings, then you'd have to identify those cases and have a given terminal
state identify multiple matches. If you want to allow matches to overlap,
then things get really complicated. :) I think at that point, you'd have
to provide for the ability to have multiple state diagrams active
concurrently (not difficult, just more complicated).
Using a mechanism like that, you'd only have to scan through the string
once, using each character to drive the state diagram(s).
Note that this technique would work best when you have a constant set of
strings to match, which you apply repeatedly to a large number of input
texts. It sounds like that's the scenario you're in, but if not then
perhaps a brute-force, or indexing-based solution would work as well, or
even better. The state diagram solution is fairly memory hungry, and
requires some processing time up-front to create the state diagram.
Pete
.
- Follow-Ups:
- Re: How to efficiently determine if a string contains any one of many strings
- From: Peter Duniho
- Re: How to efficiently determine if a string contains any one of many strings
- From: Nicholas Paldino [.NET/C# MVP]
- Re: How to efficiently determine if a string contains any one of many strings
- References:
- Prev by Date: Re: Parsing a string to get the month
- Next by Date: Re: Service fails if user not logged in
- Previous by thread: Re: How to efficiently determine if a string contains any one of many strings
- Next by thread: Re: How to efficiently determine if a string contains any one of many strings
- Index(es):
Relevant Pages
|
Loading