Re: How to efficiently determine if a string contains any one of many strings
- From: "Nicholas Paldino [.NET/C# MVP]" <mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 20 Jun 2007 17:26:35 -0400
If you are looking to apply an algorithm similar to determining what is
spam (in your case, the same algorithm, but using it to classify it as
something else), then you might want to google the term "bayesian". This is
the algorithm that is used in most heuristic spam filters.
Even better, search for "bayesian spam" and the second item that is
returned (on google) should give you some open source projects you can use
to implement the algorithm and tweak it to your needs. Here is the direct
link, in case you can't find it:
http://spambayes.sourceforge.net/windows.html
Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx
<kenfine@xxxxxxxxxxxxx> wrote in message
news:u8NUc73sHHA.3504@xxxxxxxxxxxxxxxxxxxxxxx
Thanks Peter. Your response is most generous and appreciated, and is
exactly the kind of insight I was looking for. As a Fine Arts major,
"computer science" was not something I had the opportunity to study
rigorously. Learning about it now is pretty darned interesting.
Perhaps I could press you or others to speculate on one other CS-ish
thing. I am interested in machine classification of unknown texts. Some
kinds of classifications lend themselves to searches for string literals:
"Department of Oncology", or "Peter L. Ward" or "Lewis Hall." Other kinds
of classifications are less clear via exact string matching: we have
things like "Physical Science" or "Campus" or "Sports."
Spam filters somehow analyze texts and come up with a numeric score that
assesses an unknown piece of content's likelihood of being spam -- let's
call it "Spamminess". I am interested in analyzing unknown text and being
able to assess their "Campus-ness" or "Physical Science-ness" or
"Sports-ness".
I could imagine this working at least two ways:
1) Develop a series of rules for something like the category "Science" --
e.g. if the unknown article contains X and Y and Z metadata terms, give
one point for each instance of the term and divide by the number of words.
In this scenario you spend a lot of time thinking about and programming
good rules.
2) I have large collections of known texts which are already classified
with metadata pertaining to topics. Turn the machine on those texts and
have it derive rules and observations from that pre-existing base of
content. Use those rules for scans of unknown texts.
#2 sounds a lot more interesting but I have no idea how to proceed, what
classics of CS literature I should be looking at, or whether it is even
viable. If you have a starting point, I'd be very interested.
Thanks again for your help, much appreciated.
-KF
"Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx> wrote in message
news:op.tt8ltqh38jd0ej@xxxxxxxxxxxxxxxxxxxxxxx
On Wed, 20 Jun 2007 11:48:55 -0700, <kenfine@xxxxxxxxxxxxx> wrote:
I am interested in scanning web pages for content of interest, and then
auto-classifying that content. I have tables of metadata that I can use
for
the classification, e.g. : "John P. Jones" "Jane T. Smith" "Fred
Barzowsky"
"Department of Oncology" "Office of Student Affairs" "Lewis Hall" etc.
I am wondering what the efficient way to do this in code might be. The
dumb
and brute-force way would be to loop through the content with each of
thousands of these metadata-items-turned-search terms. This sounds very
slow. I am wondering if there is a way to do a simultaneous comparison
all
at once.
Since we've already covered the "don't bother optimizing until you know
you need to" part of the question... :)
I would expect that this is a standard algorithm, akin to spell-checking.
I admit, I tend not to be as up on the "science" side of "computer
science" as I ought to be, but I'd guess you'd find the solution in a
book like "Programming Pearls", etc.
Ignoring the likelihood that someone's already figured out a good,
efficient algorithm to do this... :)
To me it seems like a good opportunity to use a state diagram sort of
thing. Based on your input strings, you'd build a state diagram in which
the next state from a given state is determined by the next input
character. For example, starting from the base state, reading a "J"
would send you to the next state that is shared by the possible input
strings "John P. Jones" and "Jane T. Smith". From that state, reading
the next character would send you to one of two states, one used by "John
P. Jones" and the other used by "Jane T. Smith".
Of course, assuming more input strings to match, you'd have a lot more
shared states. But the general idea is the same. At some point, you'd
reach a terminal state that would identify a matched string. If you want
to allow matches where some of your search strings contain other search
strings, then you'd have to identify those cases and have a given
terminal state identify multiple matches. If you want to allow matches
to overlap, then things get really complicated. :) I think at that
point, you'd have to provide for the ability to have multiple state
diagrams active concurrently (not difficult, just more complicated).
Using a mechanism like that, you'd only have to scan through the string
once, using each character to drive the state diagram(s).
Note that this technique would work best when you have a constant set of
strings to match, which you apply repeatedly to a large number of input
texts. It sounds like that's the scenario you're in, but if not then
perhaps a brute-force, or indexing-based solution would work as well, or
even better. The state diagram solution is fairly memory hungry, and
requires some processing time up-front to create the state diagram.
Pete
.
- References:
- Prev by Date: DataTable properties and methods
- Next by Date: Re: DataTable properties and methods
- Previous by thread: Re: How to efficiently determine if a string contains any one of many strings
- Next by thread: Re: How to efficiently determine if a string contains any one of many strings
- Index(es):
Relevant Pages
|