Re: Checking if a list of names appears in a body of text.




"Brent" <writebrent@xxxxxxxxx> wrote in message news:412800b8-0dde-4c8c-ad53-66c257d02021@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I have a list of company names (say, IBM, Corning, General Motors, and
another 5,000 of them).

If I take a body of text, a news article, for instance, and I want to
see which company names appear in that text, is there an efficient way
to do this?

I thought about looping through the array of names, and doing an
IndexOf or Regex match, but this method is slow. Then I thought about
an array intersection, but this is problematic for two-word company
names (you can't just create the second array based on a split on
spaces).

Any hints would be much appreciated!


Finding the phrase
How do we use the MakePattern method to find our phrase? Let's suppose that we aren't interested in where the phrase occurs, or whether it occurs several times, but just whether or not it appears at all. So our approach will be to take the original phrase, turn it into a pattern, match the pattern, and return true if the pattern has been matched:

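// Requires: using System.Text.RegularExpressions;
public Boolean PhraseFound(String argPhrase, String argText)
{
    // Turn the phrase into a regex pattern, then test the text against it.
    String strPattern = MakePattern(argPhrase);
    Match match = Regex.Match(argText, strPattern);
    return match.Success;
}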
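The MakePattern method itself is not shown in the quoted text, so what follows is only a guess at what such a method might do, not the original author's code. This sketch assumes the pattern should match the phrase's words literally, tolerate any run of whitespace between them, and reject matches embedded in a longer word. The class name PhraseMatcher is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseMatcher
{
    // Hypothetical stand-in for MakePattern; the original method is not shown in this thread.
    public static string MakePattern(string phrase)
    {
        // Split on whitespace and escape each word so characters such as '.' or '+'
        // are matched literally instead of being treated as regex operators.
        var words = phrase
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);

        // Allow any run of whitespace between words, and require that the phrase
        // is not preceded or followed by a word character.
        return @"(?<!\w)" + string.Join(@"\s+", words) + @"(?!\w)";
    }

    public static bool PhraseFound(string phrase, string text)
    {
        // Case-sensitive, like the Regex.Match call above.
        return Regex.IsMatch(text, MakePattern(phrase));
    }
}
```

With this version, PhraseMatcher.PhraseFound("General Motors", text) would accept a line break between the two words but would not report "IBM" inside "IBMer". Pass RegexOptions.IgnoreCase if case should not matter.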



I used Regex.Match to find the occurrence of a word in a text field or variable. Regex can also report the position of each match, so you can use something like a RichTextBox, move to each word in the textbox, and highlight every occurrence.
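As a rough illustration of getting positions, the sketch below combines a list of names into one alternation pattern and collects each match's text, start index, and length using Regex.Matches. NameFinder and FindNames are hypothetical names, and combining all names into a single pattern is my own assumption rather than something suggested earlier in the thread. I have not measured how it performs with 5,000 names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameFinder
{
    // Hypothetical helper: returns (matched text, start index, length) for every occurrence.
    public static List<Tuple<string, int, int>> FindNames(IEnumerable<string> names, string text)
    {
        // Longest names first, so "General Motors" is tried before a shorter name like "General".
        var alternatives = names
            .OrderByDescending(n => n.Length)
            .Select(Regex.Escape);

        // One pattern for all names; the lookarounds reject matches inside longer words.
        string pattern = @"(?<!\w)(?:" + string.Join("|", alternatives) + @")(?!\w)";

        var hits = new List<Tuple<string, int, int>>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            // Index and Length are the values a text control needs to select the match.
            hits.Add(Tuple.Create(m.Value, m.Index, m.Length));
        }
        return hits;
    }
}
```

Each hit's index and length can then be passed to something like RichTextBox.Select(start, length) to position to the word and highlight it.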




