Re: CONTAINS and non-wildcard asterisk

Tech-Archive recommends: Repair Windows Errors & Optimize Windows Performance

From: John Kane (jt-kane_at_comcast.net)
Date: 08/04/04


Date: Tue, 3 Aug 2004 19:38:16 -0700

Johnny,
You can always remove the $ 'dollar sign' from the noise.enu (US_English)
file and then run a full population and see if you get different results.
Note, using the .NET framework's char.IsLetterOrDigit to determine if a
character is searchable is a very good idea!

Regards,
John

"Johnny Nielsen" <DONT_WRITE_THIStrap402@megabit.dk> wrote in message
news:#F121uTeEHA.2908@TK2MSFTNGP10.phx.gbl...
> I have only tested the CONTAINS clause on a Windows Server 2003 (SQL 2000
> SP3a).
>
> In the absense of proper ducumentation I have decided to use the .NET
> frameworks char.IsLetterOrDigit to determine if a character is searchable.
> But having found a '$' in the noise file I'm not sure if this is the right
> approach...
>
>
> Best regards
> Johnny Nielsen
>
> "John Kane" <jt-kane@comcast.net> wrote in message
> news:OHtNUSQeEHA.2352@TK2MSFTNGP09.phx.gbl...
> > Hilary,
> > An interesting deduction... Did you test on both Win2K and Win2003 (or
> > WinXP) as there are different OS-supplied wordbreakers for these OS
> > platforms that might result in different indexing strategies.
> >
> > Johnny, what is the OS platform that your SQL Server (select @@version)
is
> > installed on?
> >
> > Regards,
> > John
> >
> >
> > "Hilary Cotter" <hilaryk@att.net> wrote in message
> > news:O$0froMeEHA.3512@TK2MSFTNGP12.phx.gbl...
> > > According to my tests with the neutral and US Englishword breaker the
*
> is
> > > treated as white space when you have a single search argument and you
> > > wildcard it, and you wrap it in single quotes.
> > >
> > > In other words a search on test will return the same hits on a search
on
> > > test*.
> > >
> > > If you wrap the phrase with double quotes you will get wildcarding,
but
> it
> > > will still match with test*.
> > >
> > > Think of it this way. While indexing when MSSearch sees non white
space,
> > non
> > > alphanumeric characters it strips them out and indexes the
alphanumeric
> > > characters.
> > >
> > > So test* is indexed as test and is indistinguishable from test.
> > >
> > > When you search on test* the * is stripped off and the search is
> performed
> > > on test.
> > >
> > > Confusing I know.
> > >
> > > HTH
> > > --
> > > Hilary Cotter
> > > Looking for a book on SQL Server replication?
> > > http://www.nwsu.com/0974973602.html
> > >
> > >
> > > "Johnny Nielsen" <DONT.WRITE.THIStrap402@megabit.dk> wrote in message
> > > news:%236Y8P9KeEHA.2812@tk2msftngp13.phx.gbl...
> > > > "Hilary Cotter" <hilaryk@att.net> wrote in message
> > > > news:%23itsZkJeEHA.3512@TK2MSFTNGP12.phx.gbl...
> > > > > Yes that is correct. The subtlety is that non alphanumeric
> characters
> > > with
> > > > > some exceptions are treated as white space.
> > > >
> > > > ..but as I quoted from the MS online books the asterisk is *not*
> treated
> > > as
> > > > white space when used inside single-quotes - that's my point. To
> repeat
> > > the
> > > > MS statement from the online books, that I originally questioned:
[If
> > the
> > > > text and asterisk are not delimited by double quotation marks, as in
> > > > CONTAINS (column, 'text*'), full-text search considers the asterisk
as
> a
> > > > character and will search for exact matches to text*.]
> > > >
> > > >
> > > > > So test and test* are both indexed as test. So a search on test
will
> > > > return
> > > > > hits to test and test*.
> > > >
> > > > Well, according to MS the asterisk in single-quotes is treated as a
> > > > character and therefore test and test* should be considered two
> > different
> > > > words. That is, contains(*,'test*') should never match the word
test.
> > > >
> > > >
> > > > > All text (even ascii) is indexed as its unicode equivalent. Its
> > > > punctuation
> > > > > marks which are treated as white space and you get this problem
> with.
> > > >
> > > > Sadly this is rather an additional problem. Building a parser is
hard
> > job
> > > > when the precise definition of what is considered punctuation
symbols
> > and
> > > > indexable characters isn't available to the public :-(
> > > >
> > > >
> > > > Best regards
> > > > Johnny Nielsen
> > > >
> > > > >
> > > > > --
> > > > > Hilary Cotter
> > > > > Looking for a book on SQL Server replication?
> > > > > http://www.nwsu.com/0974973602.html
> > > > >
> > > > >
> > > > > "Johnny Nielsen" <DONT.WRITE.THIStrap402@megabit.dk> wrote in
> message
> > > > > news:u5Jc8EBeEHA.1356@TK2MSFTNGP09.phx.gbl...
> > > > > > "Hilary Cotter" <hilaryk@att.net> wrote in message
> > > > > > news:%23I4gCIAeEHA.3132@TK2MSFTNGP11.phx.gbl...
> > > > > > > No, I am saying Select * from tablename where
> contains(*,'test*')
> > > will
> > > > > > > return hits to test and test*.
> > > > > >
> > > > > > I'm sorry Hillary, but now I'm really confused. According to the
> MS
> > > > > > documentation that you seem to agree upon the second asterisk in
> > > > > > CONTAINS(*,'test*') isn't acting as a wildcard but is a part of
> the
> > > word
> > > > > > searched for.
> > > > > >
> > > > > > Then how come it will match the word test without an appended
> > asterisk
> > > ?
> > > > > >
> > > > > > That would be similar to CONTAINS(*,'server') matching the word
> > serve
> > > > > > without the appended 'r'.
> > > > > >
> > > > > >
> > > > > > > With Select * from tablename where contains(*,'"test*"') I
will
> > get
> > > > hits
> > > > > > to
> > > > > > > test, test*, and tested as it is now wildcarding
> > > > > >
> > > > > > I agree and I have now problem understanding the functionality
of
> > the
> > > > > > double-quotes - they seem to work as expected :-)
> > > > > >
> > > > > > It is only the functionality of the single-quoted cases that
> > troubles
> > > > me.
> > > > > >
> > > > > >
> > > > > > > You can't really change the behavior of the wildcarding by
> joining
> > > it
> > > > to
> > > > > > > other clauses, unless perhaps you do a join to a like that
looks
> > > like
> > > > > this
> > > > > > >
> > > > > > > like '% test %'
> > > > > >
> > > > > > OK, I just tried to find out what you meant by "the behavior
> changes
> > > if
> > > > > you
> > > > > > have multiple search arguements in your search phrase" .. you
> didn't
> > > > mean
> > > > > > that the behavior of the single-quoted term then.
> > > > > >
> > > > > >
> > > > > > PS: This has nothing to do with the current subject, but do you
> > happen
> > > > to
> > > > > > know the valid range of full-text searchable unicode characters
?
> I
> > > > can't
> > > > > > seem to find a clear definition in the MS documentation .. not
> even
> > a
> > > > > > definition of 'punctuation' symbols.
> > > > > >
> > > > > >
> > > > > > Best regards
> > > > > > Johnny Nielsen
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Hilary Cotter
> > > > > > > Looking for a book on SQL Server replication?
> > > > > > > http://www.nwsu.com/0974973602.html
> > > > > > >
> > > > > > >
> > > > > > > "Johnny Nielsen" <DONT.WRITE.THIStrap402@megabit.dk> wrote in
> > > message
> > > > > > > news:OXHRNu$dEHA.3864@TK2MSFTNGP10.phx.gbl...
> > > > > > > > "Hilary Cotter" <hilaryk@att.net> wrote in message
> > > > > > > > news:Oz0AhW8dEHA.3664@TK2MSFTNGP12.phx.gbl...
> > > > > > > > > This is correct for a single search argument in your
search
> > > > phrase.
> > > > > > > >
> > > > > > > > So, just to be sure I understand you correctly - if you
> perform
> > a
> > > > > SELECT
> > > > > > *
> > > > > > > > FROM SOMETABLE WHERE CONTAINS(*,'test*') you will only match
> > words
> > > > > > > > containing the five characters 'test*' and not the
> > four-character
> > > > word
> > > > > > > > 'test' ?
> > > > > > > >
> > > > > > > > On my system CONTAINS(*,'test*') and CONTAINS(*,' "test*" ')
> > seems
> > > > to
> > > > > > > return
> > > > > > > > the same results (it treats the asterisk in 'test*' as a
> > wildcard
> > > > > which
> > > > > > is
> > > > > > > > not what I want).
> > > > > > > >
> > > > > > > >
> > > > > > > > > The behavior changes if you have multiple search
arguements
> in
> > > > your
> > > > > > > search
> > > > > > > > phrase.
> > > > > > > >
> > > > > > > >
> > > > > > > > Can the behavior of the single-quoted simple term 'test*'
> change
> > > by
> > > > > > > joining
> > > > > > > > it to other arguments ?
> > > > > > > >
> > > > > > > > I know the meaning of double-quotes in prefix terms, but I'm
> > only
> > > > > > > interested
> > > > > > > > in the single-quoted non-prefix cases (I'm writing a
> > > > search-condition
> > > > > > > parser
> > > > > > > > which is why the specific workings are important to me).
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > With double quotation marks wrapping around your search
> phrase
> > > it
> > > > > will
> > > > > > > > > wildcard, so results will match the prefix or beginnings
> (what
> > > > comes
> > > > > > > > before
> > > > > > > > > the *), but will have different suffixes or endings, and
it
> > will
> > > > > match
> > > > > > > > with
> > > > > > > > > words or tokens in your table that has an * at the end.
The
> *
> > > will
> > > > > be
> > > > > > > > > applied to all search arguments.
> > > > > > > > >
> > > > > > > > > so a search on "al anon*" will match with al anon,
> alcoholics
> > > > > > anonymous,
> > > > > > > > > allan anont, but not and mike anon*.
> > > > > > > > >
> > > > > > > > > If you search without the double quotation marks and you
> have
> > > > > multiple
> > > > > > > > > search arguments
> > > > > > > > >
> > > > > > > > > ie
> > > > > > > > >
> > > > > > > > > select *from authors where contains(*,'al anon*')
> > > > > > > > >
> > > > > > > > > your query will bomb with
> > > > > > > > >
> > > > > > > > > Server: Msg 7631, Level 15, State 1, Line 1
> > > > > > > > > Syntax error occurred near 'anon*'. Expected ''''' in
search
> > > > > condition
> > > > > > > 'al
> > > > > > > > > anon*'.
> > > > > > > > >
> > > > > > > > > If you query on a single word or token with no double
> > quotation
> > > > > marks
> > > > > > > you
> > > > > > > > > will get not wildcarding, but will get an exact match to
the
> > > anon,
> > > > > and
> > > > > > > > > anon*.
> > > > > > > > >
> > > > > > > > > If you query on a single word or token and wrap it in
double
> > > > > quotation
> > > > > > > > marks
> > > > > > > > > you will get wildcarding and it will match with anon and
> > anon*.
> > > > > > > > >
> > > > > > > > > So for a single search word or token in your search phrase
> you
> > > > need
> > > > > to
> > > > > > > > wrap
> > > > > > > > > your search phrase in double quotes to get wildcarding.
> > > > > > > > >
> > > > > > > > > For multiple search arguments, words or tokens in your
> search
> > > > phrase
> > > > > > you
> > > > > > > > > need to wrap your search phrase in double quotes to get
> > > > wildcarding,
> > > > > > but
> > > > > > > > all
> > > > > > > > > search arguments will be wildcarded and you will get
matches
> > to
> > > > > tokens
> > > > > > > or
> > > > > > > > > words in your table ending with the * and matching the
> search
> > > > > argument
> > > > > > > > > exactly.
> > > > > > > > >
> > > > > > > > > In FreeText the * will be ignored, but will be return hits
> to
> > > > > matches
> > > > > > to
> > > > > > > > the
> > > > > > > > > search arguments which are stemmed or unstemmed, and end
in
> an
> > *
> > > > and
> > > > > > > don't
> > > > > > > > > end with the *.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Hilary Cotter
> > > > > > > > > Looking for a book on SQL Server replication?
> > > > > > > > > http://www.nwsu.com/0974973602.html
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > "Johnny Nielsen" <DONT.WRITE.THIStrap402@megabit.dk> wrote
> in
> > > > > message
> > > > > > > > > news:unzSUC6dEHA.1644@tk2msftngp13.phx.gbl...
> > > > > > > > > > To quote the MS SQL Server 2000's online book on
CONTAINS:
> > > > > > > > > >
> > > > > > > > > > "If the text and asterisk are not delimited by double
> > > quotation
> > > > > > marks,
> > > > > > > > as
> > > > > > > > > in
> > > > > > > > > > CONTAINS (column, 'text*'), full-text search considers
the
> > > > > asterisk
> > > > > > as
> > > > > > > a
> > > > > > > > > > character and will search for exact matches to text*."
> > > > > > > > > >
> > > > > > > > > > Can anybody confirm the above to be true ?
> > > > > > > > > >
> > > > > > > > > > It is my experience that MS documentation in general -
and
> > in
> > > > > > > particular
> > > > > > > > > > their examples - contains a lot of errors, but here they
> > > > > > specifically
> > > > > > > > > > mention that is should be possible to full-text search
for
> > > > 'text*'
> > > > > > not
> > > > > > > > > > counting the asterisk as a wildcard. When I try to
> full-text
> > > > > search
> > > > > > > for
> > > > > > > > > > 'text*' using CONTAINS the result includes records
> > containing
> > > > > 'text'
> > > > > > > > > without
> > > > > > > > > > the asterisk.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best regards
> > > > > > > > > > Johnny Nielsen
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Best regards
> > > > > > > > Johnny Nielsen
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>



Relevant Pages

  • Re: Interating over the characters in a string
    ... I have a question in regards to .Net string maniplulation. ... The problem is I have a CSV parser that will successfully parse out quoted ... pairs to reprisent a single character. ...
    (microsoft.public.dotnet.framework)
  • RE: search for words with capital letter at end
    ... Best regards, ... "FSt1" wrote: ... in ASCII, the uppercase M is character 77, the lower case m is character 109. ... between lower case and upper case. ...
    (microsoft.public.excel.misc)
  • Re: Special Character
    ... Regards. ... > hold down the alt key and enter 0171 from the keypad. ... > If I hold down alt and enter 174 I got that character as well, ... >> Outgoing mail is certified Virus Free. ...
    (microsoft.public.excel.programming)
  • Re: Single Character form entry in a table
    ... >>>When I have a single character form field, I still must TAB to the next ... Form fields simply don't respond that way. ... >> Regards, ... >> Jay Freedman ...
    (microsoft.public.word.tables)
  • Re: CONTAINS and non-wildcard asterisk
    ... The char.IsLetterOrDigit seems to do a fair job, but it isn't bulletprof. ... Best regards ... >> frameworks char.IsLetterOrDigit to determine if a character is ...
    (microsoft.public.sqlserver.fulltext)