Re: CONTAINS performance
From: John Kane (jt-kane_at_comcast.net)
Date: 10/28/04
- Next message: John Kane: "Re: get text near key words"
- Previous message: Dot net work: "Re: containstable, top_n_rank, and additional where clause combination causes unexpected result"
- In reply to: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Next in thread: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Reply: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 28 Oct 2004 15:29:02 -0700
You're welcome, Mal,
I'm going to answer your four questions first, as I suspect they are your
priority questions and additional comments will follow this posting on your
other questions/concerns.
Q1. How can I predict the scalability of the simple query given that I don't
have any large volume of data against which I can collect empirical
evidence? [Real data will likely only exist years after deployment and that
will be far too late to change the design!]
A1. In a nutshell, you cannot. However, what you can do is build a sample
database with publicly available text data from the Moby lexicon project
built by Grady Ward at http://www.dcs.shef.ac.uk/research/ilash/Moby/ and
then setup up a standard benchmarking test. Note, this data is freely
available as this data is in the public domain, per Grady Ward. SQL Server
2000 as well as other RDBMS vendors, such as ORACLE and IBM compete in
standard TPC Benchmarking tests in order to determine which database is the
fastest, etc. while using a standard test suite of tools, database schema
and data, for example TPC Benchmark C
(TPC-C -http://www.tpc.org/tpcc/detail.asp). The closest to a "Full Text
Search" TPC Benchmark is TCP-W (http://www.tpc.org/tpcw/default.asp), but
this too is more of a transactional web e-Commerce benchmark and not
strictly for FTS queries. I have long been kicking around the idea of
building a FTS Benchmarking toolkit along the lines of a TPC Benchmark suite
and even proposed it as a submitted abstract for the 2003 PASS conference
(it was not accepted, but my Text Mining abstract was accepted, go figure!).
Full Text Indexing (FTI) and Full Text Search (FTS) performance go hand in
hand along with the language of the text (Moby has word lists in five of
languages), the size (both row count and the amount of text per row) with
creating a matrix of tests that will not only measure the FTI performance,
but will measure the FTS with multiple clients issuing random FTS queries.
Additional factors, include both hardware and software configurations, for
example: the number, speed of the CPU's as well as the size and type of L
cache per CPU. Other hardware configurations, includes the amount of RAM,
the number of disk controllers as well as the type of raid disk drives and
where the database files and FT Catalog files are placed. As you can see
this is a non-trivial effort and one I plan on documenting for my book on
this subject. At this time, and until this SQL FTS Benchmarking Toolkit is
completely developed, I'd recommend that you download some of the Moby test
files and develop a test database and tables and load this data into it and
then use the Microsoft provide client toolkit called OSTRESS utility that
can be downloaded from
http://support.microsoft.com/default.aspx?scid=kb;en-us;887057 and use this
tool to randomly generate SQL FTS queries against your test database. While
this may not be exactly what your asking for, it is best solution for
predicting the scalability of the simple query at this time.
Q2. Does the above query actually perform as badly as my (very simple)
experiments suggested to me that it might in the context of a very large
number of records?
A2. Assuming you're referring to the query that ends with
"contains(A.*,'word') or contains(B.*,'word') or contains(C.*,'word')", and
without knowing how many rows are in each table, the configuration of your
server, the placement of your FT Catalog relative to your database files, it
is difficult to answer. Speculating and assuming, that you have a relative
large number of rows (+10,000) per table, I would agree that with even
larger row counts (+100,000 to +1,000,000), you would get worse performance
using CONTAINS vs. CONTAINSTABLE with Top_N_Rank. Additionally, with each
CONTAINS clause OR'ed, your resultset for this query, with a larger number
of rows (+100,000 to +1,000,000) would get larger results as each of the
tables grows and therefore worse performance as the tables grow.
Q3. Are there any ways to improve performance for the query with the simple
syntax - maybe by looking more closely at which primary key is used as a
basis for text indexing? Would such a minor change have any significant
impact on the SQLServer query optimizer.
A3. Yes, there are many ways of improving the performance of the simple
query and assuming you have the proper clustered and covering non-clustered
indexes on all tables as well as a significantly large number of rows for
SQL Server's Optimizer to utilize the clustered index and covering
non-clustered index, this query could be tuned up to a point. The best
method of determining this is to get a Query Execution plan of the query on
your server along with a row count for each table, for example (without the
where clause):
SET SHOWPLAN_ALL ON
GO
select top 100 * from
A left outer join
B on A.bid=B.id left outer join
C on A.cid=C.id
GO
SET SHOWPLAN_ALL OFF
GO
SHOWPLAN_ALL will provide all of the query plan information and along with
the row counts for tables A, B, and C, we can determine if the query without
the where clause and multiple CONTAINS statements is in fact using all of
the optimal indexes that are on each of the tables or if in fact a Table or
Index Scan is being performed. Once this is determined, and any changes
made, then add back the where clause and re-run the SHOWPLAN_ALL with the
where clause and multiple CONTAINS and compare the differences. If you want,
I can analyze the output and provide feedback on any changes and
recommendations. Once the optimal query is set, then we can re-write it to
include CONTAINSTABLE with and without Top_N_Rank and again determine if the
query is performing up to the optimal levels. Finally, gathering the
requested info and making any recommended change can have some impact on how
the SQL Server query optimizer executes this query in the optimal manner
(especially the row counts). However, since the FT Catalog reside outside of
SQL Server (remote scan), the SQL Server optimizer has no information on how
to improve the FTS query and cannot prevent the "round trip" for each
CONTAINS clause.
Q4. Does the up-and-coming SQLServer 2005 "move the goalposts?" Does it
optimize these queries substantially better than the SQLServer 2000 I'm
using today?
A4. Yes, very much so. While I have not independently verified this,
Microsoft has claimed that the FT Indexing performance for SQL Server 2005
(codename Yukon) has improved at least an order-of-magnitude better over SQL
Server 2000 and FT Search performance has also improved by at least 25% to
50% over SQL Server 2000 FTS query performance, assuming all other
conditions equal. Microsoft has indicate that they are working on a SQL
Server 2005 FTS Performance white paper, but as of now, I do not know what
will be the content and when it will become public. Additionally, the
following SQL Server 2005 (Yukon) FTS white paper was released in Dec 2003:
"SQL Server 2005 Full-Text Search: Internals and Enhancements" at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/sql2005ftsearch.asp.
I'd recommend that you read this and keep in mind that SQL Server 2005 is
still in a limited Beta (beta2) release at this time and is not expected to
be officially released until the first half of 2005. Note, it is available
for download for MSDN subscribers.
More to follow as I need to get some other tasks completed before I can more
fully reply to the rest of your posting!
Thanks,
John
"Mal Earnest" <malcolm.earnest@bom.co.uk> wrote in message
news:urEIA$NvEHA.3916@TK2MSFTNGP10.phx.gbl...
> Thanks for the far more detailed reply...
>
> I understand that the "top 100" restricted only the maximum size of the
result set and not the maximum effort involved when interacting with the
text indices. While I do understand the importance of worst case scenarios,
at present I'm primarily concerned with the 'normal use' case - where I
expect that 'word' will usually be something which occurs in the free-text
infrequently. I assume that the cost of interacting with the "FT Catalog"
will be greatly influenced by the frequency with which 'word' occurs in all
the records associated which share the same "text index" - and to a much
lesser extent upon the number of records that are referenced by the "text
index" (I assume text indices use a data structure with properties
comparable to those of a btree.)
>
> I was aware that CONTAINSTABLE was an alternative way to establish the
same results, and suspected that restricting the top-N here will improve my
worst case performance... though I'd previously avoided anything relating to
rank as my application has no meaningful way to rank my results... the
record my user is looking for will either be in the result set or not - and
skilled users of the end application are likely to be really good at picking
the best word to give the fewest results. :-) I assume if I associate no
ranking values the top-n by rank SQLServer will arbitrarily choose any top
N? That would be fine for my purposes. I doubt my top-N would ever exceed
100 - as that's more than a human can be bothered to manually read when
looking for 'the important record'.
>
> I knew that the free-text index is separate from 'conventional' indices -
and I understand your point that (presumably because the we don't know the
statistical distribution of vocabulary at optimisation/evaluation time) that
assumptions regarding how long each "round trip" to the free-text catalogue
might take are difficult to establish without very detailed information
about both the query and the data itself. Right now this issue (while
interesting) is not my main concern... I'd really like to know what plans
SQLServer would establish for queries aiming to find result rows where any
free-text column "contains" a search term. If SQLServer were to need to
evaluate every outer join I am pretty confident that performance would not
be acceptable.
>
> I'm pleased that we can make progress with my 'abstract question' - as,
while confidentiality is an issue, the real reason for my wanting to discuss
this simple abstract query is that I have dozens of different queries which
all share a similar form (most having more outer joins - but all requiring
the same 'any column in the result set contains this free-text pattern'
functionality.) All my queries are constructed on the fly for heavily
customisable tables from an OO framework supporting 'plug-ins'... so even I
can't narrow the problem down to one specific query and group of table
specifications. To complicate matters further I have to plan for a
deployment of a minimum terabyte scale... and not only do I not have any
realistic data but I wouldn't even have space to store it all on my
development platform! This is the reason I've been very reluctant to give
you concrete information - such as count(*)'s for my tables - I'd only be
using garbage data for test-runs... and I expect your random selection can
be just as realistic as mine! My hunch is that if we assumed a million+ rows
in each table (each tabular record representing the content of a
dull-bureaucratic form with open ended sections) that this would be an
estimate in the right ball-park. As far as the efficiency of each
round-trip evaluating each "Contains" clause against the "free-text
catalogue" I can safely assume that there would not be a problem if I had a
single table (instead of many outer joined tables). I feel I have a good
understanding of the performance implications should there be a single
table - and I'm trying to establish the additional implications arising from
the use of outer-joins. I'm doing my experiments with SQLServer 2000
8.00.760 - however I don't consider this especially important. If a newer
SQLServer significantly improves performance when executing the queries in
their most natural form - then that will become the minimum deployment
specification - newer SQLServer licences will be far cheaper than re-coding
the generation of all these queries and the increased cost of maintaining a
more involved code-base.
>
> I've looked at the script you posted - and while your tables are simple...
I'm sure they will suffice for illustrative purposes. I noticed that no
calls were made to sp_fulltext_table or sp_fulltext_column - which prevents
me from simply running the script - but that is easily resolved. I was more
concerned to see the comment suggesting that the original problem used inner
join... but assume for now that was a just an aberration. The next
modification is interesting - but it generates a result set with the same
number of rows as that in the Cartesian product of the matching rows from
each table. For the non-degenerate case this looks substantially different
to both the inner-join query and my original outer-join one... for which I
anticipate diabolical performance should there be more than a handful of
matches in your ItemTitles and ItemHardware tables... distinct would have a
lot of candidates to whittle down. (I realise that I hinted that the word
would likely be 'unusual' - but my hunch is that it won't be sufficiently
unusual to make this overhead affordable.) The final query seems similar to
the previous suggestion - and I suspect this too would suffer serious
performance problems when there are several matches... I also suspect that
the problem would be significantly exacerbated in my real-life queries where
I have many more outer joins.
>
> If I were to re-write my original query with a view to ensuring better
scalability I would have chosen something like this (please forgive any
syntax glitches - this is intended as pseudo-code SQL):
>
> select top 100 * from
> ( Select X.*, B.*, C.* from
> CONTAINSTABLE(A,*,'word') as X left outer join
> B on X.bid=B.id left outer join
> C on X.cid=C.id
> UNION
> select A.*,Y.*,C.* from
> A inner join
> CONTAINSTABLE(B,*,'word') as Y on A.bid=Y.id left outer join
> C on A.cid=C.id
> UNION
> select A.*,B.*,Z.* from
> A left outer join
> B on A.id=B.id inner join
> CONTAINSTABLE(C,*,'word') as Z
> ) ;
>
> Each of these three inner select expressions look to me as if they should
evaluate without any significant optimisation problems... hence making the
worst case for this query the sum of these three individual 3 worst cases.
>
> Discussion of this strategy, however, avoids my most pressing questions -
which still concern the original simple query:
>
> select top 100 * from
> A left outer join
> B on A.bid=B.id left outer join
> C on A.cid=C.id
> where
> contains(A.*,'word') or contains(B.*,'word') or contains(C.*,'word')
>
> I would prefer to use the simple query because it is trivial to construct
in my middle tier - hence will less likely contain obscure bugs and will be
easier to maintain if it does. Of course - if the simple query does not
scale (and can't easily be convinced to scale) then alternatives must be
considered. For this reason I still need to ascertain:
>
> 1. How can I predict the scalability of the simple query given that I
don't have any large volume of data against which I can collect empirical
evidence? [Real data will likely only exist years after deployment and that
will be far too late to change the design!]
> 2. Does the above query actually perform as badly as my (very simple)
experiments suggested to me that it might in the context of a very large
number of records?
> 3. Are there any ways to improve performance for the query with the simple
syntax - maybe by looking more closely at which primary key is used as a
basis for text indexing? Would such a minor change have any significant
impact on the SQLServer query optimiser.
> 4. Does the up-and-coming SQLServer 2005 "move the goalposts?" Does it
optimise these queries substantially better than the SQLServer 2000 I'm
using today?
>
> Thanks for your effort so far - but I'd still like to be able to answer
the above 4 more specific points.
>
> **********************************************************************
> Sent via Fuzzy Software @ http://www.fuzzysoftware.com/
> Comprehensive, categorised, searchable collection of links to ASP &
ASP.NET resources...
- Next message: John Kane: "Re: get text near key words"
- Previous message: Dot net work: "Re: containstable, top_n_rank, and additional where clause combination causes unexpected result"
- In reply to: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Next in thread: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Reply: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|