Re: CONTAINS performance
From: malcolm.earnest_at_bom.co.uk ((malcolm.earnest_at_bom.co.uk))
Date: 10/28/04
- Next message: Hilary Cotter: "Re: containstable, top_n_rank, and additional where clause combination causes unexpected result"
- Previous message: Hilary Cotter: "Re: get text near key words"
- In reply to: John Kane: "Re: CONTAINS performance"
- Next in thread: John Kane: "Re: CONTAINS performance"
- Reply: John Kane: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 28 Oct 2004 04:17:44 -0700
Thanks for the far more detailed reply...
I understand that the "top 100" restricted only the maximum size of the result set and not the maximum effort involved when interacting with the text indices. While I do understand the importance of worst case scenarios, at present I'm primarily concerned with the 'normal use' case - where I expect that 'word' will usually be something which occurs in the free-text infrequently. I assume that the cost of interacting with the "FT Catalog" will be greatly influenced by the frequency with which 'word' occurs in all the records associated which share the same "text index" - and to a much lesser extent upon the number of records that are referenced by the "text index" (I assume text indices use a data structure with properties comparable to those of a btree.)
I was aware that CONTAINSTABLE was an alternative way to establish the same results, and suspected that restricting the top-N here will improve my worst case performance... though I'd previously avoided anything relating to rank as my application has no meaningful way to rank my results... the record my user is looking for will either be in the result set or not - and skilled users of the end application are likely to be really good at picking the best word to give the fewest results. :-) I assume if I associate no ranking values the top-n by rank SQLServer will arbitrarily choose any top N? That would be fine for my purposes. I doubt my top-N would ever exceed 100 - as that's more than a human can be bothered to manually read when looking for 'the important record’.
I knew that the free-text index is separate from 'conventional' indices - and I understand your point that (presumably because the we don't know the statistical distribution of vocabulary at optimisation/evaluation time) that assumptions regarding how long each "round trip" to the free-text catalogue might take are difficult to establish without very detailed information about both the query and the data itself. Right now this issue (while interesting) is not my main concern... I'd really like to know what plans SQLServer would establish for queries aiming to find result rows where any free-text column "contains" a search term. If SQLServer were to need to evaluate every outer join I am pretty confident that performance would not be acceptable.
I'm pleased that we can make progress with my 'abstract question' – as, while confidentiality is an issue, the real reason for my wanting to discuss this simple abstract query is that I have dozens of different queries which all share a similar form (most having more outer joins - but all requiring the same 'any column in the result set contains this free-text pattern' functionality.) All my queries are constructed on the fly for heavily customisable tables from an OO framework supporting 'plug-ins'... so even I can't narrow the problem down to one specific query and group of table specifications. To complicate matters further I have to plan for a deployment of a minimum terabyte scale... and not only do I not have any realistic data but I wouldn't even have space to store it all on my development platform! This is the reason I've been very reluctant to give you concrete information - such as count(*)'s for my tables - I'd only be using garbage data for test-runs... and I expect your random selection can b
e just as realistic as mine! My hunch is that if we assumed a million+ rows in each table (each tabular record representing the content of a dull-bureaucratic form with open ended sections) that this would be an estimate in the right ball-park. As far as the efficiency of each round-trip evaluating each "Contains" clause against the "free-text catalogue" I can safely assume that there would not be a problem if I had a single table (instead of many outer joined tables). I feel I have a good understanding of the performance implications should there be a single table - and I'm trying to establish the additional implications arising from the use of outer-joins. I'm doing my experiments with SQLServer 2000 8.00.760 - however I don't consider this especially important. If a newer SQLServer significantly improves performance when executing the queries in their most natural form - then that will become the minimum deployment specification - newer SQLServer licences will be far cheaper than re-coding the generati
on of all these queries and the increased cost of maintaining a more involved code-base.
I've looked at the script you posted - and while your tables are simple... I'm sure they will suffice for illustrative purposes. I noticed that no calls were made to sp_fulltext_table or sp_fulltext_column - which prevents me from simply running the script - but that is easily resolved. I was more concerned to see the comment suggesting that the original problem used inner join... but assume for now that was a just an aberration. The next modification is interesting - but it generates a result set with the same number of rows as that in the Cartesian product of the matching rows from each table. For the non-degenerate case this looks substantially different to both the inner-join query and my original outer-join one... for which I anticipate diabolical performance should there be more than a handful of matches in your ItemTitles and ItemHardware tables... distinct would have a lot of candidates to whittle down. (I realise that I hinted that the word would likely be ‘unusual’ – but my hunch is that it wo
n’t be sufficiently unusual to make this overhead affordable…) The final query seems similar to the previous suggestion - and I suspect this too would suffer serious performance problems when there are several matches... I also suspect that the problem would be significantly exacerbated in my real-life queries where I have many more outer joins.
If I were to re-write my original query with a view to ensuring better scalability I would have chosen something like this (please forgive any syntax glitches - this is intended as pseudo-code SQL):
select top 100 * from
( Select X.*, B.*, C.* from
CONTAINSTABLE(A,*,'word') as X left outer join
B on X.bid=B.id left outer join
C on X.cid=C.id
UNION
select A.*,Y.*,C.* from
A inner join
CONTAINSTABLE(B,*,'word') as Y on A.bid=Y.id left outer join
C on A.cid=C.id
UNION
select A.*,B.*,Z.* from
A left outer join
B on A.id=B.id inner join
CONTAINSTABLE(C,*,'word') as Z
) ;
Each of these three inner select expressions look to me as if they should evaluate without any significant optimisation problems... hence making the worst case for this query the sum of these three individual 3 worst cases.
Discussion of this strategy, however, avoids my most pressing questions - which still concern the original simple query:
select top 100 * from
A left outer join
B on A.bid=B.id left outer join
C on A.cid=C.id
where
contains(A.*,'word') or contains(B.*,'word') or contains(C.*,'word')
I would prefer to use the simple query because it is trivial to construct in my middle tier - hence will less likely contain obscure bugs and will be easier to maintain if it does. Of course - if the simple query does not scale (and can't easily be convinced to scale) then alternatives must be considered. For this reason I still need to ascertain:
1. How can I predict the scalability of the simple query given that I don't have any large volume of data against which I can collect empirical evidence? [Real data will likely only exist years after deployment and that will be far too late to change the design!]
2. Does the above query actually perform as badly as my (very simple) experiments suggested to me that it might in the context of a very large number of records?
3. Are there any ways to improve performance for the query with the simple syntax - maybe by looking more closely at which primary key is used as a basis for text indexing? Would such a minor change have any significant impact on the SQLServer query optimiser.
4. Does the up-and-coming SQLServer 2005 "move the goalposts?" Does it optimise these queries substantially better than the SQLServer 2000 I'm using today?
Thanks for your effort so far – but I’d still like to be able to answer the above 4 more specific points.
**********************************************************************
Sent via Fuzzy Software @ http://www.fuzzysoftware.com/
Comprehensive, categorised, searchable collection of links to ASP & ASP.NET resources...
- Next message: Hilary Cotter: "Re: containstable, top_n_rank, and additional where clause combination causes unexpected result"
- Previous message: Hilary Cotter: "Re: get text near key words"
- In reply to: John Kane: "Re: CONTAINS performance"
- Next in thread: John Kane: "Re: CONTAINS performance"
- Reply: John Kane: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|