Re: CONTAINS performance

From: malcolm.earnest_at_bom.co.uk ((malcolm.earnest_at_bom.co.uk))
Date: 10/27/04


Date: Wed, 27 Oct 2004 04:41:30 -0700

First and foremost, John and Hillary thank you both for your responses... I've not established what I set out to discover as yet, but there were several useful ideas, and you both asked me to clarify my question.

When I initially asked about performance - I suppose I should have made it clearer that I am primarily concerned by scalability. I can put-up with the wall-clock speed of execution at present but I need to establish that as the number of records grows over time that my solution won't grind to a halt.

The advice to limit the number of results was good - I'd omitted that for brevity - the actual form of query which interests me includes such a limit - I'd have better stated the query:

            Select top 100 * from
              A left outer join B on A.bid=B.id
              left outer join C on A.cid=C.id
            Where
              contains(A.*, ‘word’)
              or contains (B.*, ‘word’)
              or contains (C.*, ‘word’)

Assume that the ‘word’ I choose to search upon is sufficiently unusual to limit my result-set after application of the where clause to between, say, 0 and 10 rows and that all these rows must be returned to my client – hence for the circumstances I was considering the TOP 100 would likely have no significant effect on performance. As you might have guessed, the query I posted is a significant simplification of a series of real world problems (hence the generic names) the detailed explanation of which is way beyond the scope of a post here. In my example the relationship from A to B and from A to C is 1..n. For some of the tables I actually use B.id is unique in B; in some C.id unique in C; for some both are unique; for some neither. Many queries and updates within the system only act upon these tables individually (hence fixing the table structure.) I also require support for data mining searches for which the result set matches that returned for the query above. While I can see a potential for improving
performance by using a simplified table structure (and avoiding the complexity of evaluation strategy introduced by left-outer-join) this is not practical as a consequence of other binding project constraints. I need to keep the distinct tables I introduced as A, B and C – their relationship can be considered “set-in-stone.” I realise that a single table would eliminate concerns about performance but that idea is not an option for this project.

I’m still a little unclear about your terminology (specifically “scans of the FTS catalogs” and “table scans”) I suspect that this relate closely to what I consider the primary performance concern.

I assume by "table scans" you suggest that SQLServer might evaluate both outer joins for each record in A then establish if the candidate result rows match any of the three contains clauses? If this is the case then the performance would be worse than linear in the number of records in A – and hence exhibit relatively poor scalability.

I assume “scans of the FTS catalogs” you mean that SQLServer might use the text index to efficiently establish rows from A matching the contains predicate ranging over A; outer join those rows with B and C to get the initial result rows; establish in B using the text index rows matching the contains predicate ranging over B then inner join those results with A, then outer join that result with C; finally use the text index for C to efficiently establish the remaining rows in C before inner joining those with A and outer joining with B – all the while carefully arranging not to return the exact same row twice? This would be a workable strategy but it feels ‘unlikely’ as it both fails to explain observed performances during my (extremely limited and potentially very error prone) experiments and because when I specified two OUTER JOIN expressions in my query I wouldn’t have expected the optimiser to implement this using two inner-join and four outer-join relationships unless I re-write my query to explicitly h
int that this is the way I need the query to be evaluated. While this approach to establish the result set is relatively complex, it offers far better scalability – especially if table A happens to have a large number of records.

I'm not in a position to give concrete sp_help output - primarily as this question is a massive simplification of the actual queries I use... Tables typically have 20-50 fields of various types (CLOB; BLOB; integer; date; etc. etc.); All tables have an integer primary key and in example above assume indices are defined on B.id and C.id - each of which is can be assumed ordinal (say integer). The textual information is free-text enabled in every table.

I didn't understand John's stored procedure - but the idea of a stored procedure to make explicit an evaluation strategy for the desired result-set is certainly an interesting and technically plausible possibility... all be it one which would prove quite a challenge in the context of my project!

I'm not encountered the "Best Practices Analyzer Tool" - though the advice on http://www.sqldude.4t.com/rules.html looks familiar enough. Is this tool supplied with SQLServer or is it 3rd party product? Would this tool tell me how my query is being optimised?

At this stage I'm trying to establish if my hunch is accurate about the scalability of queries of the type illustrated above. I am aware that the three contains clauses would be detrimental to performance - and I am aware that if I had a single flat table that any performance issues would likely disappear. I'm also aware that I could use an alternative sequence of queries to efficiently establish the same result rows - but it would be a big task to re-write my query generator to make this optimisation and I am reluctant to do this if there is an alternative. Queries of the form above work OK for my current test data on a quick server - (hundreds of records) but I want a solution which will scale to support hundreds of thousands to millions of records. I'm still not clear how the query above will be optimised – and, as such, I'm still not in a position to estimate scalability. I can’t use a single table and avoid outer-joins – but I might be able to alter the way in which the tables are indexed if that w
ould influence the optimiser in a positive way.

**********************************************************************
Sent via Fuzzy Software @ http://www.fuzzysoftware.com/
Comprehensive, categorised, searchable collection of links to ASP & ASP.NET resources...



Relevant Pages

  • Re: You think you have it bad
    ... Then there was the loud noise of the responses. ... > I don't make a habit of surfing the Microsoft VB groups but do remember ... > Anders Ohlsson and John Kaster were encouraging folks to visit the VB ...
    (borland.public.delphi.non-technical)
  • Re: total value of orders
    ... I hope now everyone who reads this sees the idiocy of Arno R and John ... Arno R has responded 8 times in this thread and John Marshall has ... and touting peer-to-peer support but in the end with 14 responses and ... the ideal output would be a sheet for "supplier x" listing each order ...
    (comp.databases.ms-access)
  • Re: CE 5.0 DHCP problems with short lease times
    ... the NAK is completely ignored. ... down, followed by a DISCOVER, etc.. ... With other responses, it seems to be OK at ... DHCP client just does a simple divide by two on the lease time (>> ...
    (microsoft.public.windowsce.embedded)
  • Re: Royalty for Commoners
    ... > writing the article, I got responses from you, John, that ranged from 'how ... I remember writing that one. ... > to 'you have done such an incredible job that my [John Brandon] name should ... > has a grandiose sense of self-importance (e.g., EXAGGERATES ACHIEVEMENTS ...
    (soc.genealogy.medieval)
  • Re: Excuse Me...
    ... thinks he could be cast as Clete if they were looking for movies and his ... | John P - who shares a last name with Clete. ...
    (rec.arts.mystery)

Loading