Re: CONTAINS performance
From: John Kane (jt-kane_at_comcast.net)
Date: 10/29/04
- Previous message: Jason Robertson: "Re: DIFFICULT PROBLEM! SSL for SQL 2000 Server. MS Fix bulletin does not help at all."
- In reply to: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Next in thread: Hilary Cotter: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Date: Fri, 29 Oct 2004 16:25:34 -0700
You're welcome, Mal,
Good, we agree on being able to predict scalability and it's host of hidden
gotchas! That said, and with the query plan, I can start to give you more
explicit feedback on how to improve the "simple" query including the where
clause. While a row count of the tables would of been of more help, I have
enough to go on to tell you how to improve the performance of this simple
query, although I'm assuming very few (& 100 rows is considered very few) in
each table and that the recommendations are "fixes" for the low row count
and *force* the optimizer to pick a path that it would not normally do
considering the low row counts.
First of all, relative to your concern with the with performance of the
relational join in the context of the free-text optimization, internally SQL
Server "re-writes" the query to satisfy the syntax of the MSSearch engine,
passes this re-written query to the MSSearch engine and then results are
returned to SQL Server and then & only then are joined again with the actual
SQL Server tables. There is no practical way to optimize the written
MSSearch syntax as it is internal to SQL Server & MSSearch and not available
to normal SQL Server 2000 optimizer hints. Note, this is all necessary
because the FT Catalog structures and access methods are stored outside of
SQL Server and therefore not available to the SQL Server optimizer.
Secondly, it is incorrect to assume that "each free-text search will
identify only a few records - even if the tables contain many millions". The
free-text search components (the CONTAINS statement and search words &/or
phrases) are passed directly to the MSSearch engine BEFORE joining with the
SQL Server table data and ONLY when ALL of the FT Catalog is read for EACH
contains statement and when all of the data is passed back to SQL Server,
then this data is joined with the SQL Server 2000 table data and the where
clause is applied. Furthermore, it is incorrect to assume that "SQLServer
performs the join operation for every record in the tables before
restricting based upon the free-text criteria" as all of the search words
are passed to the MSSearch engine (for each contains statement) and then all
results are passed back to SQL Server.
Another incorrect assumption for SQL Server 2000 is that SQL Server 2000 and
this most likely will change for SQL Server 2005 is that SQL Server 2000
does not "access the records themselves", but instead allows the MSSearch
engine to do this and as an external service it (SQL Server) cannot do a
"join on the indices and comparing those key values with the free-text
matches". Yes, this would improve performance and most likely this is what
has be designed into SQL Server 2005 as the MSSearch Indexing Engine has
been brought inside of SQL Server 2005, while in SQL Sever 2000, the
MSSearch Indexing engine remains an external service. My key point here is
that many of your assumptions are incorrect as the SQL Server optimize has
no access to the external MSSearch structures & indexes.
Thank you for supplying the simple query execution plan including the where
clause with the CONTAINS statements and yes they can be difficult to read as
well as to understand as I'm assuming that most of your very knowledgeable
SQL join behaviors most likely comes from ORACLE (or IBM?). Correct? SQL
Server query execution plans have their own nuisances and tricks to
understanding them. The biggest issue with the pure T-SQL join portion
(excluding the Remote Scan for the contains clauses) of this query is the
fact that for so few records (100 rows), and for small row width (AvgRowSize
= 4) even with Clustered Indexes on the Primary Key (column name PK) for
each of the tables, SQL Server will almost always select a "Clustered Index
Scan", i.e.. reading all rows from each table vs. just accessing the
Clustered Index, unless forced to by an optimizer hint such as "INDEX = " as
this instructs SQL Server to use the specified indexes for a table. (see SQL
Server 2000 Books Online title "Hints" for more info on table hints). For
such small tables, you can force the use of the Clustered Indexes via the
following modified query:
Select * from A (index = 1) left outer join B (index = 1) on A.id=B.id left
outer join C (index = 1) on A.id=C.id
Where contains(A.*,'word') or contains (B.*,'word') or contains
(C.*,'word');
Furthermore, even with the table hints "(index = 1)" on each table and with
so few rows (approx. 100) for this query, the SQL Sever 2000 optimizer, will
most likely continue to do a "Clustered Index Scan" as it is designed to
determine the best query plan and in this case, since all rows for each
table most likely fit in one or two data pages, it can scan all rows faster
than it can via accessing the clustered index for each table. Additionally,
a "MANY-TO-MANY MERGE ([A].[ID])=([B].[ID])" is occurring as well and this
too is impacting the performance of this query as "A many-to-many merge join
uses a temporary table to store rows." - see "Understanding Merge Joins" at
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_5alv.asp
for more details. Below are some addition information I gleamed from the
query plan and references:
"Row Count Spool" & "Left Outer Join" with EstimateRows = 1.0 and AvgRowSize
= 4
"Row Count Spool" & "Left Semi Join" with EstimateRows = 1.0 and AvgRowSize
= 4
"Row Count Spool" & "Left Semi Join" with EstimateRows = 1.0 and AvgRowSize
= 4
see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/optimsql/odp_tun_1_1nak.asp
and
see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_365o.asp
see also "Large Data Operations in SQL Server" -
http://www.sql-server-performance.com/jc_large_data_operations.asp for more
info on query plans for larger tables. In the end, it's best to scale up
your test to at least 10,000 rows and then ensure that the PK Clustered
index statistics are up-to-date and then re-test & review your simple query
execution plan for "Clustered Index Seek" to confirm that the SQL Server
optimizer is effectively using the PK Clustered indexes & this is why I
wanted to know the row counts for each table....
Yes, you are correct that with this few of rows, all of the relevant data
could easily fit into RAM, and more importantly in SQL Server cache, however
you should execute the above query at least 3 times in order to "warm up"
the cache and then take the average or the last query as your query's
performance time. Q. Am I interpreting this plan correctly in assuming it
states that the outer-join is computed considering at least the keys for
every record in A? Simple, answer yes, but as I said this query is not using
all the best features of SQL Sever optimizer.
Enjoy!
John
"Mal Earnest" <malcolm.earnest@bom.co.uk> wrote in message
news:eu9Zt6bvEHA.4048@TK2MSFTNGP15.phx.gbl...
> Thank you again for your time and effort. there was a lot of useful
information there.
>
> With respect to predicting scalability I understand your argument that
without the actual data that there can be a whole host of hidden "gotcha"
situations. but I really do need to estimate this issue of scalability
despite an accurate simulation being all-but impossible. I'm familiar with
the TPC-C; TPC-W and generic stress tests etc, and I'm aware that random
test data can be generated from word-lists etc. - but experience has shown
that even after much effort this data still exhibits statistical
distributions. These techniques will likely prove useful - though I hope to
only go to that level of effort once I've implemented a system I believe
should scale well.
>
> Regarding performance of the query itself, I recognise your concern with
the computational expense of the remote scans of the free-text catalogue -
however my own concerns are primarily with performance of the relational
join in the context of the free-text optimisation. I know my query
suggested a search for an ordinary dictionary word - but in practice I
expect the facility to be used to search for complex associations between
words - and/or unusual proper nouns or distinctive codes like ISBNs; car
registration plates or Swiss bank account numbers! With this in mind I
expect that each free-text search will identify only a few records - even if
the tables contain many millions. If SQLServer performs the join operation
for every record in the tables before restricting based upon the free-text
criteria then (IMHO) this will likely prove a scalability issue. If
SQLServer is able to determine a mechanism to read result-set rows more
efficiently (i.e. without accessing the records themselves but rather by
performing the join on the indices and comparing those key values with the
free-text matches) then this will improve performance and make possible
larger scale deployments. I suspect, however, that the root problem
remains - i.e. that every primary key in A is considered to resolve the
outer join. (which I suspect should be unnecessary.) While I also need to
be concerned about the increasing cost of accessing the free-text indices -
the fact that I expect typical search terms to be 'well-chosen' encourages
me to consider the potential requirement to compute the whole outer join
(even if this is just over indexes and not the larger records themselves) to
be the most pressing concern.
>
> I've been looking at execution plans, but should admit that so far I'm
finding them difficult to interpret. For example, I have some data for a
query similar to my original post (but where indices are arranged to be
slightly more amenable to optimisation.) I've been analysing:
>
> Select * from A left outer join B on A.id=B.id left outer join C on
A.id=C.id
> Where contains(A.*,'word') or contains (B.*,'word') or contains
(C.*,'word');
>
> The wall-clock timings for this query are not particularly helpful -
sometimes taking several seconds (presumably when indices are not cached)
but with repeated execution executing quickly on successive evaluations. It
is important to note that this test was performed with only a few hundred
records - whereas I ultimately want to support thousands to millions, and
that for my test all the relevant data could easily fit into RAM. At the
end of this post is the output from showplan_all - about which I'd welcome
any comments. Am I interpreting this plan correctly in assuming it states
that the outer-join is computed considering at least the keys for every
record in A?
>
> Thanks for the pointer to the whitepaper on SQLServer 2005 - it was
interesting - though I think it best to defer judgement on the performance
improvements until I see them for my own queries and schema. The advances
certainly appear to offer substantial benefits but I doubt I can be sure
what it means for me until I find opportunity to at least trial a Beta.
>
> SET SHOWPLAN_ALL ON 3 1 0 NULL NULL 1 NULL NULL NULL NULL NULL NULL NULL
NULL SETON 0 NULL
> select * from A left outer join B on B.ID = A.ID left outer join C on
C.ID = A.ID where contains(A.*, '"word"') or contains(B.*, '"word"') or 4
2 0 NULL NULL 2 NULL 549.56439 NULL NULL NULL 1.5461341 NULL NULL SELECT 0
NULL
> |--Compute Scalar(DEFINE:([A].[FC]=[A].[FC], [A].[FU]=[A].[FU],
[B].[FU]=[B].[FU], [C].[FU]=[C].[FU])) 4 3 2 Compute Scalar Compute Scalar
DEFINE:([A].[FC]=[A].[FC], [A].[FU]=[A].[FU], [B].[FU]=[B].[FU],
[C].[FU]=[C].[FU]) [A].[FC]=[A].[FC], [A].[FU]=[A].[FU], [B].[FU]=[B].[FU],
[C].[FU]=[C].[FU] 549.56439 0.0 5.4956436E-5 2728 1.5461341 [A].[ID],
[A].[FD], [A].[FB], [A].[FC], [A].[FA], [A].[FT], [A].[FN], [A].[FL],
[A].[F_], [A].[FV], [A].[FI], [ NULL PLAN_ROW 0 1.0
> |--Filter(WHERE:(([Expr1012] OR [Expr1013]) OR [Expr1014])) 4 5 3
Filter Filter WHERE:(([Expr1012] OR [Expr1013]) OR [Expr1014]) NULL
549.56439 0.0 3.1874733E-4 2745 1.5460792 [A].[ID], [A].[FD], [A].[FB],
[A].[FC], [A].[FA], [A].[FT], [A].[FN], [A].[FL], [A].[F_], [A].[FV],
[A].[FI], [ NULL PLAN_ROW 0 1.0
> |--Nested Loops(Left Semi Join, WHERE:([Expr1012] OR
[Expr1013])OUTER REFERENCES:([C].[FO]), DEFINE:([Expr1014] = [PROBE VALUE]))
4 6 5 Nested Loops Left Semi Join WHERE:([Expr1012] OR [Expr1013])OUTER
REFERENCES:([C].[FO]), DEFINE:([Expr1014] = [PROBE VALUE]) [Expr1014] =
[PROBE VALUE] 549.56439 0.0 2.5524213E-3 2745 1.5457604 [A].[ID], [A].[FD],
[A].[FB], [A].[FC], [A].[FA], [A].[FT], [A].[FN], [A].[FL], [A].[F_],
[A].[FV], [A].[FI], [ NULL PLAN_ROW 0 1.0
> |--Nested Loops(Left Semi Join, WHERE:([Expr1012])OUTER
REFERENCES:([B].[FS]), DEFINE:([Expr1013] = [PROBE VALUE])) 4 7 6 Nested
Loops Left Semi Join WHERE:([Expr1012])OUTER REFERENCES:([B].[FS]),
DEFINE:([Expr1013] = [PROBE VALUE]) [Expr1013] = [PROBE VALUE] 610.62708 0.0
2.8360237E-3 2745 1.1154289 [A].[ID], [A].[FD], [A].[FB], [A].[FC],
[A].[FA], [A].[FT], [A].[FN], [A].[FL], [A].[F_], [A].[FV], [A].[FI], [ NULL
PLAN_ROW 0 1.0
> | |--Nested Loops(Left Semi Join, OUTER
REFERENCES:([A].[ID]), DEFINE:([Expr1012] = [PROBE VALUE])) 4 8 7 Nested
Loops Left Semi Join OUTER REFERENCES:([A].[ID]), DEFINE:([Expr1012] =
[PROBE VALUE]) [Expr1012] = [PROBE VALUE] 678.47455 0.0 3.1511374E-3 2745
0.68396938 [A].[ID], [A].[FD], [A].[FB], [A].[FC], [A].[FA], [A].[FT],
[A].[FN], [A].[FL], [A].[F_], [A].[FV], [A].[FI], [ NULL PLAN_ROW 0 1.0
> | | |--Merge Join(Left Outer Join, MANY-TO-MANY
MERGE:([A].[ID])=([B].[ID]), RESIDUAL:([B].[ID]=[A].[ID])) 4 9 8 Merge Join
Left Outer Join MANY-TO-MANY MERGE:([A].[ID])=([B].[ID]),
RESIDUAL:([B].[ID]=[A].[ID]) NULL 753.86066 0.04586786 1.9198203E-2 2744
0.26548451 [A].[ID], [A].[FD], [A].[FB], [A].[FC], [A].[FA], [A].[FT],
[A].[FN], [A].[FL], [A].[F_], [A].[FV], [A].[FI], [ NULL PLAN_ROW 0 1.0
> | | | |--Merge Join(Left Outer Join,
MERGE:([A].[ID])=([C].[ID]), RESIDUAL:([C].[ID]=[A].[ID])) 4 10 9 Merge Join
Left Outer Join MERGE:([A].[ID])=([C].[ID]), RESIDUAL:([C].[ID]=[A].[ID])
NULL 424.98514 0.0 7.0256959E-3 2249 0.12597823 [A].[ID], [A].[FD],
[A].[FB], [A].[FC], [A].[FA], [A].[FT], [A].[FN], [A].[FL], [A].[F_],
[A].[FV], [A].[FI], [ NULL PLAN_ROW 0 1.0
> | | | | |--Clustered Index
Scan(OBJECT:([Cat].[dbo].[A].[PK]), ORDERED FORWARD) 4 11 10 Clustered Index
Scan Clustered Index Scan OBJECT:([Cat].[dbo].[A].[PK]), ORDERED FORWARD
[A].[ID], [A].[FD], [A].[FB], [A].[FC], [A].[FA], [A].[FT], [A].[FN],
[A].[FL], [A].[F_], [A].[FV], [A].[FI], [ 312.0 5.0171092E-2 4.2170001E-4
1233 5.0592791E-2 [A].[ID], [A].[FD], [A].[FB], [A].[FC], [A].[FA],
[A].[FT], [A].[FN], [A].[FL], [A].[F_], [A].[FV], [A].[FI], [ NULL PLAN_ROW
0 1.0
> | | | | |--Sort(ORDER BY:([C].[ID] ASC)) 4 12
10 Sort Sort ORDER BY:([C].[ID] ASC) NULL 321.0 1.1261261E-2 4.270569E-3
1026 6.8356745E-2 [C].[FO], [C].[FQ], [C].[ID], [C].[FG], [C].[FX],
[C].[FP], [C].[FJ], [C].[FW], [C].[FU], [C].[FK], [C].[X_D NULL PLAN_ROW 0
1.0
> | | | | |--Clustered Index
Scan(OBJECT:([Cat].[dbo].[C].[PK])) 4 13 12 Clustered Index Scan Clustered
Index Scan OBJECT:([Cat].[dbo].[C].[PK]) [C].[FO], [C].[FQ], [C].[ID],
[C].[FG], [C].[FX], [C].[FP], [C].[FJ], [C].[FW], [C].[FU], [C].[FK],
[C].[X_D 321.0 2.6196657E-2 0.0002158 1042 5.2824914E-2 [C].[FO], [C].[FQ],
[C].[ID], [C].[FG], [C].[FX], [C].[FP], [C].[FJ], [C].[FW], [C].[FU],
[C].[FK], [C].[X_D NULL PLAN_ROW 0 1.0
> | | | |--Sort(ORDER BY:([B].[ID] ASC)) 4 17 9
Sort Sort ORDER BY:([B].[ID] ASC) NULL 714.0 1.1261261E-2 1.0659463E-2 504
7.4437201E-2 [B].[FS], [B].[ID], [B].[FM], [B].[FY], [B].[FZ], [B].[FH],
[B].[FR], [B].[FE], [B].[FF], [B].[F_], [B].[FJ], [X NULL PLAN_ROW 0 1.0
> | | | |--Clustered Index
Scan(OBJECT:([Cat].[dbo].[B].[PK])) 4 18 17 Clustered Index Scan Clustered
Index Scan OBJECT:([Cat].[dbo].[B].[PK]) [B].[FS], [B].[ID], [B].[FM],
[B].[FY], [B].[FZ], [B].[FH], [B].[FR], [B].[FE], [B].[FF], [B].[F_],
[B].[FJ], [X 714.0 5.1652573E-2 0.0008639 520 5.2516475E-2 [B].[FS],
[B].[ID], [B].[FM], [B].[FY], [B].[FZ], [B].[FH], [B].[FR], [B].[FE],
[B].[FF], [B].[F_], [B].[FJ], [X NULL PLAN_ROW 0 1.0
> | | |--Row Count Spool 4 22 8 Row Count Spool Lazy
Spool NULL NULL 1.0 0.0 0.0000001 4 0.41533375 NULL NULL PLAN_ROW 0
753.86066
> | | |--Index
Spool(SEEK:([FULLTEXT:A].[KEY]=[A].[ID])) 4 23 22 Index Spool Eager Spool
SEEK:([FULLTEXT:A].[KEY]=[A].[ID]) NULL 1.0 2.9842013E-2 7.9600002E-5 11
0.41530594 NULL NULL PLAN_ROW 0 278.02261
> | | |--Remote Scan(OBJECT:(CONTAINS)) 4
24 23 Remote Scan Remote Scan OBJECT:(CONTAINS) NULL 1000.0 0.0 0.36333334
15 0.36333334 [FULLTEXT:A].[KEY] NULL PLAN_ROW 0 1.0
> | |--Row Count Spool 4 25 7 Row Count Spool Lazy Spool
NULL NULL 1.0 0.0 1.7798983E-2 4 0.42855564 NULL NULL PLAN_ROW 0 678.47455
> | |--Index
Spool(SEEK:([FULLTEXT:B].[KEY]=[B].[FS])) 4 26 25 Index Spool Eager Spool
SEEK:([FULLTEXT:B].[KEY]=[B].[FS]) NULL 1.0 2.9842013E-2 7.9600002E-5 11
0.40765724 NULL NULL PLAN_ROW 0 181.93309
> | |--Remote Scan(OBJECT:(CONTAINS)) 4 27 26
Remote Scan Remote Scan OBJECT:(CONTAINS) NULL 1000.0 0.0 0.36333334 15
0.36333334 [FULLTEXT:B].[KEY] NULL PLAN_ROW 0 1.0
> |--Row Count Spool 4 28 6 Row Count Spool Lazy Spool NULL
NULL 1.0 0.0 0.01716849 4 0.42759585 NULL NULL PLAN_ROW 0 610.62708
> |--Index Spool(SEEK:([FULLTEXT:C].[KEY]=[C].[FO])) 4
29 28 Index Spool Eager Spool SEEK:([FULLTEXT:C].[KEY]=[C].[FO]) NULL 1.0
2.9842013E-2 7.9600002E-5 11 0.40763831 NULL NULL PLAN_ROW 0 181.69568
> |--Remote Scan(OBJECT:(CONTAINS)) 4 30 29
Remote Scan Remote Scan OBJECT:(CONTAINS) NULL 1000.0 0.0 0.36333334 15
0.36333334 [FULLTEXT:C].[KEY] NULL PLAN_ROW 0 1.0
>
> select *
> from A
> left outer join B on B.ID = A.ID
> left outer join C on C.ID = A.ID
> where contains(A.*, '"word"')
> or contains(B.*, '"word"')
> or contains(C.*, '"word"')
>
> **********************************************************************
> Sent via Fuzzy Software @ http://www.fuzzysoftware.com/
> Comprehensive, categorised, searchable collection of links to ASP &
ASP.NET resources...
- Previous message: Jason Robertson: "Re: DIFFICULT PROBLEM! SSL for SQL 2000 Server. MS Fix bulletin does not help at all."
- In reply to: malcolm.earnest_at_bom.co.uk: "Re: CONTAINS performance"
- Next in thread: Hilary Cotter: "Re: CONTAINS performance"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|