Re: Index chosen dependant on result columns
From: Steve Kass (skass_at_drew.edu)
Date: 05/07/04
- Next message: Dan Guzman: "Re: select + group by"
- Previous message: Mike: "Adding Preceeding zeros"
- In reply to: Andy Gilman: "Re: Index chosen dependant on result columns"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 06 May 2004 21:07:28 -0400
Andy,
Index hints can be put into a query (see Books Online) but they are
rarely necessary. I don't think I've ever seen the optimizer ignore a
non-clustered covering index (an index that contains all the columns
referred to in the query) in a situation like this, but if it did, I'm
sure there would be a good reason. I would recommend against using an
index hint unless you find your query running slowly and even then, I'd
check first to see if statistics on the indexes need to be updated more
frequently, or if something other than the query plan is causing the
problem.
There's nothing at all wrong with tailoring indexes to a specific
workload - that's part of database tuning. Assuming the volume of data
you handle will stay relatively constant and the speed you are seeing is
acceptable, you can probably leave things the way you have them. I
can't say that an index hint is likely to do harm, but I don't suggest
using them unless the optimizer is coming up with incorrect estimates,
which doesn't seem to be the case here.
If you do want to look further, there are other indexing possibilities
that could be even better for your query than what you have, without
changing the clustered index:
two nonclustered indexes, one on (B,D) and one on (C)
two nonclustered indexes, one on (B) and one on (C,D)
one nonclustered index on (C,B,D) instead of the one you have
Whether any of these are better will depend on the relative number of
rows satisfying each inequality, and the degree to which the qualifying
rows for each inequality overlap. There shouldn't be too much
difference in the overhead of maintaining the indexes. With one of the
first two choices, the optimizer will consider index seeks on separate
indexes for each half of the predicate, followed by a hash match to
identify the rows satisfying both. That plan can pick up the value of D
from whichever index contained D. If one inequality is consistently
more selective, you should put D in that index, especially if D is a
wide column. The advantage to this indexing solution is that both
inequalities in your query can be optimized with index seeks, but the
query is still covered.
(C,B,D) may help if the C inequality is usually more selective than the
B inequality, since the (B,C,D) index can only seek on the inequality on
B, and will have to check the other inequality row by row. The
improvement you're seeing by adding D to the index could come from the
selectivity on B or because (B,C,D,A) contains many fewer bytes than a
table row.
It's even conceivable that your best solution is to have all of these
indexes, if there is enough variety in the values sent to this query
that there is no one indexing strategy that works well for all values.
Steve
Andy Gilman wrote:
>thanks for your clear explanation. i was thinking this was kind of the
>problem but wanted to get a better voice of experience. much appreciated. oh
>and i dont tend to use * except for in Query Analyzer so hopefully that
>problem wont be a ptoblem.
>
>my first remaining concern is that i have now created an index which 'knows'
>too much about the use its being put to. i was forced to create it because
>the two queries mentioned below took 99% and 1% when ran together. In fact
>the clustered index seek was so slow that the program would actually timeout
>waiting for a result. if this had been a production application with many
>users i'd be in trouble. (it *is* a production app - just very few internal
>users).
>
>my other concern is that the optimizer wont always choose the same index and
>therefore i will have a hard time guaranteeing that this will always work.
>
>am i correct in assuming that this is pretty much the only choice i have
>here
>
>-simon
>
>"Steve Kass" <skass@drew.edu> wrote in message
>news:uEqIvT8MEHA.2628@TK2MSFTNGP12.phx.gbl...
>
>
>>Andy,
>>
>> The result columns have a great deal to do with the optimal query
>>plan. Your nonclustered index on B,C is like a table that contains
>>columns B and C from your table, maintained in increasing order, along
>>with column A, the key of the clustered index of the table, A, to refer
>>to the row of the table corresponding to each row in the index. (There
>>may also be a hidden uniquifier along with A, if the clustered index is
>>not unique). The "table" in this case is the clustered index on A,
>>which contains all the data of the table, i.e., all columns.
>>
>> If you want to include D in each row of your result, the clustered
>>index must be consulted, since the values of D are not in the
>>non-clustered index on (B,C). This can be very time-consuming, since it
>>generally must be done by making a separate lookup in the clustered
>>index to find the D value for each row returned. Suppose the optimizer
>>estimates that 10000 rows will satisfy the condition B > 1 and C > 2.
>>The cost to use the nonclustered index will include 10000 lookups in the
>>clustered index, since while the non-clustered index can identify the
>>rows from which your results must come, it cannot find the D values to
>>return, since the nonclustered index only contains values for columns B,
>>C, and A. This may well require accessing most of the data pages of the
>>table, even if the table contains hundreds of thousands of rows, since
>>the optimizer will assume that these 10000 rows are distributed
>>uniformly throughout the table. It seems this is what you are
>>encountering, and the optimizer finds no benefit to using the index, and
>>instead chooses the plan that simply scans the entire clustered index,
>>evaluating B > 1 and C > 2 for every row.
>>
>> Unfortunately, while you can (at the expense of increased work to
>>maintain indexes) include every column in every index, there is no way
>>to have newly added columns automatically added to nonclustered
>>indexes. But I think one should expect to need a number of changes to
>>implement a change in table structure - ideally, table structure is
>>settled well before indexes are created and workloads are tuned, and
>>changes rarely if ever. I also don't recommend the use of SELECT * in
>>production queries; a query should return the same set of columns each
>>time it is run - if there's some reason this is unacceptable, it's
>>likely there is a more fundamental problem with the design of the
>>database, which may contain repeated groups or some other violation of
>>first normal form.
>>
>> That said, I think you're probably ahead of the game, since you
>>recognize the benefit of putting an index on more than one column, which
>>many people never do, and since you are asking the right questions and
>>running the right experiments!
>>
>>-- Steve Kass
>>-- Drew University
>>-- Ref: 4703ADA2-C939-4294-8AB4-B98421384A7E
>>
>>Andy Gilman wrote:
>>
>>
>>
>>>I have a table with columns A, B, C, D, E
>>>
>>>I have a clustered index on A
>>>I have a non clustered index on B, C
>>>
>>>I have a query
>>>
>>>SELECT A, D FROM Table
>>>WHERE B > 1 AND C > 2
>>>
>>>and
>>>
>>>SELECT A, B, C FROM Table
>>>WHERE B > 1 AND C > 2
>>>
>>>I obviously want it to use my index on B, C for these queries. The first
>>>query runs very slowly and the second very quickly.
>>>
>>>I went to the execution plan to see what was going on.
>>>
>>>Turns out the B, C index is only used provided that the only columns in
>>>
>>>
>the
>
>
>>>SELECT clause are also in the index. By adding column D to my index the
>>>first query runs just as fast. It also appears column A can be added
>>>
>>>
>without
>
>
>>>problem since this is the primary key (?)
>>>
>>>I am now confused because I dont see why the result columns should
>>>
>>>
>influence
>
>
>>>which index is used. This could be very bad if people are using SELECT *
>>>
>>>
>and
>
>
>>>a new column is added.
>>>
>>>Am I doing something wrong. I thought I was getting the hang of indexes
>>>
>>>
>but
>
>
>>>I dont see how to make efficient indexes if they have to include result
>>>columns too.
>>>
>>>-simon
>>>
>>>
>>>
>>>
>>>
>>>
>
>
>
>
- Next message: Dan Guzman: "Re: select + group by"
- Previous message: Mike: "Adding Preceeding zeros"
- In reply to: Andy Gilman: "Re: Index chosen dependant on result columns"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|