Query governor problem

From: jan (anomalocarus_at_hotmail.com)
Date: 09/16/04


Date: 16 Sep 2004 02:07:08 -0700

Hi all,
We're running MS SQL 2000, latest service pack. Windows 2000 server,
advanced server and Windows 2003 server. Assorted hardware.
The above configuration refers to various client sites, and the SQL
we're having problems with is generated by an object/relational
mapping layer so we cannot easily change this.
Our problem is that certain queries are occasionally take far, far
longer to execute than at other times. My suspicion is that this
occurs when the database server is heavily loaded, and this causes it
to pick an alternative, and awful, execution plan. This only makes
things worse for everyone else. The query is given at the end of this
post.
The query will normally execute in less than a second. I can copy it
directly out of the profiler, which shows it is taking 1280 seconds of
cpu time and about 4500 seconds duration, paste it into the query
analyzer of the correct database, and a short while later have it
execute in a flash.
Statistics have very recently been updated. The data will not have
changed significantly either.
Over the next few days and we will be turning off intra query
parallelism and hyperthreading because we know these have caused us
problems in the past and may very well be the cause of things here,
but in the mean time as a very short term measure I have considered
using the query governor.
Unfortunately I don't think this will be the solution. I suspect that
the cpu estimate used by the query governor will always be for the
sub-second execution times (because if the query plan realised it
would take 20 minutes of cpu time to execute the query then it would
certainly have chosen something better, which it clearly can under the
right circumstances. If the query planner fails this way, and the
query governor is presumably driven by the estimate of the query
planner, then the query Governor will not catch these excessive
execution times). I'm pretty sure that the query governor only works
by estimates and does not actually 'watch over' the query and count
the cpu ticks.
So it looks like this will not help. This is a critical issue for our
customers. I'm therefore looking for some way of observing other
queries from a special watchdog job and killing them if the cpu
exceeds a certain amount, say a minute.
I have to look for single queries, not connections as many users are
serialised down to a few connections. Using the enterprise manager's
'process info' page I can see per-connection info but this it is not
fine-grained enough for what I want. I've discovered the sysperfinfo
table but that doesn't seem to help.
Any suggestions?
I suppose it would be better than nothing to be able to watch over
connections and if I could somehow detect excessive cpu activity on
that then I could kill that connection, even if it made a small group
of users unhappy. Better that than the whole site irate. Can anyone
suggest reliable way of doing that? How would I distinguish a very
busy but correctly functioning connection from one which was tied up
too long with a single query?
For what it's worth, when these mad queries start to execute their
lock-up the database server so tightly that it's almost impossible to
get in to kill things manually.

The query:

SELECT A14, A15, A16, A17, A18, A19, A20, A21, A22, A23
FROM (
           SELECT <field names removed for confidentiality
reasons>
           FROM <ditto table name> T13
           ) D24
WHERE ((A18 IN (21799))
            AND (A17 IN (13575))
           )
           OR ((A21 IN (2130740, 2413061, 2199038, 2107888,
                        2147294, 2390317, 2390947, 2262791,
                        2262540, 2175261, 2436256, 2163102,
                        2119434, 2272644, 2356412, 2240713,
                        2282838, 2318792, 2333253, 2230674,
                        2083129, 2402966, 2367886, 2187455,
                        2297886, 2251478, 2069408, 2309686,
                        2450636, 2424679, 2061268, 2219841
                       )
               )
               AND (A17 IN (18583))
              )
           OR ((A18 IN (2130740, 2413061, 2199038, 2107888,
                        2147294, 2390317, 2390947, 2262791,
                        2262540, 2175261, 2436256, 2163102,
                        2119434, 2272644, 2356412, 2240713,
                        2282838, 2318792, 2333253, 2230674,
                        2083129, 2402966, 2367886, 2187455,
                        2297886, 2251478, 2069408, 2309686,
                        2450636, 2424679, 2061268, 2219841
                       )
               )
               AND (A17 IN (14849, 15468))
              )
           OR ((A21 IN (2129425, 2412043, 2197272, 2106911,
                        2146319, 2389802, 2260770, 2173489,
                        2434610, 2161990, 2117737, 2271342,
                        2355571, 2239606, 2282111, 2316669,
                        2332031, 2229624, 2081664, 2401161,
                        2366604, 2186416, 2296450, 2250431,
                        2067383, 2308778, 2450673, 2423002,
                        2061282, 2218767
                       )
               )
               AND (A17 IN (18583))
              )
           OR ((A18 IN (2129425, 2412043, 2197272, 2106911,
                        2146319, 2389802, 2260770, 2173489,
                        2434610, 2161990, 2117737, 2271342,
                        2355571, 2239606, 2282111, 2316669,
                        2332031, 2229624, 2081664, 2401161,
                        2366604, 2186416, 2296450, 2250431,
                        2067383, 2308778, 2450673, 2423002,
                        2061282, 2218767
                       )
               )
               AND (A17 IN (14849, 15468))
              )
           OR ((A18 IN (75300, 84834, 77226, 74551, 75693,
                        83958, 79587, 76442, 85713, 76074,
                        74904, 79967, 82740, 78787, 80387,
                        81446, 81966, 78422, 73758, 84374,
                        83124, 76859, 80771, 79203, 81187,
                        86351, 85251, 73040, 78012
                       )
               )
               AND (A17 IN (15818, 15790))
              )
           OR ((A18 IN (6434, 6835, 6511, 6652, 7338, 7381,
                        7516, 7053, 8076, 6751, 7212, 8346,
                        5951
                       )
               )
               AND (A17 IN (15862))
              )
           OR ((A18 IN (88417, 102177, 91248, 87350, 88994,
                        101065, 94706, 90093, 103296, 86807,
                        89570, 87850, 95275, 99386, 93535,
                        95910, 92699, 97459, 98251, 92997,
                        86154, 101590, 99922, 90709, 96465,
                        94143, 85304, 97084, 99933, 87091,
                        104110, 102719, 84897, 100531, 92404
                       )
               )
               AND (A17 IN (38577, 37251, 12998))
              )
           OR ((A18 IN (6014, 6357, 6075, 6196, 6790, 6829,
                        6950, 6542, 7442, 6277, 6671, 7689,
                        5614
                       )
               )
               AND (A17 IN (15847))
              )
           OR ((A18 IN (21799))
               AND (A17 IN (14519))
              )

FYI I have seen queries like this cause problems in the past, and on a
single processor machines. It appears that large combinations of
'and's and 'or's can cause big problems with the query planner,
possibly made worse if the statistics read out of date (which they are
not here).

Any help appreciated

cheers

jan