
I have been doing some testing to better understand how SQL Server uses a histogram to estimate the number of rows that will match an equality predicate, and also a < or > predicate.

For these tests I am using the AdventureWorks2016 OLTP database.

I can follow SQL Server's estimation process for = and > predicates:

/* update stats with fullscan first */
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN;

Then I can see the histogram for the column TransactionHistory.Quantity

DBCC SHOW_STATISTICS ( 'Production.TransactionHistory', 'Quantity') 

The screenshot below shows the top end of the histogram, which is where I have run my tests:

[screenshot: top end of the Quantity histogram returned by DBCC SHOW_STATISTICS]
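
If you only want the histogram rows, without the stats header and density vector, the same command can be restricted with the HISTOGRAM option:

DBCC SHOW_STATISTICS ('Production.TransactionHistory', 'Quantity') WITH HISTOGRAM;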

The following query is estimated at 6 rows, because the value in the predicate is a RANGE_HI_KEY, so the EQ_ROWS for that bucket is used:

SELECT * FROM Production.TransactionHistory WHERE Quantity = 2863 

The following is estimated at 1.36 rows, because the value is not a RANGE_HI_KEY, so the AVG_RANGE_ROWS for the bucket it falls into is used:

SELECT * FROM Production.TransactionHistory WHERE Quantity = 2862 
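
As a side note, these estimates can be confirmed without actually running the queries by capturing the estimated plan, for example with SET SHOWPLAN_XML; the EstimateRows attribute in the returned showplan XML should show the 6 and 1.36 figures above:

SET SHOWPLAN_XML ON;
GO
SELECT * FROM Production.TransactionHistory WHERE Quantity = 2863;
SELECT * FROM Production.TransactionHistory WHERE Quantity = 2862;
GO
SET SHOWPLAN_XML OFF;
GO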

The following "greater than" query will estimate 130 rows which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863

SELECT * FROM Production.TransactionHistory WHERE Quantity > 2863 

A similar query follows, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 rows and appears to use the same method as above:

SELECT * FROM Production.TransactionHistory WHERE Quantity > 2870 
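
To sanity-check that arithmetic, the sum can be reproduced directly from the histogram. This is only a sketch, assuming sys.dm_db_stats_histogram is available (SQL Server 2016 SP1 CU2 or later) and that a single statistics object covers Quantity; on this data it should return roughly 130:

SELECT SUM(h.range_rows + h.equal_rows) AS estimated_rows
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id
    AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id
    AND c.column_id = sc.column_id
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID(N'Production.TransactionHistory')
    AND c.name = N'Quantity'
    AND CAST(h.range_high_key AS int) > 2863;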

This all makes sense up to this point, so my testing moved on to a "less than" query:

SELECT * FROM Production.TransactionHistory WHERE Quantity < 490 

For this query, SQL Server estimates 109,579 rows, but I can't work out where it has got that from:

EQ_ROWS + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566, so we are 11 short somewhere.

How does SQL Server use the histogram to estimate the number of rows that will be returned by a "less than" predicate?


1 Answer


For this query, SQL Server estimates 109,579 rows, but I can't work out where it has got that from:

EQ_ROWS + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566, so we are 11 short somewhere.

You're 13 short, not 11: 109,579 - 109,566 = 13.

The general idea, as shown in my related answer, is to use linear interpolation within the partial step, assuming uniformity.

In your case:

[histogram fragment: the step after RANGE_HI_KEY 470 has RANGE_HI_KEY = 500, RANGE_ROWS = 23, DISTINCT_RANGE_ROWS = 6, AVG_RANGE_ROWS ≈ 3.8333]

So the question is how many of those 23 RANGE_ROWS we expect to match the predicate < 490 when they are assumed to be distributed uniformly within the histogram step with RANGE_HI_KEY 500:

DECLARE
    @ARR float = 23e0 / 6e0,  -- AVG_RANGE_ROWS
    @DRR float = 6e0,         -- DISTINCT_RANGE_ROWS
    @PR  float = 490 - 470,   -- predicate range
    @SR  float = 499 - 470;   -- whole step range (excluding high key)

SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;

This computation gives 13.00595.
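
As an aside, the two divisions by @SR cancel, so the same number can be checked with the simplified form (@DRR - 1) * (@PR - 1) / (@SR - 1) * @ARR (this rearrangement is mine, not a separate optimizer formula):

SELECT (6e0 - 1) * (20e0 - 1) / (29e0 - 1) * (23e0 / 6e0);  -- 5 * 19/28 * 23/6 ≈ 13.00595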

The -1 factors account for using <, which is assumed to exclude one DISTINCT_RANGE_ROWS row. When <= is used, that row is assumed to match the predicate.

The whole thing is a modification of applying the fraction of the range you are asking for versus the range covered by the histogram step. Without excluding the unmatched value, it would simply be @PR / @SR.
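
Putting the two parts together, the 109,566 rows from the steps up to and including RANGE_HI_KEY 470 plus the interpolated fraction of the 500 step reproduces the displayed estimate. A quick check using the figures above (the plan value of 109,579 appears to be this rounded for display):

DECLARE
    @BaseRows float = 109566e0,  -- EQ_ROWS + RANGE_ROWS for steps up to and including 470
    @ARR float = 23e0 / 6e0,
    @DRR float = 6e0,
    @PR  float = 490 - 470,
    @SR  float = 499 - 470;

SELECT @BaseRows
    + (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR AS estimated_rows;
-- ≈ 109579.006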

