I have been doing some testing to try to better understand how SQL Server uses a histogram to estimate the number of rows that will match an equality predicate and also a < or > predicate
Given I am using the AdventureWorks2016 OLTP database
If can understand SQL Server's estimation process for = and > predicates:
/* update stats with fullscan first */ UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN Then I can see the histogram for the column TransactionHistory.Quantity
DBCC SHOW_STATISTICS ( 'Production.TransactionHistory', 'Quantity') The below screenshot is the top end of the histogram where I have run my tests:
The following query will estimate 6 rows as the value in the predicate is a RANGE_HI_KEY so uses the EQ_ROWS for that bucket:
SELECT * FROM Production.TransactionHistory WHERE Quantity = 2863 The following will estimate 1.36 rows as it is not a RANGE_HI_KEY so uses the AVG_RANGE_ROWS for the bucket it falls in:
SELECT * FROM Production.TransactionHistory WHERE Quantity = 2862 The following "greater than" query will estimate 130 rows which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863
SELECT * FROM Production.TransactionHistory WHERE Quantity > 2863 A similar query below, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 and appears to use the same method as above
SELECT * FROM Production.TransactionHistory WHERE Quantity > 2870 This all makes sense up to now so my testing moved onto a "less than" query
SELECT * FROM Production.TransactionHistory WHERE Quantity < 490 for this query, SQL Server estimates 109,579 rows but I can't work out where it has got that from:
RANGE_HI_KEY + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566 so we are 11 short somewhere.
How does SQL Server use the histogram to estimate the number of rows that will be returned by a "less than" predicate

