
I have a large fact table and a much smaller dimension table in a simple star schema:

    --1.
    CREATE TABLE dbo.Dim
    (
        Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
        CustomerName VARCHAR(2000)
    );

    --index
    CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

    --2.
    CREATE TABLE dbo.Fact
    (
        ...
        PurchaseDate DATE,
        CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
        ...
    );

    --index
    CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

Running the following simple query, which filters on the fact table and joins in the dimension:

    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER JOIN dbo.Dim sd
        ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    )

We get the following ugly query plan: [image: query plan screenshot]

Interestingly, the dimension table scan reads ALL of its 500,000 rows in 4 iterations, even though in the end only a few thousand rows are needed for that date range of the fact table.

This is very inefficient with larger dimension tables: basically all the rows are scanned every time, as if the lookup table indexes were not even there.

The expected behavior would be that SQL Server first filters the fact table on the date range and then, using this limited set of CustomerNameId values, looks up the CustomerName from the small dimension table using an index seek.

  1. Is the star schema really this inefficient, or is there something I am missing here?
  2. In other words, how could I force SQL Server to prepare the limited set of CustomerNameId values and look up only those? (With a CTE somehow?)
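For reference, the staged lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: `#CustomerIds` is a hypothetical temp table name, and the `FORCESEEK` table hint merely requests a seek on the clustered primary key of `dbo.Dim`. It does not guarantee that the result is faster than the plan the optimizer picks on its own.

```sql
-- Sketch only: #CustomerIds is a hypothetical temp table, not part of the schema above.

-- Step 1: reduce the fact table to the distinct customer keys in the date range.
SELECT DISTINCT f.CustomerNameId
INTO #CustomerIds
FROM dbo.Fact AS f
WHERE f.PurchaseDate IN
(
    '20000506', '20000507', '20000508', '20000509',
    '20000501', '20000502', '20000503'
);

-- Step 2: look up only those keys in the dimension.
-- FORCESEEK asks the optimizer to access dbo.Dim through an index seek.
SELECT sd.Id, sd.CustomerName
FROM #CustomerIds AS c
INNER JOIN dbo.Dim AS sd WITH (FORCESEEK)
    ON sd.Id = c.CustomerNameId;

DROP TABLE #CustomerIds;
```

Comparing the actual execution plan and timings of this version against the plain join shows whether the seek-based lookup pays off for a given dimension size.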
  • I would expect a non-columnstore INDEX(CustomerNameId, PurchaseDate) to work nicely. And why use "columnstore" with star-schema -- they seem to compete with each other. Commented Aug 26, 2021 at 19:52

1 Answer

Here's a sample to play with:

    --1.
    CREATE TABLE dbo.Dim
    (
        Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
        CustomerName VARCHAR(2000)
    );

    --index
    CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

    -- populate the dimension with 100,000 rows
    with q as
    (
        select top 100000 row_number() over (order by (select null)) rn
        from sys.messages m, sys.objects o
    )
    insert into dim(CustomerName)
    select concat('CustomerName', rn)
    from q;

    --2.
    CREATE TABLE dbo.Fact
    (
        PurchaseDate DATE,
        CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
    );

    --index
    CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

    -- populate the fact table with 10,000,000 rows spread over 1,000 dates
    with q as
    (
        select top 10000000 row_number() over (order by (select null)) rn
        from sys.messages m, sys.objects o
    )
    insert into Fact(PurchaseDate, CustomerNameId)
    select dateadd(day, rn % 1000, '20000101'), 1 + rn % 100000
    from q;

    -- query 1: join type chosen by the optimizer
    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER JOIN dbo.Dim sd
        ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    );

    -- query 2: nested loops join forced with a join hint
    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER LOOP JOIN dbo.Dim sd
        ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    );

The plan is here.

You'll see that the loop join with the index seek is more expensive than scanning the dimension on each thread of the parallel execution and doing a hash join:

    (70000 rows affected)

     SQL Server Execution Times:
       CPU time = 62 ms,  elapsed time = 64 ms.

    (70000 rows affected)

     SQL Server Execution Times:
       CPU time = 108 ms,  elapsed time = 90 ms.
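Timings in this format are what `SET STATISTICS TIME ON` prints, so a comparison along these lines can be reproduced with a sketch like the one below. It assumes the sample tables above already exist and are populated. The second query uses the query-level `OPTION (LOOP JOIN)` hint instead of the `INNER LOOP JOIN` join hint, a variation that is not part of the test above.

```sql
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement
SET STATISTICS IO ON;    -- report logical reads per table

-- Baseline: the optimizer is free to pick the join type.
SELECT sd.CustomerName, f.*
FROM dbo.Fact AS f
INNER JOIN dbo.Dim AS sd
    ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN
(
    '20000506', '20000507', '20000508', '20000509',
    '20000501', '20000502', '20000503'
);

-- Variation: nested loops requested through a query hint.
SELECT sd.CustomerName, f.*
FROM dbo.Fact AS f
INNER JOIN dbo.Dim AS sd
    ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN
(
    '20000506', '20000507', '20000508', '20000509',
    '20000501', '20000502', '20000503'
)
OPTION (LOOP JOIN);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

One difference worth keeping in mind: a join hint written in the FROM clause (`INNER LOOP JOIN`) also fixes the join order as written, while `OPTION (LOOP JOIN)` restricts only the join algorithm and leaves the order to the optimizer. The two forms can therefore produce different plans and timings.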
  • Thank you David, especially for the extensive testing. I started to increase the number of dimension rows, and somewhere above 6-8 million rows it did indeed switch to an index seek. So it seems that a star schema indexed simply like this performs as expected. Commented Jul 9, 2021 at 20:17
