Efficient dimension and fact joining

Question

I have a large fact table and a much smaller dimension table in a simple star schema:

    --1.
    CREATE TABLE dbo.Dim
    (
        Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
        CustomerName VARCHAR(2000)
    )

    --index
    CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

    --2.
    CREATE TABLE dbo.Fact
    (
        ...
        PurchaseDate DATE,
        CustomerNameId INT
            CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
        ...
    )

    --index
    CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

I run the following simple query, which filters on the fact table and joins in the dimension:

    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    )

We get the following ugly query plan:

Interestingly, the dimension table is scanned in full, all 500,000 rows, in 4 iterations, even though only a few thousand rows are needed for that date range of the fact table.

This is very inefficient with larger dimension tables: every row is scanned every time, as if the indexes on the lookup table were not there at all.

I would expect SQL Server to first restrict the fact table to the date range, and then use the resulting limited set of CustomerNameId values to look up CustomerName in the small dimension table with an index seek.

Is the star schema really this inefficient, or am I missing something?
In other words, how can I force SQL Server to build the limited set of CustomerNameId values and look up only those (with a CTE, perhaps)?

I would expect a non-columnstore INDEX(CustomerNameId, PurchaseDate) to work nicely. And why use "columnstore" with star-schema -- they seem to compete with each other.
– Rick James, Commented Aug 26, 2021 at 19:52

David Browne - Microsoft · Accepted Answer · 2021-07-09 16:57:28Z

Here's a sample to play with:

    --1.
    CREATE TABLE dbo.Dim
    (
        Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
        CustomerName VARCHAR(2000)
    )

    --index
    CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

    with q as
    (
        select top 100000 row_number() over (order by (select null)) rn
        from sys.messages m, sys.objects o
    )
    insert into dim(CustomerName)
    select concat('CustomerName',rn)
    from q

    --2.
    CREATE TABLE dbo.Fact
    (
        PurchaseDate DATE,
        CustomerNameId INT
            CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
    )

    --index
    CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

    with q as
    (
        select top 10000000 row_number() over (order by (select null)) rn
        from sys.messages m, sys.objects o
    )
    insert into Fact(PurchaseDate,CustomerNameId)
    select dateadd(day,rn%1000,'20000101'), 1+rn%100000
    from q

    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    )

    SELECT sd.CustomerName, f.*
    FROM dbo.Fact f
    INNER LOOP JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
    WHERE f.PurchaseDate IN
    (
        '20000506', '20000507', '20000508', '20000509',
        '20000501', '20000502', '20000503'
    )

The plan is here.

You'll see that the loop join with the index seek is more expensive than scanning the dimension on each thread of the parallel execution and doing a hash join. The first set of timings is for the unhinted query, the second for the forced loop join:

    (70000 rows affected)

     SQL Server Execution Times:
       CPU time = 62 ms,  elapsed time = 64 ms.

    (70000 rows affected)

     SQL Server Execution Times:
       CPU time = 108 ms,  elapsed time = 90 ms.
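Timings in this format are what SET STATISTICS TIME ON reports. As a minimal sketch for reproducing the comparison, assuming the dbo.Dim and dbo.Fact sample tables above have been created and populated, something like the following could be run in one session. Turning on SET STATISTICS IO as well is my addition, not part of the original test, and the numbers will differ with hardware, data volume and degree of parallelism.

```sql
-- Sketch only: assumes the dbo.Dim and dbo.Fact sample tables above exist.
SET STATISTICS TIME ON;  -- report CPU and elapsed time for each statement
SET STATISTICS IO ON;    -- addition for illustration: report reads per table

-- Query 1: no hint, the optimizer picks the join strategy.
SELECT sd.CustomerName, f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN
(
    '20000506', '20000507', '20000508', '20000509',
    '20000501', '20000502', '20000503'
);

-- Query 2: same query, with the LOOP join hint forcing nested loops.
SELECT sd.CustomerName, f.*
FROM dbo.Fact f
INNER LOOP JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN
(
    '20000506', '20000507', '20000508', '20000509',
    '20000501', '20000502', '20000503'
);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

As a sanity check on the row counts: the sample spreads the fact rows evenly over 1,000 dates (rn%1000), so, assuming the cross join yields the full 10,000,000 rows, each date holds 10,000 rows and the seven filtered dates return 70,000 rows, which matches the output above.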

Thank you David, especially for the extensive testing. I started increasing the number of dimension rows, and somewhere above 6-8 million rows the plan did indeed switch to an index seek. So it seems a star schema indexed simply like this performs as expected.
– Avi, Commented Jul 9, 2021 at 20:17
