What is the correct and performant pattern for loading (and querying) a large clustered columnstore table?

Question

I am doing some test with clustered columnstore tables - both partitioned and not. In my research, I've seen flavors of this pattern:

-- step 1: make new table SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order]; -- step 2: rowstore by date CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI] ( [Order Date Key] ) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [ps_OrderDate]([Order Date Key]) -- step 3: insert data INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK) SELECT [Order Key] ,[City Key] ,[Customer Key] ,[Stock Item Key] ,[Order Date Key] .... FROM [Fact].[Order] -- step 4: replace rowstore with columnstore CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order] ON [Fact].[Order_CCI] WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0)

It uses a rowstore index to sort the data by the desired column before switching it to a CCI.

But in other articles or posts, I see the presence of an explicit "ORDER BY" in the insert statement:

INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK) SELECT [Order Key] ,[City Key] ,[Customer Key] ,[Stock Item Key] ,[Order Date Key] .... FROM [Fact].[Order] ORDER BY [Order Date Key]

I have these questions.

Is this the most performant code pattern for a large (500m row) table where partitioning is employed? In contrast, this author [https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/] populates the partitions in a loop.

Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and without the ORDER BY and in both cases, the CCI metadata shows segments to be date aligned/sorted. I would think that the ORDER BY is not needed.

But I just ran the same on the large table, and the segments are not date aligned/sorted. But again note that it was a much larger table and was also partitioned.

Below is the SQL pattern that DID NOT partition but DID result in sorted segments

CREATE TABLE dbo.RiskExposure ( ... ); GO INSERT INTO dbo.RiskExposure WITH (TABLOCK) ( ... ) SELECT ... FROM dbo.RiskExposureStage WHERE Close_of_Bus_ > '2020-08-31' GO CREATE CLUSTERED INDEX CCI_RiskExposure ON dbo.RiskExposure ([Close_of_Bus_]); GO CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure ON dbo.Risk WITH (MAXDOP = 1, DROP_EXISTING = ON);

Below is a more complete SQL for the test which I ran that DID partition but DID NOT result in sorted segments.

CREATE TABLE dbo.RiskExposure ( ... ); GO INSERT INTO dbo.RiskExposure WITH (TABLOCK) ( ... ) SELECT ... FROM dbo.RiskExposureStage WHERE Close_of_Bus_ > '2020-08-31' -- test with a subset GO CREATE CLUSTERED INDEX CCI_RiskExposure ON dbo.RiskExposure ( [Close_of_Bus_] ) WITH (DATA_COMPRESSION = PAGE) ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure ON dbo.RiskExposure WITH (DROP_EXISTING = ON) ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); GO ALTER INDEX CCI_RiskExposure ON dbo.RiskExposure REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON); -- get stats SELECT partitions.partition_number, column_store_segments.segment_id, column_store_segments.min_data_id, column_store_segments.max_data_id, column_store_segments.row_count FROM sys.column_store_segments INNER JOIN sys.partitions ON column_store_segments.hobt_id = partitions.hobt_id INNER JOIN sys.indexes ON indexes.index_id = partitions.index_id AND indexes.object_id = partitions.object_id INNER JOIN sys.tables ON tables.object_id = indexes.object_id INNER JOIN sys.columns ON tables.object_id = columns.object_id AND column_store_segments.column_id = columns.column_id WHERE tables.name = 'RiskExposure' AND columns.name = 'Close_of_bus_' ORDER BY tables.name, columns.name, partitions.partition_number, column_store_segments.segment_id;

Here's a subset of the stats for the test where the table was not partitioned. Note that the rowgroups are packed (except for #35) and ordered.

partition_number segment_id min_data_id max_data_id row_count 1 0 737668 737673 1048576 1 1 737673 737677 1048576 1 2 737677 737680 1048576 1 3 737680 737684 1048576 1 4 737684 737686 1048576 1 5 737686 737689 1048576 1 6 737689 737691 1048576 1 7 737691 737693 1048576 1 8 737693 737695 1048576 .... 1 37 737786 737789 1048576 1 38 737789 737794 1048576 1 39 737794 737799 742086 1 40 737799 737799 34421

Here are the stats for the test where partitions were employed. Note that the rowgroups are not ordered. Note I'm showing just the middle partition for brevity.

partition_number segment_id min_data_id max_data_id row_count 17 0 737698 737721 1048576 17 1 737698 737708 1048576 17 2 737698 737716 1048576 17 3 737698 737711 1048576 17 4 737712 737728 1048576 17 5 737716 737717 324458 17 6 737717 737744 1048576 17 7 737721 737723 1048576 17 8 737708 737768 1048576 17 9 737711 737788 1048576 17 10 737728 737734 1048576 17 11 737744 737747 1048576 17 12 737788 737789 643250 17 13 737723 737726 1048576 17 14 737768 737772 1048576 17 15 737734 737739 1048576 17 16 737747 737750 1048576 17 17 737726 737728 1048576 17 18 737772 737776 1048576 17 19 737728 737728 193276 17 20 737739 737742 1048576 17 21 737750 737752 1048576 17 22 737742 737743 314208 17 23 737776 737779 1048576 17 24 737752 737755 1048576 17 25 737779 737781 1048576 17 26 737755 737757 1048576 17 27 737781 737784 1048576 17 28 737757 737760 1048576 17 29 737784 737786 920474 17 30 737760 737764 686860 17 31 737716 737789 967708 17 32 737728 737743 507484

After re-running the query that does partitioning with the addition of MAXDOP=1, I am getting sorted segments just like in the non-partitioned test. So my error in leaving out that critical setting for the partitioned test. Now I am contemplating what is the correct logic for inserting new data. I often see this suggested: 1. create Staging Table 2. load Staging Table 3. add clustered index to Staging Table 4. INSERT INTO <Final Table> SELECT <list columns> FROM <Staging Table>

Step 4 isn't guaranteed to preserve the order, but if you run with MAXDOP=1 it typically will. You could also build the CCI on the staging table and then use partition switching. If you're loading from files, you could sort them and bulk load the CCI directly. — David Browne - Microsoft
– David Browne - Microsoft, Commented Jan 13, 2021 at 18:05

David Browne - Microsoft · Accepted Answer · 2021-01-10 16:51:53Z

The ORDER BY on INSERT is not guaranteed to load the data in order. In particular the INSERT . . . SELECT can be parallelized, and the resulting table will not be loaded in any particular order.

Even single-threaded, the ORDER BY is ignored by INSERT ... SELECT. e g

drop table if exists test go select * into test from FactInternetSales where 1=0 create clustered columnstore index cci_test on test insert into test --with (tablock) select s.* from FactInternetSales s order by ProductKey

The insert doesn't have a sort operator:

This is documented here:

When used with a SELECT...INTO statement to insert rows from another source, the ORDER BY clause does not guarantee the rows are inserted in the specified order.

SELECT - ORDER BY Clause (Transact-SQL)

And if the table is big enough, you can check the column segments to see that they are not in fact ordered by ProdctKey.

select object_name(p.object_id) table_name, s.segment_id, c.name column_name, s.min_data_id, s.max_data_id from sys.column_store_segments s join sys.partitions p on s.hobt_id = p.hobt_id left join sys.columns c on c.object_id = p.object_id and c.column_id = s.column_id where c.name = 'ProductKey'

J.D. · Accepted Answer · 2021-01-10 15:32:40Z

There may be other additional things to help improve the process you're trying to achieve but these are the couple things I'm aware of:

It will cause almost double work for your server to create a rowstore index first and then create a columnstore index after for your initial data load and creation of the indexes. But in doing so, you'll get a better aligned columnstore index that will result in more efficient querying against the table later on. So it's a trade-off between if you want to minimize server load upfront and pay for it when you query against the table later on, or take the load upfront and future querying that uses the columnstore index will be more efficient. This is dependent on what kind of querying and how often you'll be doing it.
If you choose to create the rowstore index then it would be most performant to insert the data into the indexless table first and then create the rowstore index after (so you would swap step 2 and step 3 above). Rowstore indexes typically sort the data most efficiently when it has the entire dataset upfront.
You should use the TABLOCK hint (as is in your first example) for the INSERT. This will minimize logging and automatically acquire a single table level lock instead of trying to acquire multiple row level locks that eventually get escalated anyway. So there is definitely a performance benefit here when using it for a large INSERT.
I agree with you, I don't believe the ORDER BY gives you any benefit when creating the columnstore index, but may affect alignment and future querying performance when your queries leverage the columnstore index (see #1).
I believe (though not positive on this one) that creating the columnstore index after the table is fully loaded with your data will be the most optimal time to create it (same reasoning as #2).

Those points aside, your table is at the point where I'd recommend using partitioning, as you'll be able to more granularly choose how you maintain that data (by partition), which means you can smartly do things more efficiently without blowing up the server with resource contention.

Stack Exchange Network

What is the correct and performant pattern for loading (and querying) a large clustered columnstore table?

2 Answers 2

Hot Network Questions

What is the correct and performant pattern for loading (and querying) a large clustered columnstore table?

2 Answers 2

Related

Hot Network Questions