I am doing some test with clustered columnstore tables - both partitioned and not. In my research, I've seen flavors of this pattern:
-- step 1: make new table SELECT TOP 0 * INTO [Fact].[Order_CCI] FROM [Fact].[Order]; -- step 2: rowstore by date CREATE CLUSTERED INDEX [CCI_Order] ON [Fact].[Order_CCI] ( [Order Date Key] ) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [ps_OrderDate]([Order Date Key]) -- step 3: insert data INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK) SELECT [Order Key] ,[City Key] ,[Customer Key] ,[Stock Item Key] ,[Order Date Key] .... FROM [Fact].[Order] -- step 4: replace rowstore with columnstore CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Order] ON [Fact].[Order_CCI] WITH (DROP_EXISTING = ON, COMPRESSION_DELAY = 0) It uses a rowstore index to sort the data by the desired column before switching it to a CCI.
But in other articles or posts, I see the presence of an explicit "ORDER BY" in the insert statement:
INSERT INTO [Fact].[Order_CCI] WITH(TABLOCK) SELECT [Order Key] ,[City Key] ,[Customer Key] ,[Stock Item Key] ,[Order Date Key] .... FROM [Fact].[Order] ORDER BY [Order Date Key] I have these questions.
Is this the most performant code pattern for a large (500m row) table where partitioning is employed? In contrast, this author [https://janizajcbi.com/2018/09/14/row-store-to-column-store-story/] populates the partitions in a loop.
Is the ORDER BY needed? I ran this pattern on a small (30m row) table both with and without the ORDER BY and in both cases, the CCI metadata shows segments to be date aligned/sorted. I would think that the ORDER BY is not needed.
But I just ran the same on the large table, and the segments are not date aligned/sorted. But again note that it was a much larger table and was also partitioned.
Below is a more complete SQL for the test which I ran that DID NOT result in sorted segments
CREATE TABLE dbo.RiskExposure ( ... ); GO INSERT INTO dbo.RiskExposure WITH (TABLOCK) ( ... ) SELECT ... FROM dbo.RiskExposureStage WHERE Close_of_Bus_ > '2020-08-31' -- test with a subset GO CREATE CLUSTERED INDEX CCI_RiskExposure ON dbo.RiskExposure ( [Close_of_Bus_] ) WITH (DATA_COMPRESSION = PAGE) ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); CREATE CLUSTERED COLUMNSTORE INDEX CCI_RiskExposure ON dbo.RiskExposure WITH (DROP_EXISTING = ON) ON ps_QuarterlyPartitionScheme3m( [Close_of_Bus_] ); GO ALTER INDEX CCI_RiskExposure ON dbo.RiskExposure REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON); -- get stats SELECT partitions.partition_number, column_store_segments.segment_id, column_store_segments.min_data_id, column_store_segments.max_data_id, column_store_segments.row_count FROM sys.column_store_segments INNER JOIN sys.partitions ON column_store_segments.hobt_id = partitions.hobt_id INNER JOIN sys.indexes ON indexes.index_id = partitions.index_id AND indexes.object_id = partitions.object_id INNER JOIN sys.tables ON tables.object_id = indexes.object_id INNER JOIN sys.columns ON tables.object_id = columns.object_id AND column_store_segments.column_id = columns.column_id WHERE tables.name = 'RiskExposure' AND columns.name = 'Close_of_bus_' ORDER BY tables.name, columns.name, partitions.partition_number, column_store_segments.segment_id;