Part of my workload uses a CLR function that implements the spooky hash algorithm to compare rows to see if any column values have changed. The CLR function takes a binary string as an input, so I need a fast way to convert rows to a binary string. I expect to hash around 10 billion rows during the full workload so I'd like this code to be as fast as possible.
I have about 300 tables with different schemas. For the purposes of this question please assume a simple table structure of 32 nullable INT columns. I've provided sample data as well as a way to benchmark results at the bottom of this question.
Rows must be converted to the same binary string if all column values are the same. Rows must be converted to different binary strings if any column value is different. For example, code as simple as the following will not work:
CAST(COL1 AS BINARY(4)) + CAST(COL2 AS BINARY(4)) + .. It does not handle NULLs correctly. If COL1 is NULL for row 1 and COL2 is NULL for row 2 then both rows will be converted to a NULL string. I believe that correct handling of NULLs is the hardest part of converting the entire row correctly.
To preempt some questions:
- If it matters, a significant majority of the time (90%+) the columns won't be NULL.
- I have to use the CLR.
- I have to hash this many rows. I cannot persist the hashes.
- I have to do this in SQL Server.
- I believe that I cannot use batch mode for the conversion due to the presence of the CLR function.
What is the fastest way to convert 32 nullable INT columns to a BINARY(X) or VARBINARY(X) string?
Sample data and code as promised:
-- create sample data DROP TABLE IF EXISTS dbo.TABLE_OF_32_INTS; CREATE TABLE dbo.TABLE_OF_32_INTS ( COL1 INT NULL, COL2 INT NULL, COL3 INT NULL, COL4 INT NULL, COL5 INT NULL, COL6 INT NULL, COL7 INT NULL, COL8 INT NULL, COL9 INT NULL, COL10 INT NULL, COL11 INT NULL, COL12 INT NULL, COL13 INT NULL, COL14 INT NULL, COL15 INT NULL, COL16 INT NULL, COL17 INT NULL, COL18 INT NULL, COL19 INT NULL, COL20 INT NULL, COL21 INT NULL, COL22 INT NULL, COL23 INT NULL, COL24 INT NULL, COL25 INT NULL, COL26 INT NULL, COL27 INT NULL, COL28 INT NULL, COL29 INT NULL, COL30 INT NULL, COL31 INT NULL, COL32 INT NULL ); INSERT INTO dbo.TABLE_OF_32_INTS WITH (TABLOCK) SELECT 0, 123, 12345, 1234567, 123456789 , 0, 123, 12345, 1234567, 123456789 , 0, 123, 12345, 1234567, 123456789 , 0, 123, 12345, 1234567, 123456789 , 0, 123, 12345, 1234567, 123456789 , 0, 123, 12345, 1234567, 123456789 , NULL, -876545321 FROM ( SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN FROM master..spt_values t1 CROSS JOIN master..spt_values t2 ) q OPTION (MAXDOP 1); GO -- procedure to test performance CREATE OR ALTER PROCEDURE #p AS BEGIN SET NOCOUNT ON; DECLARE @counter INT = 0, @dummy VARBINARY(8000); WHILE @counter < 10 BEGIN SELECT @dummy = -- this code is clearly incomplete as it does not handle NULLs CAST(COL1 AS BINARY(4)) + CAST(COL2 AS BINARY(4)) + CAST(COL3 AS BINARY(4)) + CAST(COL4 AS BINARY(4)) + CAST(COL5 AS BINARY(4)) + CAST(COL6 AS BINARY(4)) + CAST(COL7 AS BINARY(4)) + CAST(COL8 AS BINARY(4)) + CAST(COL9 AS BINARY(4)) + CAST(COL10 AS BINARY(4)) + CAST(COL11 AS BINARY(4)) + CAST(COL12 AS BINARY(4)) + CAST(COL13 AS BINARY(4)) + CAST(COL14 AS BINARY(4)) + CAST(COL15 AS BINARY(4)) + CAST(COL16 AS BINARY(4)) + CAST(COL17 AS BINARY(4)) + CAST(COL18 AS BINARY(4)) + CAST(COL19 AS BINARY(4)) + CAST(COL20 AS BINARY(4)) + CAST(COL21 AS BINARY(4)) + CAST(COL22 AS BINARY(4)) + CAST(COL23 AS BINARY(4)) + CAST(COL24 AS BINARY(4)) + CAST(COL25 AS BINARY(4)) + CAST(COL26 AS BINARY(4)) + CAST(COL27 AS BINARY(4)) + CAST(COL28 AS BINARY(4)) + CAST(COL29 AS BINARY(4)) + CAST(COL30 AS BINARY(4)) + CAST(COL31 AS BINARY(4)) + CAST(COL32 AS BINARY(4)) FROM dbo.TABLE_OF_32_INTS OPTION (MAXDOP 1); SET @counter = @counter + 1; END; SELECT cpu_time FROM sys.dm_exec_requests WHERE session_id = @@SPID; END; GO -- run procedure EXEC #p;