Comparing grouped data

Question

Please see the DDL below:

 CREATE Table #Grouping1 (GroupID int, SystemID int, RecordID int, Name varchar(100), DateOfBirth datetime) INSERT INTO #Grouping1 values (1,1,1,'Mark Williams', '1980-01-01') INSERT INTO #Grouping1 values (1,2,128,'Mark Welliams', '1980-01-01') INSERT INTO #Grouping1 values (1,3,36,'Marko Williams', '1980-01-01') INSERT INTO #Grouping1 values (2,1,18,'Anne Smith', '1960-01-23') INSERT INTO #Grouping1 values (2,2,64,'Anna Smyth', '1960-01-23') INSERT INTO #Grouping1 values (2,8,23,'Annie Smith', '1960-01-23')

The table shows that there are three records in three systems for one person and three records in three systems for another person (the group id says the people are linked).

I am in the process of upgrading this system and the group IDs have changed. For example, please see the DDL below:

CREATE Table #Grouping2 (GroupID int, SystemID int, RecordID int, Name varchar(100), DateOfBirth datetime) INSERT INTO #Grouping2 values (187,1,1,'Mark Williams', '1980-01-01') INSERT INTO #Grouping2 values (187,2,128,'Mark Welliams', '1980-01-01') INSERT INTO #Grouping2 values (208,1,18,'Anne Smith', '1960-01-23') INSERT INTO #Grouping2 values (208,2,64,'Anna Smyth', '1960-01-23') INSERT INTO #Grouping2 values (208,8,23,'Annie Smith', '1960-01-23')

I want to check that all the people that were linked together are still linked together. I was thinking about doing something like this.

select * into #OldTable from ( select grouping1a.groupid, grouping1a.systemid as systemid1,grouping1a.recordid as recordid1, grouping1b.systemid as systemid2,grouping1b.recordid as recordid2 from #grouping1 as grouping1a inner join #grouping1 as grouping1b on grouping1a.groupid=grouping1b.groupid where not (grouping1a.SYSTEMID=grouping1B.SYSTEMID AND grouping1A.recordID=grouping1B.recordID) ) as OldTable select * into #NewTable from ( select grouping1a.groupid, grouping1a.systemid as systemid1,grouping1a.recordid as recordid1, grouping1b.systemid as systemid2,grouping1b.recordid as recordid2 from #grouping2 as grouping1a inner join #grouping2 as grouping1b on grouping1a.groupid=grouping1b.groupid where not (grouping1a.SYSTEMID=grouping1B.SYSTEMID AND grouping1A.recordID=grouping1B.recordID) ) As NewTable select distinct #OldTable.groupid from #oldtable full outer join #newtable on #oldtable.systemid1=#newtable.systemid1 and #oldtable.recordid1=#newtable.recordid1 and #oldtable.systemid2=#newtable.systemid2 and #oldtable.recordid2=#newtable.recordid2 where #oldtable.systemid1 is null or #newtable.systemid1 is null

This would show me all the groups that have a problem. I cannot do this because there are over 100,000,000 rows in #Grouping1 and #Grouping2. Is there another way to approach this e.g. using a Checksum or HashByte?

Gordon Linoff · Accepted Answer · 2015-10-16 16:21:25Z

I think you are looking for checksum_agg(), documented here. If you were doing this on names:

select groupId, checksum_agg(name) from #Grouping2 group by groupId;

Or by system/record:

select groupId, checksum_agg(cast(systemId as varchar(255)) + ':' + cast(recordId as varchar(255))) from #Grouping2 group by groupId;

You can then use full outer join to find differences between the summary tables. You seem to understand that part of the putting the query together.

+1 for Checksum_agg(). I believe that is what I was looking for. I am doing some testing now. I will report back.
@ Gordon Linoff, I have asked another question following on from this here: stackoverflow.com/questions/33174872/comparing-grouped-data. Could you take a look please?

Collectives™ on Stack Overflow

Comparing grouped data

1 Answer 1

4 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Related