How to avoid inserting duplicate records when using a T-SQL Merge statement

Question

I am attempting to insert many records using T-SQL's MERGE statement, but my query fails to INSERT when there are duplicate records in the source table. The failure is caused by:

- The target table has a Primary Key based on two columns
- The source table may contain duplicate records that violate the target table's Primary Key constraint ("Violation of PRIMARY KEY constraint" is thrown)

I'm looking for a way to change my MERGE statement so that it either ignores duplicate records within the source table and/or will try/catch the INSERT statement to catch exceptions that may occur (i.e. all other INSERT statements will run regardless of the few bad eggs that may occur) - or, maybe, there's a better way to go about this problem?

Here's a query example of what I'm trying to explain. The example below adds 100k records to a temp table and then attempts to insert those records into the target table.

EDIT In my original post I only included two fields in the example tables which gave way to SO friends to give a DISTINCT solution to avoid duplicates in the MERGE statement. I should have mentioned that in my real-world problem the tables have 15 fields and of those 15, two of the fields are a CLUSTERED PRIMARY KEY. So the DISTINCT keyword doesn't work because I need to SELECT all 15 fields and ignore duplicates based on two of the fields.

I have updated the query below to include one more field, col4. I need to include col4 in the MERGE, but I only need to make sure that ONLY col2 and col3 are unique.

-- Create the source table
CREATE TABLE #tmp (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int
)
GO

-- Add a bunch of test data to the source table
-- For testing purposes, allow duplicate records to be added to this table
DECLARE @loopCount int = 100000
DECLARE @loopCounter int = 0
DECLARE @randDateOffset int
DECLARE @col2 datetime
DECLARE @col3 int
DECLARE @col4 int

WHILE (@loopCounter) < @loopCount
BEGIN
    SET @randDateOffset = RAND() * 100000
    SET @col2 = DATEADD(MI,@randDateOffset,GETDATE())
    SET @col3 = RAND() * 1000
    SET @col4 = RAND() * 10

    INSERT INTO #tmp (col2,col3,col4)
    VALUES (@col2,@col3,@col4);

    SET @loopCounter = @loopCounter + 1
END

-- Insert the source data into the target table
-- How do we make sure we don't attempt to INSERT a duplicate record? Or how can we
-- catch exceptions? Or?
MERGE INTO dbo.tbl1 AS tbl
USING (SELECT * FROM #tmp) AS src
    ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
    INSERT (col2,col3,col4)
    VALUES (src.col2,src.col3,src.col4);
GO

You have to decide what row you should pick col4 from when there are duplicates for col2 and col3 in #tmp. For example, you can use group by col2, col3 and min(col4) as col4.
– Mikael Eriksson, Commented Jul 6, 2011 at 16:02

t-clausen.dk · Accepted Answer · 2011-07-06 16:01:25Z

Updated for your new specification. This version inserts only the highest value of col4 for each (col2, col3) pair, using a GROUP BY to prevent duplicate rows.

MERGE INTO dbo.tbl1 AS tbl
USING (
    SELECT col2,col3, max(col4) col4
    FROM #tmp
    group by col2,col3
) AS src
    ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
    INSERT (col2,col3,col4)
    VALUES (src.col2,src.col3,src.col4);

I made the mistake of using only two fields in my query example. In fact, my target table has more than two fields, but only two of them make up the clustered PK. So the DISTINCT solution that you suggested will not suffice. I updated my original post to reflect the additional fields. Regardless, thanks for your original reply (+1 for answering the question as is).
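To make the point about DISTINCT concrete, here is a minimal sketch. The table #demo and its three rows are made up for illustration; only the column names follow the question. DISTINCT compares every selected column, so two rows that share a key but differ in col4 both survive.

```sql
-- Hypothetical demo table (not part of the question), same column layout as #tmp
CREATE TABLE #demo (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int NULL
);

INSERT INTO #demo (col2, col3, col4)
VALUES ('2011-07-06T00:00:00', 1, 5),
       ('2011-07-06T00:00:00', 1, 5),   -- exact duplicate of the first row
       ('2011-07-06T00:00:00', 1, 7);   -- same key (col2, col3), different col4

-- DISTINCT removes only the exact duplicate; two rows with the same key remain
SELECT DISTINCT col2, col3, col4
FROM #demo;

-- Keys that still occur more than once after DISTINCT
SELECT col2, col3, COUNT(*) AS RowsPerKey
FROM (SELECT DISTINCT col2, col3, col4 FROM #demo) AS d
GROUP BY col2, col3
HAVING COUNT(*) > 1;

DROP TABLE #demo;
```

The second query returns the one key with RowsPerKey = 2, which is exactly the situation that would still violate a primary key on (col2, col3).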

Sarsaparilla · Accepted Answer · 2014-07-10 19:12:42Z

Instead of GROUP BY you can use an analytic function, allowing you to select a specific record in the set of duplicate records to merge.

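MERGE INTO dbo.tbl1 AS tbl
USING (
    SELECT *
    FROM (
        -- ModifiedDate is not a column of the question's #tmp;
        -- order by whichever column decides which duplicate should win
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY ModifiedDate DESC) AS Rn
        FROM #tmp
    ) t
    WHERE Rn = 1 -- choose the most recently modified record
) AS src
    ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
-- WHEN clause carried over from the question's MERGE to complete the statement
WHEN NOT MATCHED THEN
    INSERT (col2,col3,col4)
    VALUES (src.col2,src.col3,src.col4);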

Solved my problem with MERGE when there are duplicate rows in the source table. Nice solution. Thanks for sharing.

gbn · Accepted Answer · 2011-07-06 07:15:57Z

Given the source has duplicates and you aren't using MERGE fully, I'd use an INSERT.

INSERT dbo.tbl1 (col2,col3)
SELECT DISTINCT col2,col3
FROM #tmp src
WHERE NOT EXISTS (
    SELECT *
    FROM dbo.tbl1 tbl
    WHERE tbl.col2 = src.col2 AND tbl.col3 = src.col3
)

MERGE fails because the match condition is not evaluated row by row as rows are inserted. All non-matching source rows are found first, and then it tries to INSERT all of them. It doesn't check whether rows in the same batch duplicate one another.

This reminds me a bit of the "Halloween problem", where early data changes of an atomic operation affect later data changes, which isn't correct.
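The behaviour described above can be reproduced with two rows. This is a minimal sketch, assuming hypothetical temp tables #merge_target and #merge_source that mirror the question's columns. It also shows what wrapping the MERGE in TRY/CATCH achieves: the error is caught, but MERGE is a single atomic statement, so none of its rows are inserted.

```sql
-- Hypothetical tables for illustration; the target has the two-column primary key
CREATE TABLE #merge_target (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int NULL,
    PRIMARY KEY (col2, col3)
);

CREATE TABLE #merge_source (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int NULL
);

-- Two source rows with the same key; the target is empty, so both are "not matched"
INSERT INTO #merge_source (col2, col3, col4)
VALUES ('2011-07-06T00:00:00', 1, 10),
       ('2011-07-06T00:00:00', 1, 20);

BEGIN TRY
    MERGE INTO #merge_target AS tbl
    USING #merge_source AS src
        ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
    WHEN NOT MATCHED THEN
        INSERT (col2, col3, col4)
        VALUES (src.col2, src.col3, src.col4);
END TRY
BEGIN CATCH
    -- The primary key violation surfaces here
    SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
END CATCH;

-- The whole MERGE was rolled back, so the target is still empty
SELECT COUNT(*) AS RowsInTarget
FROM #merge_target;

DROP TABLE #merge_source;
DROP TABLE #merge_target;
```

Because the failed statement inserts nothing, catching the exception does not let "all other INSERTs run"; the duplicates have to be removed from the source set before the MERGE or INSERT sees them.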

edited Jul 6, 2011 at 7:15

answered Jul 6, 2011 at 7:07

7 Comments

t-clausen.dk Over a year ago

I have not tested my own script; are you saying it fails?

gbn Over a year ago

@t-clausen.dk: You have a DISTINCT too so should be OK. Given the limits of MERGE why not just use an INSERT I reckon...

gbn Over a year ago

@Jed: If the key is the same for 2 rows, why are other columns (col4 for your update) different? This implies an incomplete key, for example. There is no solution for this because we can never know which of the 2 rows to take...

Jed Over a year ago

@gbn - Although the other columns will most likely not be different, it is logically possible in my real-world situation. In the case where there are duplicate PK's with different col4 values, I am fine with selecting any of the duplicate records and discarding all others.

gbn Over a year ago

Just add a MAX(col4), GROUP BY col2, col3 then to my select. Add MAX for other columns too. Now, this brings another question if you have 2 rows: do you want values from one row only? If so, MAX won't do it. That is, if you add MAX(col5) then col4 and col5 may come from different rows. I would add that if you don't care what row then those columns should be ignored completely
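The difference between per-column MAX and picking one whole row can be seen with two duplicate rows. This is a sketch only: the table #dupes, its data, and the extra column col5 are hypothetical, following the col5 mentioned in the comment above.

```sql
-- Hypothetical table: two rows with the same key but different col4 / col5
CREATE TABLE #dupes (
    col2 datetime NOT NULL,
    col3 int NOT NULL,
    col4 int NULL,
    col5 int NULL
);

INSERT INTO #dupes (col2, col3, col4, col5)
VALUES ('2011-07-06T00:00:00', 1, 9, 1),
       ('2011-07-06T00:00:00', 1, 2, 8);

-- MAX per column: returns col4 = 9 and col5 = 8,
-- a combination that exists in neither source row
SELECT col2, col3, MAX(col4) AS col4, MAX(col5) AS col5
FROM #dupes
GROUP BY col2, col3;

-- ROW_NUMBER: returns one complete source row per key (here col4 = 9, col5 = 1)
SELECT col2, col3, col4, col5
FROM (
    SELECT col2, col3, col4, col5,
           ROW_NUMBER() OVER (PARTITION BY col2, col3
                              ORDER BY col4 DESC, col5 DESC) AS rn
    FROM #dupes
) AS d
WHERE rn = 1;

DROP TABLE #dupes;
```

If any one of the duplicates is acceptable, as stated in the comments, the ORDER BY inside ROW_NUMBER can use whichever columns are convenient; it only has to be present so that one row per key is numbered 1.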

Joona · Accepted Answer · 2023-04-19 20:21:17Z

I found the following MERGE pattern, which answers the question with an emphasis on keeping the source table intact while still handling the duplicates in the target.

The general situation:
- The source table has an id column AA and a value column AB: multiple ids in AA can correspond to a single value in AB
- The target table has a column AB that allows unique values only
=> The source must be deduplicated for the insert (to satisfy the unique constraint on AB), yet it must stay intact for the OUTPUT clause (so that the multiple AA ids sharing one AB value can still be mapped)

To overcome this, the answer combines MERGE with the ranking functions DENSE_RANK and ROW_NUMBER.

Edit 2022-April-08

create table ##tmp (id0 int, id1 int)

merge [Target]
using (
    select dense_rank() over (order by AB) [t],
           row_number() over (partition by AB order by AB) [tt],
           ... ...
) as aa
on 1=0
when not matched and aa.tt=1 then -- row_number() is utilized here to perform the deduplication
    insert... ....
output inserted.id, aa.t into ##tmp;

select *
from (
    ...[source]...
) aa
left join ##tmp t on t.id1 = aa.t -- dense_rank() is utilized here to reach the original AA id values and match them with AB values
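Since the outline above leaves several parts elided, here is one possible concrete reading of it, not the author's exact code. The tables #Source, #Target and #Map, their columns other than AA and AB, and the sample data are all assumptions made for illustration. The sketch also assumes the target does not already contain any of the AB values, because ON 1 = 0 treats every source row as not matched.

```sql
-- Hypothetical source: several AA ids may share one AB value
CREATE TABLE #Source (AA int NOT NULL PRIMARY KEY, AB varchar(20) NOT NULL);
INSERT INTO #Source (AA, AB) VALUES (1, 'x'), (2, 'x'), (3, 'y');

-- Hypothetical target: AB must be unique, id is generated on insert
CREATE TABLE #Target (id int IDENTITY(1,1) PRIMARY KEY, AB varchar(20) NOT NULL UNIQUE);

-- Maps each generated target id to the DENSE_RANK group it was created for
CREATE TABLE #Map (target_id int NOT NULL, grp bigint NOT NULL);

MERGE INTO #Target AS tgt
USING (
    SELECT AA, AB,
           DENSE_RANK() OVER (ORDER BY AB) AS grp,               -- one number per distinct AB
           ROW_NUMBER() OVER (PARTITION BY AB ORDER BY AA) AS rn -- 1 for the first row of each AB
    FROM #Source
) AS src
ON 1 = 0                                  -- never matches, so every source row is a candidate
WHEN NOT MATCHED AND src.rn = 1 THEN      -- insert only one row per AB value
    INSERT (AB) VALUES (src.AB)
OUTPUT inserted.id, src.grp INTO #Map (target_id, grp);

-- Recompute the same DENSE_RANK over the untouched source to attach
-- the generated target id to every original AA id
SELECT s.AA, s.AB, m.target_id
FROM (
    SELECT AA, AB, DENSE_RANK() OVER (ORDER BY AB) AS grp
    FROM #Source
) AS s
LEFT JOIN #Map AS m ON m.grp = s.grp
ORDER BY s.AA;

DROP TABLE #Map;
DROP TABLE #Target;
DROP TABLE #Source;
```

In the final result, AA ids 1 and 2 share one target_id and AA id 3 gets another. MERGE is used here instead of INSERT because its OUTPUT clause can reference source columns such as src.grp, which is what makes the mapping possible.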
