T-SQL Select duplicate rows where at least 1 column has a value

Question

I have a database table as follows:

ProductDetails ----------------- ProductDetailsID int ProductIdentifier VARCHAR (20) ProductID int ProductFile VARCHAR(255) ProductAvailability char(2) RightsCountry varchar(MAX) Deleted bit

There was a bug in the platform recently that allowed a large number of duplicates in. So I could have multiple ProductDetails entries which are the same EXCEPT for the ProductDetailsID (PK) and ProductFile (this is null, for some reason the duplicates didn't insert the files).

I need to write a T-SQL script that finds these duplicates with a view to deleting them (after examination).

I have found this online, which is great. It gives me the ProductIdentifier with several records, and the number of duplicates.

SELECT pd.ProductIdentifier, COUNT(pd.ProductIdentifier) AS NumOccurrences FROM dbo.ProductDetails pd GROUP BY pd.ProductIdentifier HAVING ( COUNT(pd.ProductIdentifier) > 1 )

The thing is, some of these records should remain. I need to select the ProductDetail records that have duplicate ProductIdentifiers, where at least 1 of the duplicates has a FileName and all other columns are exactly the same. For example, if I have a dataset as follows:

ProductDetailsID | ProductIdentifier | ProductID | ProductFile | ProductAvailability | RightsCountry | Deleted 123 | 567890 | 12 | filename.png | 20 | AU CX CC CK HM NZ NU NF TK | 0 124 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0 125 | 567890 | 12 | (NULL) | 20 | AU CX CC CK HM NZ NU NF TK | 0

I need to return ProductDetailsID 124 and 125 as these are for deletion. I'd appreciate any guidance or links to examples or any help at all!

"at least 1 of the duplicates has a FileName" - what if more than 1 of the duplicates has a filename? Will they always match (or, if the rows match on all values except the filename, are they not considered duplicates)? — Damien_The_Unbeliever
– Damien_The_Unbeliever, Commented Apr 22, 2014 at 13:56
Hi @Damien_The_Unbeliever There could be more than 1 of the duplicates with a filename. I don't want to delete (select) any record with a filename. — 109221793
– 109221793, Commented Apr 22, 2014 at 14:08

Damien_The_Unbeliever · Accepted Answer · 2014-04-22 14:25:32Z

This works and is a cheeky way of abusing the partitioned window functions:

declare @ProductDetails table( ProductDetailsID int not null, ProductIdentifier VARCHAR(20) not null, ProductID int not null, ProductFile VARCHAR(255) null, ProductAvailability char(2) not null, RightsCountry varchar(MAX) not null, Deleted bit not null ) insert into @ProductDetails(ProductDetailsID,ProductIdentifier,ProductID, ProductFile,ProductAvailability,RightsCountry,Deleted) values (123,567890,12,'filename.png',20,'AU CX CC CK HM NZ NU NF TK',0), (124,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0), (125,567890,12,NULL,20,'AU CX CC CK HM NZ NU NF TK',0) ;With FillInFileNames as ( select *, MAX(ProductFile) OVER (PARTITION BY ProductIdentifier,ProductID, ProductAvailability,RightsCountry,Deleted) as AnyFileName from @ProductDetails ) select * from FillInFileNames where ProductFile is null and AnyFileName is not null

And the fact that aggregate functions will never return NULL if at least one input value wasn't NULL.

Result:

ProductDetailsID ProductIdentifier ProductID ProductFile ProductAvailability RightsCountry Deleted AnyFileName ---------------- -------------------- ----------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 124 567890 12 NULL 20 AU CX CC CK HM NZ NU NF TK 0 filename.png 125 567890 12 NULL 20 AU CX CC CK HM NZ NU NF TK 0 filename.png

It may also be instructive for the OP to observe that the top of my script isn't much more than the table information and sample data provided in their question - except mine is actually runnable.

It might be worth considering writing your samples in such a style in the future, because that way it can be immediately copy & pasted from your question into a query window.

Thanks Damien, will take your advice onboard for future questions :) Thanks for the good answer !

dean · Accepted Answer · 2014-04-22 14:23:49Z

1

create view rows_to_delete as select * from ( select *, row_number() over(partition by ProductIdentifier order by ProductFile desc, ProductDetailsID) as rn from t ) x where rn > 1 and ProductFile is null

edited Apr 22, 2014 at 14:23

answered Apr 22, 2014 at 14:07

dean

10.1k2 gold badges27 silver badges27 bronze badges

2 Comments

Damien_The_Unbeliever Over a year ago

The OP has since clarified that there may be multiple rows with non-NULL ProductFile values and all of them should be preserved.

dean Over a year ago

Unfortunatelly only after I've posted the answer. OK, added another condition to account for this.

Joseph B · Accepted Answer · 2014-04-22 14:09:39Z

The following query first gets the counts and then filters those records whose fields match with the entry with the fileName.

WITH counts as ( SELECT ProductDetailsID , ProductIdentifier , ProductID , ProductFile , ProductAvailability , RightsCountry , Deleted , COUNT(*) as cnt_files FROM dbo.ProductDetails pd GROUP BY ProductDetailsID , ProductIdentifier , ProductID , ProductFile , ProductAvailability , RightsCountry , Deleted ) SELECT c1.* FROM counts c1 INNER JOIN counts c2 ON c1.ProductIdentifier = c2.ProductIdentifier AND c1.ProductID = c2.ProductID AND c1.ProductFile = c2.ProductFile AND c1.ProductAvailability = c2.ProductAvailability AND c1.RightsCountry = c2.RightsCountry AND c1.Deleted = c2.Deleted WHERE ProductFile is NULL;

Collectives™ on Stack Overflow

T-SQL Select duplicate rows where at least 1 column has a value

3 Answers 3

1 Comment

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

2 Comments

Comments

Related