
I've got a table that has duplicate data that needs to be cleaned up. Consider the following example:

    CREATE TABLE #StackOverFlow
    (
        [ctrc_num]  int,
        [Ctrc_name] varchar(6),
        [docu]      bit,
        [adj]       bit,
        [new]       bit,
        [some_date] datetime
    );

    INSERT INTO #StackOverFlow
        ([ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date])
    VALUES
        (12345, 'John R', null, null, 1, '2023-12-11 09:05:13.003'),
        (12345, 'John R', 1,    null, 0, '2023-12-11 09:05:12.987'),
        (12345, 'John R', null, null, 1, '2023-12-11 09:05:12.947'),
        (56789, 'Sam S',  null, null, 1, '2023-12-11 09:05:13.003'),
        (56789, 'Sam S',  null, null, 1, '2023-12-11 09:05:12.987'),
        (56789, 'Sam S',  1,    null, 0, '2023-12-11 09:05:12.947'),
        (78945, 'Pat P',  null, null, 1, '2023-12-11 09:05:13.003'),
        (78945, 'Pat P',  null, null, 1, '2023-12-11 09:05:12.987'),
        (78945, 'Pat P',  null, null, 1, '2023-12-11 09:05:12.947');

This gives me:

    [ctrc_num]  [Ctrc_name]  [docu]  [adj]  [new]  [some_date]
    -----------------------------------------------------------------------
    12345       John R       NULL    NULL   1      2023-12-11 09:05:13.003
    12345       John R       1       NULL   0      2023-12-11 09:05:12.987
    12345       John R       NULL    NULL   1      2023-12-11 09:05:12.947
    56789       Sam S        NULL    NULL   1      2023-12-11 09:05:13.003
    56789       Sam S        NULL    NULL   1      2023-12-11 09:05:12.987
    56789       Sam S        1       NULL   0      2023-12-11 09:05:12.947
    78945       Pat P        NULL    NULL   1      2023-12-11 09:05:13.003
    78945       Pat P        NULL    NULL   1      2023-12-11 09:05:12.987
    78945       Pat P        NULL    NULL   1      2023-12-11 09:05:12.947

What I need to do is delete the duplicates from the table. If a group contains a record with new = 0, delete the records in that group where new = 1. If all records in a group have new = 1, keep the newest record and delete the older ones.

The result should look like this:

    [ctrc_num]  [Ctrc_name]  [docu]  [adj]  [new]  [some_date]
    -----------------------------------------------------------------------
    12345       John R       1       NULL   0      2023-12-11 09:05:12.987
    56789       Sam S        1       NULL   0      2023-12-11 09:05:12.947
    78945       Pat P        NULL    NULL   1      2023-12-11 09:05:13.003

I've tried ROW_NUMBER:

    ;WITH RankedByDate AS
    (
        SELECT ctrc_num,
               Ctrc_name,
               docu,
               adj,
               new,
               some_date,
               ROW_NUMBER() OVER (PARTITION BY Ctrc_num, Ctrc_name, [docu], [adj], [new]
                                  ORDER BY some_date DESC) AS rNum
        FROM #StackOverFlow
    )
    SELECT *
    FROM RankedByDate

This separates the ones with new = 0, but I still have the ones with new = 1 that are ordered.

Grouping gives me the records that are duplicated, but no way to delete the ones that need to go:

    SELECT [ctrc_num]
          ,[Ctrc_name]
          ,[docu]
          ,[adj]
          ,[new]
    FROM #StackOverFlow
    GROUP BY [ctrc_num]
            ,[Ctrc_name]
            ,[docu]
            ,[adj]
            ,[new]
    HAVING COUNT(*) > 1

  • What constitutes a duplicate? Same [ctrc_num] and [Ctrc_name]? Commented Dec 15, 2023 at 16:09
  • Post the query you've tried, even if not working. Commented Dec 15, 2023 at 16:10
  • There are no duplicate rows since no two rows are equal. Therefore you must specify what you mean by duplicate. Also, which value of the rows not building the duplicate do you want to keep? Commented Dec 15, 2023 at 16:18
  • Unless there can be more than one new = 0, your logic can be summarized as remove all rows partitioned by ctrc_num order by new, some_date desc where row_number > 1. It shouldn't be very hard to come up with sql corresponding to the above. Commented Dec 15, 2023 at 16:21
  • Duplicates are rows with the same [ctrc_num] and [Ctrc_name]. Commented Dec 15, 2023 at 16:21

2 Answers

Break the problem down into its parts:

  1. "If new is 0, delete the records where new is 1"

    delete from #StackOverFlow
    where [new] = 1
      and [ctrc_num] in (select [ctrc_num] from #StackOverFlow where [new] = 0);

  2. "If all records have new = 1 keep the newest record and delete the older ones". Use a CTE to add a row number, partitioned by [ctrc_num] and ordered by date descending, so that the "first" row in each group is the one you want to keep. If a group has only one row, that is the row you want to keep anyway. Then delete everything else.

    ;with cte as
    (
        select [ctrc_num]
              ,ROW_NUMBER() OVER (PARTITION BY [ctrc_num]
                                  ORDER BY [ctrc_num], [some_date] DESC) as rw
        from #StackOverFlow
    )
    DELETE FROM cte
    where rw <> 1;
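
If you want to try the two steps without committing to them, one option is to run both inside a transaction and roll it back after inspecting what is left. This is only a sketch: it assumes #StackOverFlow has been created and populated exactly as in the question, and the ROLLBACK is a safety net I added, not part of the steps above.

```sql
-- Assumes #StackOverFlow exists and holds the sample rows from the question.
BEGIN TRANSACTION;

-- Step 1: remove new = 1 rows wherever the same ctrc_num also has a new = 0 row.
DELETE FROM #StackOverFlow
WHERE [new] = 1
  AND [ctrc_num] IN (SELECT [ctrc_num] FROM #StackOverFlow WHERE [new] = 0);

-- Step 2: of whatever is left, keep only the newest row per ctrc_num.
WITH cte AS
(
    SELECT [ctrc_num],
           ROW_NUMBER() OVER (PARTITION BY [ctrc_num]
                              ORDER BY [ctrc_num], [some_date] DESC) AS rw
    FROM #StackOverFlow
)
DELETE FROM cte
WHERE rw <> 1;

-- Inspect the surviving rows before deciding.
SELECT *
FROM #StackOverFlow
ORDER BY [ctrc_num];

-- Undo both deletes; replace with COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```

With the sample data, the SELECT should show the three rows from the question's expected result.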

6 Comments

This is exactly what I was looking for. I was hoping I would be able to eliminate the duplicate without having to break it into more than one part, but this works.
You can write this as a subquery too; there is no need for a CTE.
@TN - Why? In step 1 I deleted any records where new = 1 if there was a subsequent new = 0. So either there is only a single record per [ctrc_num] and new = 1 OR there is/are 1+ records for a [ctrc_num] where new = 0. Sorting by new only becomes relevant if trying to do both steps at once.
@kool_kris - as you will see from siggemannen's solution, it is possible to do what you want in a single query. But when you are trying to figure out how to do something it is good practice to break it down first. See also "SQL Antipatterns" by Bill Karwin - Chapter 18 "Spaghetti Query" - "Solve a Complex Problem in One Step". You can always merge the "bits" together afterwards - once you have something working. Personally, I'd rather have three simple queries I can follow than one complex one that has me puzzled :-)
@TN - Phew. I did scratch my head for a while though - at least you made me think :-)
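
For the comment about using a subquery instead of a CTE, a rough sketch of what that could look like for step 2 is below. It assumes step 1 has already been run against #StackOverFlow; the alias d is just a name picked for illustration.

```sql
-- Assumes step 1 has already removed the new = 1 rows that have a new = 0 sibling.
-- The derived table d plays the same role as the CTE: it numbers the rows
-- per ctrc_num, newest first, and everything after the first row is deleted.
DELETE d
FROM (
    SELECT [ctrc_num],
           ROW_NUMBER() OVER (PARTITION BY [ctrc_num]
                              ORDER BY [some_date] DESC) AS rw
    FROM #StackOverFlow
) AS d
WHERE d.rw <> 1;
```

The delete still reaches the underlying table because the derived table reads from a single table; whether this reads better than the CTE is a matter of taste.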

It is possible to do what you want in a single query.

    ;with cte as(
        select [ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date]
              ,ROW_NUMBER() over(partition by [ctrc_num]   -- group by [ctrc_num]
                                 order by [new],           -- 0 then 1
                                          [some_date] desc -- newest first
                                ) rn
        from #StackOverFlow
    )
    delete cte
    where rn > 1
    ;

    select * from #StackOverFlow
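
Before running the delete, it may help to preview which rows the same ranking would keep. The sketch below is read-only and assumes #StackOverFlow is populated as in the question. It partitions by both [ctrc_num] and [Ctrc_name], following the asker's definition of a duplicate in the comments (the query above partitions by [ctrc_num] only), and the [verdict] column is a made-up label for illustration.

```sql
-- Read-only preview: nothing is deleted.
-- Rows are ranked per (ctrc_num, Ctrc_name): new = 0 sorts ahead of new = 1,
-- and within the same new value the newest some_date comes first.
SELECT [ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date],
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY [ctrc_num], [Ctrc_name]
                                    ORDER BY [new], [some_date] DESC) = 1
            THEN 'keep'
            ELSE 'delete'
       END AS [verdict]
FROM #StackOverFlow
ORDER BY [ctrc_num], [Ctrc_name], [new], [some_date] DESC;
```

Note that if a group ever holds more than one new = 0 row, this ranking keeps only the newest of them, which may or may not be what you want.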
