SQL Server query to find duplicate records with join query

Question

I am using SQL Server database which has tables x, y and mapping table xy.

Table: x

x_id date text ------------------------------------------- | 1 | 2017-02-22 20:40:30.617 | txt1 | | 2 | 2017-02-22 20:40:06.103 | txt1 | | 3 | 2017-02-22 20:28:21.393 | txt2 |

Table: XY

x_id y_id ----------- | 1 | 3 | 1 | 10 | 2 | 3 | 2 | 10 | 3 | 5

I have a table X with x_id, date, text and mapping table xy with x_id and y_id. I need a query to find which records of x are duplicate. Records of x can be considered as duplicate when it satisfies all of the below conditions

Both has the same text
Date should be in the interval of 5 minutes.
Both should have same y_id's (in XY mapping table)

I am able to write a query to satisfy first two conditions (though with duplicate data). But I wasn't able to write a query to satisfy 3rd condition and to display distinct data when self join is performed.

SqlZim · Accepted Answer · 2017-02-23 18:37:07Z

The easiest way I can think of to satisfy the third criteria is to aggregate the y_ids into one row for comparison.

Using a common table expression (cte) makes this more readable for me, but it can be written without it as well.

test setup: http://rextester.com/APVZQS37775

create table x( x_id int , [date] datetime , text varchar(32) ); insert into x values ( 1,'2017-02-22 20:40:30.617','txt1') ,( 2,'2017-02-22 20:40:06.103','txt1') ,( 3,'2017-02-22 20:28:21.393','txt2'); create table xy ( x_id int , y_id int ); insert into xy values ( 1,3 ) ,( 1,10) ,( 2,3 ) ,( 2,10) ,( 3,5 );

query:

;with cte as ( select x.* , y_ids = stuff(( select ','+convert(varchar(10),xy.y_id) from xy where x.x_id = xy.x_id order by xy.y_id for xml path (''), type).value('.','varchar(max)') ,1,1,'') from x ) select * from cte where exists ( select 1 from cte as i where i.x_id <> cte.x_id and abs(datediff(minute,i.date,cte.date))<=5 and i.text = cte.text and i.y_ids = cte.y_ids )

results:

+------+---------------------+------+-------+ | x_id | date | text | y_ids | +------+---------------------+------+-------+ | 1 | 2017-02-22 20:40:30 | txt1 | 3,10 | | 2 | 2017-02-22 20:40:06 | txt1 | 3,10 | +------+---------------------+------+-------+

A method without aggregating the y_ids:

;with cte as ( select x.* , xy.y_id , cnt = count(*) over (partition by x.x_id) from x inner join xy on x.x_id = xy.x_id ) select x.x_id, x.date, x.text from cte as x inner join cte as x2 on x.x_id <> x2.x_id and x.y_id = x2.y_id and x.text = x2.text and x.cnt = x2.cnt and abs(datediff(minute,x.date,x2.date))<=5 group by x.x_id, x.date, x.text, x.cnt having count(*) = x.cnt

returns:

+------+---------------------+------+ | x_id | date | text | +------+---------------------+------+ | 1 | 2017-02-22 20:40:30 | txt1 | | 2 | 2017-02-22 20:40:06 | txt1 | +------+---------------------+------+

Hey @Sqlzim, I tried with your solution. It seems like it is showing the records even y_ids are different.
@Raghavendra That's because I forgot to add i.y_ids = cte.y_ids to the where. Fixed now.

Nolan Shang · Accepted Answer · 2017-02-22 22:31:46Z

It is another sample, If the x do not have relation data in XY, Does need ignore the condition 3? This sample will ignore the condtion 3.

 DECLARE @x TABLE(x_id int,[date] datetime, text varchar(10)) insert into @x values ( 1,'2017-02-22 20:40:30.617','txt1') ,( 2,'2017-02-22 20:40:06.103','txt1') ,( 3,'2017-02-22 20:28:21.393','txt2') ,( 4,'2017-02-22 20:28:21.393','txt3') ,( 5,'2017-02-22 20:28:21.394','txt3') DECLARE @xy TABLE(x_id int, y_id int) INSERT INTO @xy VALUES ( 1,3 ) ,( 1,10) ,( 2,3 ) ,( 2,10) ,( 3,5 ); SELECT x.*,xy.* FROM @x AS x INNER JOIN @x AS ox ON x.x_id!=ox.x_id AND x.text=ox.text AND ABS(DATEDIFF(MINUTE,x.date,ox.date))<=5 OUTER APPLY( SELECT COUNT(0) AS totaly, SUM(CASE WHEN xy1.y_id+xy2.y_id IS NULL THEN 1 ELSE 0 END) AS NULLROW FROM (SELECT y_id FROM @xy WHERE x_id=x.x_id) AS xy1 FULL JOIN (SELECT y_id FROM @xy WHERE x_id=ox.x_id) AS xy2 ON xy1.y_id=xy2.y_id ) AS xy WHERE (xy.totaly>0 and xy.NULLROW=0) OR (xy.totaly=0)

 x_id date text totaly NULLROW ----------- ----------------------- ---------- ----------- ----------- 1 2017-02-22 20:40:30.617 txt1 2 0 2 2017-02-22 20:40:06.103 txt1 2 0 4 2017-02-22 20:28:21.393 txt3 0 NULL 5 2017-02-22 20:28:21.393 txt3 0 NULL

Collectives™ on Stack Overflow

SQL Server query to find duplicate records with join query

2 Answers 2

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Related