
I have a table with the following schema:

+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| system_one_id | int(11)     | NO   | MUL | NULL    |                |
| system_two_id | int(11)     | NO   | MUL | NULL    |                |
| type          | smallint(6) | NO   |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+

I want to delete duplicates, where "duplicate" is defined as either:

  1. matching values for both system_one_id and system_two_id between two rows, or
  2. "cross matched" values, i.e. row1.system_one_id = row2.system_two_id and row1.system_two_id = row2.system_one_id

Is there a way to delete both kinds of duplicates in one query?

  • I'm using MySQL, but I'd like to be as RDBMS-agnostic as possible. Commented Mar 29, 2015 at 21:34
  • Another important question: if you have 3 duplicate records, which of them do you want to delete? Commented Mar 29, 2015 at 21:40

4 Answers


MySQL supports multi-table deletes, so a straightforward join can be used:

delete t1
from mytable t1
join mytable t2
  on t1.id > t2.id
 and ((t1.system_one_id = t2.system_one_id and t1.system_two_id = t2.system_two_id)
   or (t1.system_one_id = t2.system_two_id and t1.system_two_id = t2.system_one_id))

The join condition t1.id > t2.id prevents rows from joining to themselves and picks the later-added row of each duplicate pair as the one to delete.
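For anyone who wants to try the keep-the-lowest-id rule without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows and the helper name delete_duplicates are made up for illustration. SQLite has no multi-table delete, so the sketch swaps in a correlated EXISTS subquery with the same duplicate condition; as far as I know MySQL itself rejects that form, because the subquery reads from the table being deleted from.

```python
import sqlite3

# Hypothetical sample data: (id, system_one_id, system_two_id, type)
ROWS = [
    (1, 10, 20, 0),
    (2, 10, 20, 0),  # same pair as id 1
    (3, 20, 10, 0),  # cross-matched with id 1
    (4, 10, 30, 0),
    (5, 30, 10, 1),  # cross-matched with id 4
    (6, 2, 2, 0),
    (7, 1, 3, 0),
]


def delete_duplicates(rows):
    """Load rows into an in-memory table, delete duplicates, return surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mytable ("
        " id integer primary key,"
        " system_one_id int not null,"
        " system_two_id int not null,"
        " type smallint not null)"
    )
    conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    # A row is deleted when an earlier row (lower id) holds the same pair,
    # either in the same order or swapped.
    conn.execute(
        """
        delete from mytable
        where exists (
            select 1 from mytable t2
            where t2.id < mytable.id
              and ((mytable.system_one_id = t2.system_one_id
                    and mytable.system_two_id = t2.system_two_id)
                or (mytable.system_one_id = t2.system_two_id
                    and mytable.system_two_id = t2.system_one_id))
        )
        """
    )
    remaining = [r[0] for r in conn.execute("select id from mytable order by id")]
    conn.close()
    return remaining


remaining_ids = delete_duplicates(ROWS)
print(remaining_ids)
```

With these rows, ids 2, 3 and 5 are removed and the lowest id of each group survives, including when a group has three members.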


FYI, in postgres, similar functionality exists, but with different syntax:

delete from mytable t1
using mytable t2
where t1.id > t2.id
  and ((t1.system_one_id = t2.system_one_id and t1.system_two_id = t2.system_two_id)
    or (t1.system_one_id = t2.system_two_id and t1.system_two_id = t2.system_one_id))

1 Comment

@theofabry Yes, you can do something like this in Postgres, but unfortunately the syntax is different (it's non-standard SQL functionality, and each vendor invented its own syntax to express it). See the edit to my answer for the Postgres version.

Here is a statement that (hopefully) selects the ids of all duplicate records; you only need to wrap it in a delete command (that's your part). ;-)

select A.ID
from MYTABLE A
left join MYTABLE B
  on ((A.SYSTEM_ONE_ID = B.SYSTEM_ONE_ID and A.SYSTEM_TWO_ID = B.SYSTEM_TWO_ID)
   or (A.SYSTEM_ONE_ID = B.SYSTEM_TWO_ID and A.SYSTEM_TWO_ID = B.SYSTEM_ONE_ID))
where B.ID is not null
  and A.ID <> B.ID;
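One thing to check before wrapping this select in a delete: because the condition is A.ID <> B.ID, each row of a duplicate pair matches the other, so the result appears to include the row you would want to keep. The sketch below runs the select on SQLite through Python's sqlite3 module to show this. The sample rows and the helper name matching_ids are hypothetical, and swapping <> for > is only my suggestion, not part of the answer.

```python
import sqlite3

# Hypothetical sample data: (id, system_one_id, system_two_id, type)
SAMPLE_ROWS = [
    (1, 10, 20, 0),
    (2, 20, 10, 0),  # cross-matched with id 1
    (3, 10, 30, 0),  # no duplicate
]

# The select from the answer, with the id comparison left as a placeholder.
SELECT_TEMPLATE = """
select A.id
from mytable A
left join mytable B
  on ((A.system_one_id = B.system_one_id and A.system_two_id = B.system_two_id)
   or (A.system_one_id = B.system_two_id and A.system_two_id = B.system_one_id))
where B.id is not null
  and A.id {op} B.id
"""


def matching_ids(op):
    """Return the sorted, de-duplicated ids selected with the given comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mytable (id integer primary key,"
        " system_one_id int, system_two_id int, type int)"
    )
    conn.executemany("insert into mytable values (?, ?, ?, ?)", SAMPLE_ROWS)
    ids = sorted({row[0] for row in conn.execute(SELECT_TEMPLATE.format(op=op))})
    conn.close()
    return ids


both_members = matching_ids("<>")  # the comparison used in the answer
later_only = matching_ids(">")     # suggested variant: only the later row of a pair
print(both_members)
print(later_only)
```

With <> both id 1 and id 2 come back, so deleting by that list would remove every copy; with > only id 2 is selected and id 1 is kept.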

1 Comment

Your statement seems to work, but I can't manage to wrap it in a delete command. I tried DELETE FROM Link C WHERE C.id IN (your_statement) (replacing MYTABLE with Link, of course) and got: ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'C where C.id IN(select A.id from Link A left join Link B on ((A.system_one_id=B.' at line 1

You can group by least and greatest, select the minimum id of each group, and delete the rows with any other id.

delete from mytable
where id not in (
  select * from (
    select min(id)
    from mytable
    group by greatest(system_one_id, system_two_id),
             least(system_one_id, system_two_id)
  ) t1
)
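To see the grouping idea in action, here is a small sketch on SQLite via Python's sqlite3 module. It is an adaptation rather than the query above: SQLite has no least/greatest, so the two-argument scalar min() and max() stand in for them, and the derived-table wrapper is dropped because SQLite does not need it. The sample rows and the helper name keep_min_id_per_pair are hypothetical.

```python
import sqlite3

# Hypothetical sample data: (id, system_one_id, system_two_id, type)
PAIRS = [
    (1, 10, 20, 0),
    (2, 10, 20, 0),  # same pair as id 1
    (3, 20, 10, 0),  # cross-matched with id 1
    (4, 1, 3, 0),    # must not be grouped with id 5
    (5, 2, 2, 0),
]


def keep_min_id_per_pair(rows):
    """Delete every row except the lowest id of each unordered pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table mytable (id integer primary key,"
        " system_one_id int, system_two_id int, type int)"
    )
    conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    # Grouping by (smaller value, larger value) puts a pair and its
    # swapped counterpart into the same group.
    conn.execute(
        """
        delete from mytable
        where id not in (
            select min(id)
            from mytable
            group by min(system_one_id, system_two_id),
                     max(system_one_id, system_two_id)
        )
        """
    )
    kept = [r[0] for r in conn.execute("select id from mytable order by id")]
    conn.close()
    return kept


kept_ids = keep_min_id_per_pair(PAIRS)
print(kept_ids)
```

Rows 4 and 5 both survive because grouping on both the smaller and the larger value keeps (1, 3) and (2, 2) in separate groups.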

3 Comments

I think the logic is wrong: this query treats (system_one_id=1, system_two_id=3) and (system_one_id=2, system_two_id=2) as duplicates, which they are not. Also, I'm not sure whether it is possible to delete from and select from the same table at the same time.
@Farhęg Thanks, I've added least(system_one_id, system_two_id) to the group by to make sure the system ids are always the same. It is possible to delete at the same time if you wrap the subquery in a derived table, as above.
Great, now it's OK I think.

This query keeps the row with the lowest id in each group of duplicates: the inner select returns every row for which no earlier row (t.id > t2.id) has the same or cross-matched system ids, and the delete removes all other rows.

delete from your_table
where id not in (
  select id from (
    select distinct t.id
    from your_table t
    where (
      select count(*)
      from your_table t2
      where t.id > t2.id
        and ((t.system_one_id = t2.system_one_id and t.system_two_id = t2.system_two_id)
          or (t.system_one_id = t2.system_two_id and t.system_two_id = t2.system_one_id))
    ) = 0
  ) tbl
)
