Delete duplicates from two columns

Question

I have a table with the following schema:

+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| system_one_id | int(11)     | NO   | MUL | NULL    |                |
| system_two_id | int(11)     | NO   | MUL | NULL    |                |
| type          | smallint(6) | NO   |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+

I want to delete duplicates, where "duplicate" is defined as either:

matching values for both system_one_id and system_two_id between two rows, or
"cross-matched" values, i.e. row1.system_one_id = row2.system_two_id and row1.system_two_id = row2.system_one_id

Is there a way to delete both kinds of duplicates in one query?

I'm using MySQL, but I'd like to be as RDBMS-agnostic as possible.
– user4083185, Commented Mar 29, 2015 at 21:34
Another important question: if you have three duplicate records, which of them do you want to delete?
– void, Commented Mar 29, 2015 at 21:40

Bohemian · Accepted Answer · 2015-03-29 23:16:49Z

MySQL supports multi-table deletes, so a straightforward join can be used:

    delete t1
    from mytable t1
    join mytable t2
      on t1.id > t2.id
     and ((t1.system_one_id = t2.system_one_id and t1.system_two_id = t2.system_two_id)
       or (t1.system_one_id = t2.system_two_id and t1.system_two_id = t2.system_one_id))

The join condition t1.id > t2.id prevents rows from joining to themselves and makes the later-added row of each duplicate pair the one that is deleted, so the row with the lowest id in each group survives.

FYI, in Postgres, similar functionality exists, but with different syntax:

    delete from mytable t1
    using mytable t2
    where t1.id > t2.id
      and ((t1.system_one_id = t2.system_one_id and t1.system_two_id = t2.system_two_id)
        or (t1.system_one_id = t2.system_two_id and t1.system_two_id = t2.system_one_id))
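Since neither join-style delete is standard SQL, a more portable way to express the same rule is a correlated EXISTS subquery. The following is only a sketch, run against an in-memory SQLite database through Python's sqlite3 module; the sample rows are hypothetical, and the EXISTS rewrite is a swapped-in technique rather than the statement from this answer. MySQL in particular is known to reject a subquery that reads the delete target directly (error 1093), so there the join form or a derived-table wrapper is still needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    " id INTEGER PRIMARY KEY,"
    " system_one_id INTEGER NOT NULL,"
    " system_two_id INTEGER NOT NULL,"
    " type INTEGER NOT NULL)"
)
# Hypothetical sample rows: ids 2 and 3 duplicate id 1 (exact and cross-matched),
# and id 5 is a cross-matched duplicate of id 4.
conn.executemany(
    "INSERT INTO mytable (id, system_one_id, system_two_id, type) VALUES (?, ?, ?, ?)",
    [(1, 10, 20, 0), (2, 10, 20, 0), (3, 20, 10, 0), (4, 30, 40, 0), (5, 40, 30, 0)],
)

# Delete a row when an earlier row (smaller id) holds the same pair in either order.
conn.execute(
    """
    DELETE FROM mytable
    WHERE EXISTS (
        SELECT 1
        FROM mytable AS t2
        WHERE t2.id < mytable.id
          AND ((t2.system_one_id = mytable.system_one_id
                AND t2.system_two_id = mytable.system_two_id)
            OR (t2.system_one_id = mytable.system_two_id
                AND t2.system_two_id = mytable.system_one_id))
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM mytable ORDER BY id")]
print(remaining_ids)  # [1, 4]
```

As with the join versions, the row with the lowest id in each duplicate group is the one that survives.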

@theofabry Yes, you can do something like this in Postgres, but unfortunately the syntax is different (it's non-standard SQL functionality, and each vendor invented its own syntax to express it). See the edit to my answer for the Postgres version.

Udontknow · Accepted Answer · 2015-03-29 21:36:02Z

Here is a statement that (hopefully) selects the ids of all duplicate records; you only need to wrap it in a delete command (that's your part). ;-)

    select A.ID
    from MYTABLE A
    left join MYTABLE B
      on ((A.SYSTEM_ONE_ID = B.SYSTEM_ONE_ID and A.SYSTEM_TWO_ID = B.SYSTEM_TWO_ID)
       or (A.SYSTEM_ONE_ID = B.SYSTEM_TWO_ID and A.SYSTEM_TWO_ID = B.SYSTEM_ONE_ID))
    where B.ID is not null
      and A.ID <> B.ID;

Your statement seems to work, but I can't manage to wrap it in a delete command. I tried DELETE FROM Link C WHERE C.id IN (your_statement) (replacing MYTABLE with Link, of course), and I get this: ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'C where C.id IN(select A.id from Link A left join Link B on ((A.system_one_id=B.' at line 1
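Two things are worth checking before wrapping this select in a delete. First, with A.ID <> B.ID the select returns every member of a duplicate group, including the copy you presumably want to keep. Second, the 1064 error appears to come from the alias C on the delete target, which older MySQL versions do not accept in a single-table DELETE. The sketch below uses an in-memory SQLite database through Python's sqlite3 module with hypothetical sample rows. It swaps <> for > so that only later copies are selected, adds DISTINCT, drops the alias on the delete target, and wraps the select in a derived table (named dup here, an assumed name), which is the usual workaround for MySQL refusing to read the delete target in a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Link ("
    " id INTEGER PRIMARY KEY,"
    " system_one_id INTEGER NOT NULL,"
    " system_two_id INTEGER NOT NULL)"
)
# Hypothetical sample rows: ids 1, 2 and 3 hold the same pair; id 4 is unique.
conn.executemany(
    "INSERT INTO Link (id, system_one_id, system_two_id) VALUES (?, ?, ?)",
    [(1, 10, 20), (2, 10, 20), (3, 20, 10), (4, 30, 40)],
)

# The select from the answer, with the id comparison left as a placeholder.
select_ids = """
    SELECT DISTINCT A.id AS id
    FROM Link A
    LEFT JOIN Link B
      ON ((A.system_one_id = B.system_one_id AND A.system_two_id = B.system_two_id)
       OR (A.system_one_id = B.system_two_id AND A.system_two_id = B.system_one_id))
    WHERE B.id IS NOT NULL AND A.id {cmp} B.id
"""

# As posted (<>): every member of a duplicate group is returned.
all_members = sorted(r[0] for r in conn.execute(select_ids.format(cmp="<>")))
# Variation (>): only rows that have an earlier duplicate are returned.
later_copies = sorted(r[0] for r in conn.execute(select_ids.format(cmp=">")))

# No alias on the delete target; the select is wrapped in a derived table.
conn.execute(
    "DELETE FROM Link WHERE id IN (SELECT id FROM ("
    + select_ids.format(cmp=">")
    + ") AS dup)"
)
remaining = [r[0] for r in conn.execute("SELECT id FROM Link ORDER BY id")]

print(all_members)   # [1, 2, 3]
print(later_copies)  # [2, 3]
print(remaining)     # [1, 4]
```

Deleting the ids from the unmodified select would have removed rows 1, 2 and 3 together, leaving no copy of that pair.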

FuzzyTree · Accepted Answer · 2015-03-29 22:05:19Z

You can group by least and greatest to select the minimum id of each group, then delete the rows with any other id.

    delete from mytable
    where id not in (
        select * from (
            select min(id)
            from mytable
            group by greatest(system_one_id, system_two_id),
                     least(system_one_id, system_two_id)
        ) t1
    )
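To see the grouping at work, here is a small sketch against an in-memory SQLite database using Python's sqlite3 module. SQLite has no greatest/least, so the sketch substitutes its multi-argument scalar max() and min(), which play the same role; the sample rows and the keep_id column alias are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    " id INTEGER PRIMARY KEY,"
    " system_one_id INTEGER NOT NULL,"
    " system_two_id INTEGER NOT NULL)"
)
# Hypothetical sample rows: ids 1, 2 and 4 hold the pair {1, 3} in some order;
# ids 3 and 5 are unique pairs.
conn.executemany(
    "INSERT INTO mytable (id, system_one_id, system_two_id) VALUES (?, ?, ?)",
    [(1, 1, 3), (2, 3, 1), (3, 2, 2), (4, 1, 3), (5, 3, 3)],
)

# Group on the ordered pair (larger value, smaller value) so that (1, 3) and
# (3, 1) land in the same group, keep the lowest id per group, delete the rest.
conn.execute(
    """
    DELETE FROM mytable
    WHERE id NOT IN (
        SELECT keep_id FROM (
            SELECT min(id) AS keep_id
            FROM mytable
            GROUP BY max(system_one_id, system_two_id),
                     min(system_one_id, system_two_id)
        ) AS t1
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM mytable ORDER BY id")]
print(remaining_ids)  # [1, 3, 5]
```

Grouping on both the larger and the smaller value is what keeps unrelated pairs such as (2, 2) and (3, 3) in groups of their own.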

edited Mar 29, 2015 at 22:05

answered Mar 29, 2015 at 21:38

void Over a year ago

I think the logic is wrong: this query treats (system_one_id=1, system_two_id=3) and (system_one_id=2, system_two_id=2) as duplicates, which they are not. Also, I'm not sure whether it is possible to delete from and select from the same table at the same time.

FuzzyTree Over a year ago

@Farhęg Thanks, I've added least(system_one_id, system_two_id) to the group by to make sure the system ids are always the same. It is possible to delete and select at the same time if you wrap the subquery in a derived table, as above.

void Over a year ago

Great, now it's OK I think.

void · Accepted Answer · 2015-03-29 22:12:50Z

This query keeps every row that has no earlier duplicate, that is, no row with a smaller id (t.id > t2.id) and the same pair of system ids in either order, and deletes everything else:

    delete from your_table
    where id not in (
        select id from (
            select distinct t.id
            from your_table t
            where (
                select count(*)
                from your_table t2
                where t.id > t2.id
                  and ((t.system_one_id = t2.system_one_id and t.system_two_id = t2.system_two_id)
                    or (t.system_one_id = t2.system_two_id and t.system_two_id = t2.system_one_id))
            ) = 0
        ) tbl
    )
