We have a table business_users with a user_id and business_id and we have duplicates. How can I write a query that will delete all duplicates except for one?
2 Answers
Completely identical rows
If you want to avoid completely identical rows, as I understood your question at first, then you can select unique rows to a separate table and recreate the table data from that.
CREATE TEMPORARY TABLE tmp SELECT DISTINCT * FROM business_users; DELETE FROM business_users; INSERT INTO business_users SELECT * FROM tmp; DROP TABLE tmp; Be careful if there are any foreign key constraints referencing this table, though, as the temporary deletion of rows might lead to cascaded deletions elsewhere.
Introducing a unique constraint
If you only care about pairs of user_id and business_id, you probably want to avoid introducing duplicates in the future. You can move the existing data to a temporary table, add a constraint, and then move the table data back, ignoring duplicates.
CREATE TEMPORARY TABLE tmp SELECT * FROM business_users; DELETE FROM business_users; ALTER TABLE business_users ADD UNIQUE (user_id, business_id); INSERT IGNORE INTO business_users SELECT * FROM tmp; DROP TABLE tmp; The above answer is based on this answer. The warning about foreign keys applies just as it did in the section above.
One-shot removal
If you only want to execute a single query, without modifying the table structure in any way, and you have a primary key id identifying each row, then you can try the following:
DELETE FROM business_users WHERE id NOT IN (SELECT MIN(id) FROM business_users GROUP BY user_id, business_id); A similar idea was previously suggested by this answer.
If the above request fails, because you are not allowed to read and delete from a table in the same step, you can again use a temporary table:
CREATE TEMPORARY TABLE tmp SELECT MIN(id) id FROM business_users GROUP BY user_id, business_id; DELETE FROM business_users WHERE id NOT IN (SELECT id FROM tmp); DROP TABLE tmp; If you want to, you can still introduce a uniqueness constraint after cleaning the data in this fashion. To do so, execute the ALTER TABLE line from the previous section.
4 Comments
SELECT MIN(id) FROM and the second one have SELECT MIN(id) id FROM (the second has two 'id's)?MIN(id) id in the second is an abbreviation of MIN(id) AS id: it specifies the name of the column, so that the column in the resulting table isn't literally named MIN(id) which would be quite confusing and hard to type. In the first query, the name of the column does not matter, since the subquery is just used as a set.Since you have a primary key, you can use that to pick which rows to keep:
delete from business_users where id not in ( select id from ( select min(id) as id -- Make a list of the primary keys to keep from business_users group by user_id, business_id -- Group by your duplicated row definition ) as a -- Derived table to force an implicit temp table ); In this way, you won't need to create/drop temp tables and such (except the implicit one).
You might want to put a unique constraint on user_id, business_id so you don't have to worry about this again.
4 Comments
business_users as a temporary table as well, for testing. In that case, the error is phrased Can't reopen table: 'business_users' which amounts to pretty much the same problem (at least in my eyes), but cannot be avoided by introducing yet another subquery.
user_idandbusiness_idthe only columns, such that entire rows are duplicated?idas the primary key