Find and Delete Duplicate rows in MySQL

Question

I'm having trouble finding duplicates in a database table with the following setup:

========================================================================== | stock_id | product_id | store_id | stock_qty | updated_at | ========================================================================== | 9990 | 51 | 1 | 13 | 2014-10-25 16:30:01 | | 9991 | 90 | 2 | 5 | 2014-10-25 16:30:01 | | 9992 | 161 | 1 | 3 | 2014-10-25 16:30:01 | | 9993 | 254 | 1 | 18 | 2014-10-25 16:30:01 | | 9994 | 284 | 2 | 12 | 2014-10-25 16:30:01 | | 9995 | 51 | 1 | 11 | 2014-10-25 17:30:02 | | 9996 | 90 | 2 | 5 | 2014-10-25 17:30:02 | | 9997 | 161 | 1 | 3 | 2014-10-25 17:30:02 | | 9998 | 254 | 1 | 16 | 2014-10-25 17:30:02 | | 9999 | 284 | 2 | 12 | 2014-10-25 17:30:02 | ==========================================================================

Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:

SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at` FROM `stock` s INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id` GROUP BY `stock_id` HAVING COUNT(*) > 1 ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;

While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.

Any help or ideas are welcome. Thanks

hi @jbutler483, great question, the system we're working in has an ORM that requires the existence of a single primary key. We really on the ORM in other ways for scaffolding the CRUD management via the same system. — Herbage Onion
– Herbage Onion, Commented Nov 27, 2014 at 16:04
Plus the database will keep the oldest data, which is not the required functionality :) — Artjoman
– Artjoman, Commented Nov 27, 2014 at 16:05
@HerbageOnion I can't see why it matters that there's a single PK - as long as you can have a unique key constraint on (product_id,store_id). Then INSERT... ON DUPLICATE KEY UPDATE will solve the problem — Strawberry
– Strawberry, Commented Nov 27, 2014 at 17:06

Kruti Patel · Accepted Answer · 2016-08-03 12:21:26Z

You can use this query:

DELETE st FROM stock st, stock st2 WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND st.store_id = st2.store_id;

This query will delete older record having same product_id and store_id and will keep latest record.

Hartmut Holzgraefe · Accepted Answer · 2014-11-27 16:34:18Z

A self join on store_id, product_id and 'is older' in combination with DISTINCT should give you all rows where also a newer version exists:

> SHOW CREATE TABLE stock; CREATE TABLE `stock` ( `stock_id` int(11) NOT NULL, `product_id` int(11) DEFAULT NULL, `store_id` int(11) DEFAULT NULL, `stock_qty` int(11) DEFAULT NULL, `updated_at` datetime DEFAULT NULL, PRIMARY KEY (`stock_id`) > select * from stock; +----------+------------+----------+-----------+---------------------+ | stock_id | product_id | store_id | stock_qty | updated_at | +----------+------------+----------+-----------+---------------------+ | 1 | 1 | 1 | 1 | 2001-01-01 12:00:00 | | 2 | 2 | 2 | 1 | 2001-01-01 12:00:00 | | 3 | 2 | 2 | 1 | 2002-01-01 12:00:00 | +----------+------------+----------+-----------+---------------------+ > SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at FROM stock s1 JOIN stock s2 ON s1.store_id = s2.store_id AND s1.product_id = s2.product_id AND s1.updated_at < s2.updated_at; +----------+----------+------------+---------------------+ | stock_id | store_id | product_id | updated_at | +----------+----------+------------+---------------------+ | 2 | 2 | 2 | 2001-01-01 12:00:00 | +----------+----------+------------+---------------------+ > DELETE stock FROM stock JOIN stock s2 ON stock.store_id = s2.store_id AND stock.product_id = s2.product_id AND stock.updated_at < s2.updated_at; Query OK, 1 row affected (0.02 sec) > select * from stock; +----------+------------+----------+-----------+---------------------+ | stock_id | product_id | store_id | stock_qty | updated_at | +----------+------------+----------+-----------+---------------------+ | 1 | 1 | 1 | 1 | 2001-01-01 12:00:00 | | 3 | 2 | 2 | 1 | 2002-01-01 12:00:00 | +----------+------------+----------+-----------+---------------------+

Which can be combined with DELETE by only selecting s1.stock_id and combining that query with DELETE as ` DELETE FROM stock WHERE stock_id IN (SELECT DISTINCT s1.stock_id FROM ...) `
Thanks for your help Hartmut, did you mean to have s1.store_id = s2.store_id instead of s1.stock_id = s2.stock_id? I've tried that and I don't get any results...
Just to be clear, there are no results but also no errors. Thanks for your help so far.
I tried it again, I think the sq is another typo, so I used s1, still no results though

Artjoman · Accepted Answer · 2014-11-27 17:03:04Z

Or you can use a stored Procedure:

DELIMITER // DROP PROCEDURE IF EXISTS removeDuplicates; CREATE PROCEDURE removeDuplicates( stockID INT ) BEGIN DECLARE stockToKeep INT; DECLARE storeID INT; DECLARE productID INT; -- gets the store and product value SELECT DISTINCT store_id, product_id FROM stock WHERE stock_id = stockID LIMIT 1 INTO storeID, productID; SELECT stock_id FROM stock WHERE product_id = productID AND store_id = storeID ORDER BY updated_at DESC LIMIT 1 INTO stockToKeep; DELETE FROM stock WHERE product_id = productID AND store_id = storeID AND stock_id != stockToKeep; END // DELIMITER ;

And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

 OPEN cur; testLoop: LOOP FETCH cur INTO stockID; IF done THEN LEAVE testLoop; END IF; CALL removeDuplicates(stockID); END LOOP testLoop; CLOSE cur; END// DELIMITER ;

And then just call the second procedure

CALL updateTable();

Hi thanks for the detailed answer, I have to be honest, this scares the crap out of me, I don't doubt you know your stuff and that's probably one hell of a solution but I need to be able to maintain the solution I implement and I don't think I could with that :( Thanks very much for your time, I'll be deconstructing your solution to learn more about stored procedures!! :)

Collectives™ on Stack Overflow

Find and Delete Duplicate rows in MySQL

3 Answers 3

Comments

9 Comments

1 Comment

Linked

Hot Network Questions