89

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:

WITH TempEmp AS (
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY name, address, zipcode ORDER BY name) AS duplicateRecCount
    FROM mytable
)
-- Now delete duplicate records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;

But it only works in SQL Server, not in Netezza. It would seem that Netezza does not accept a DELETE after the WITH clause?

5
  • If it's a one time job - why wouldn't you run it in postgresql console? Commented Nov 6, 2014 at 0:02
  • No, it is not a one-time job; it runs weekly and we always get some duplicate values. Thanks Commented Nov 6, 2014 at 0:06
  • 4
    why do you get duplicate values? What if you just don't put them there in the first place? Commented Nov 6, 2014 at 3:02
  • Are duplicates defined by the columns (name, address, zipcode)? Are there other columns? Are those irrelevant? Different? Is any combination of columns unique? If some columns differ between duplicates, which row out of each set do you want to keep? Commented Nov 6, 2014 at 6:13
  • 1
    WORKS FOR POSTGRESQL (ALSO WORKS IN AWS REDSHIFT) View the answer to this question on another page Commented Aug 10, 2017 at 9:02

10 Answers

96

I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING keyword:

DELETE FROM table_with_dups T1
    USING table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- delete the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;

If you want to review the records before deleting them, simply replace DELETE with SELECT * and USING with a comma (,), i.e.:

SELECT *
FROM   table_with_dups T1, table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- select the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.

If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.

Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.

 AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]') 

9 Comments

Erwin's answer is better because it handles NULL values correctly and does not require typing in the column names twice.
As I've written in the beginning of my answer: I like @erwin-brandstetter 's solution, but wanted to show a solution .... Upon finding the performance benefits though, I like the USING solution better, especially for large tables. I added an example that shows how to deal with NULL values.
Very nice, especially the possibility to have a look first. To check for NULL values in the data columns, I generated a T1.col = T2.col OR (T1.col IS NULL AND T2.col IS NULL) criterion for each column, based on the \dS output of my table. Now I can add my primary key constraint.
Thanks, this proved much faster than other solutions. I gave up after 1 hour for some of the versions out there, this was done almost instantly
Helpful solution for me as I could visually check the delete list prior to execution.
70

If you have no other unique identifier, you can use ctid:

delete from mytable
where exists (select 1
              from mytable t2
              where t2.name = mytable.name and
                    t2.address = mytable.address and
                    t2.zip = mytable.zip and
                    t2.ctid > mytable.ctid
             );

It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete like this is one important reason why.
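The EXISTS pattern above can't be run here against Postgres directly, but a minimal sketch of the same logic works in SQLite, whose implicit rowid plays the role that ctid plays in Postgres (table name and sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (name TEXT, address TEXT, zip TEXT);
INSERT INTO mytable VALUES
  ('James', 'main street', '123'),
  ('James', 'main street', '123'),
  ('Alice', 'union square', '456');
""")
# Same EXISTS-based delete, with rowid standing in for ctid;
# keeps the duplicate with the highest rowid in each group.
conn.execute("""
DELETE FROM mytable
WHERE EXISTS (SELECT 1 FROM mytable t2
              WHERE t2.name = mytable.name
                AND t2.address = mytable.address
                AND t2.zip = mytable.zip
                AND t2.rowid > mytable.rowid)
""")
rows = conn.execute("SELECT name FROM mytable ORDER BY name").fetchall()
print(rows)  # [('Alice',), ('James',)] -- one row per distinct triple
```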

10 Comments

I don't have any field called ctid in my table. Can you explain where you got this? Thanks
ctid is a hidden field. It does not show up when you retrieve the table definition. It is a kind of internal row number.
where not exists will delete the rows without duplicates. Should be where exists (select 1 ...)
@GordonLinoff - Thanks for clarifying. I know that it's off-topic; that's what OT: stands for in the prefix of my question ;)
In my small table I did: select ctid, * from table. ctid was represented as (0,1), (0,2), etc. So I was able to do a simple delete statement for the duplicate row: delete from table where ctid = '(0,1)'
47

In a perfect world, every table has a unique identifier of some sort.
In the absence of any unique column (or combination thereof), use the ctid column:

DELETE FROM tbl
WHERE  ctid NOT IN (
   SELECT min(ctid)                    -- ctid is NOT NULL by definition
   FROM   tbl
   GROUP  BY name, address, zipcode);  -- list columns defining duplicates

Disclaimer:

ctid is an implementation detail of Postgres, it's not in the SQL standard and can change between major versions without warning (even if that's very unlikely). Its values can change between commands due to background processes or concurrent write operations (but not within the same command).

Careful with table inheritance or partitioning. Then there can be multiple physical tables involved, and ctid is not unique within that scope. You might use the keyword ONLY (available for SELECT, UPDATE, and DELETE) to prevent descending down the hierarchy, or additionally involve the tableoid. But that depends on what you want to achieve exactly.

The above query is short, conveniently listing column names only once. NOT IN (SELECT ...) is a tricky query style when NULL values can be involved, but the system column ctid is never NULL.

Using EXISTS as demonstrated by @Gordon is typically faster. So is a self-join with the USING clause like @isapir added later. Both should result in the same query plan.

Important difference: These other queries treat NULL values as not equal, while GROUP BY (or DISTINCT or DISTINCT ON ()) treats NULL values as equal. Does not matter for columns defined NOT NULL. Else, depending on your definition of "duplicate", you'll need one approach or the other. Or use IS NOT DISTINCT FROM to compare values (which may exclude some indexes).
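This NULL-handling difference can be demonstrated in a few lines. A minimal SQLite sketch (rowid standing in for ctid; SQLite's null-safe IS operator standing in for Postgres's IS NOT DISTINCT FROM; table and columns invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (name TEXT, zipcode TEXT);
INSERT INTO tbl VALUES ('James', NULL), ('James', NULL);
""")
# GROUP BY collapses the two NULL rows into one group:
groups = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM tbl GROUP BY name, zipcode)"
).fetchone()[0]
# A plain equality self-join does NOT see them as duplicates
# (NULL = NULL is not true):
eq_pairs = conn.execute(
    "SELECT COUNT(*) FROM tbl t1 JOIN tbl t2 "
    "ON t1.name = t2.name AND t1.zipcode = t2.zipcode "
    "AND t1.rowid < t2.rowid"
).fetchone()[0]
# The null-safe comparison agrees with GROUP BY:
nullsafe_pairs = conn.execute(
    "SELECT COUNT(*) FROM tbl t1 JOIN tbl t2 "
    "ON t1.name IS t2.name AND t1.zipcode IS t2.zipcode "
    "AND t1.rowid < t2.rowid"
).fetchone()[0]
print(groups, eq_pairs, nullsafe_pairs)  # 1 0 1
```

So with NULLs in a key column, the EXISTS/USING variants would leave both rows in place, while the GROUP BY variant deletes one of them.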


Aside:

The target of a DELETE statement cannot be the CTE, only the underlying table. That's a spillover from SQL Server - as is your whole approach.

3 Comments

I like this solution because it's very concise. Any thoughts about the performance of the solution that I posted below? stackoverflow.com/a/46775289/968244
I was actually able to test it. I have a table with about 350k rows and it had 39 duplicates over 7 columns with no indices. I tried the GROUP BY solution first and it was taking over 30 seconds so I killed it. I then tried the USING solution and it completed in about 16 seconds.
@isapir: Like I mentioned back in 2014: NOT IN is conveniently short syntax, but EXISTS is faster. (Same as your completely valid query with the USING clause.) But there is a subtle difference. I added a note above.
11

Here is what I came up with, using a GROUP BY:

DELETE FROM mytable
WHERE id NOT IN (
    SELECT MIN(id)
    FROM mytable
    GROUP BY name, address, zipcode
);

It deletes the duplicates, preserving the oldest record that has duplicates.

3 Comments

I don't have an id in my table; this is a Netezza database, they don't have auto-increment numbers like SQL Server
does it have another column that uniquely identifies rows?
The HAVING clause is noise for this query. The count for every existing id is >= 1 in any case. You can remove it.
9

We can use a window function for very effective removal of duplicate rows:

DELETE FROM tab
WHERE id IN (SELECT id
             FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id
                   FROM tab) x
             WHERE x.row_number > 1);

A PostgreSQL-optimized version (using ctid):

DELETE FROM tab
WHERE ctid = ANY(ARRAY(SELECT ctid
                       FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid
                             FROM tab) x
                       WHERE x.row_number > 1));

Comments

3

The valid syntax is specified at http://www.postgresql.org/docs/current/static/sql-delete.html

I would ALTER your table to add a unique auto-incrementing primary key id, so that you can run a query like the following, which keeps the first of each set of duplicates (i.e. the one with the lowest id). Note that adding the key is a bit more complicated in Postgres than in some other DBs.

DELETE FROM mytable d
USING (
    SELECT min(id) AS id, name, address, zip
    FROM mytable
    GROUP BY name, address, zip
    HAVING COUNT(*) > 1
) AS k
WHERE d.id <> k.id
  AND d.name = k.name
  AND d.address = k.address
  AND d.zip = k.zip;

Comments

2

To remove duplicates (keep only one entry) from a table "tab" where data looks like this:

fk_id_1 fk_id_2
12 32
12 32
12 32
15 37
15 37

You can do this:

DELETE FROM tab
WHERE ctid IN (SELECT ctid
               FROM (SELECT ctid, fk_id_1, fk_id_2,
                            row_number() OVER (PARTITION BY fk_id_1, fk_id_2 ORDER BY fk_id_1) AS rnum
                     FROM tab) t
               WHERE t.rnum > 1);

Where ctid is the physical location of the row within its table (therefore, a row identifier) and row_number is a window function that assigns a sequential integer to each row in a result set.

PARTITION groups the result set and the sequential integer is restarted for every group.
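The row_number() pattern above can be exercised end to end in SQLite (3.25+ for window functions), with rowid standing in for ctid and the same sample data as in the answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab (fk_id_1 INTEGER, fk_id_2 INTEGER);
INSERT INTO tab VALUES (12,32),(12,32),(12,32),(15,37),(15,37);
""")
# Number the rows within each (fk_id_1, fk_id_2) group and delete
# every row whose row number is greater than 1:
conn.execute("""
DELETE FROM tab WHERE rowid IN (
  SELECT rid FROM (
    SELECT rowid AS rid,
           row_number() OVER (PARTITION BY fk_id_1, fk_id_2) AS rnum
    FROM tab) t
  WHERE t.rnum > 1)
""")
remaining = conn.execute(
    "SELECT fk_id_1, fk_id_2 FROM tab ORDER BY 1"
).fetchall()
print(remaining)  # [(12, 32), (15, 37)]
```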

Comments

1

If you want a unique identifier for every row, you could just add one (a serial, or a guid), and treat it like a surrogate key.


CREATE TABLE thenames
    ( name text NOT NULL
    , address text NOT NULL
    , zipcode text NOT NULL
    );

INSERT INTO thenames(name, address, zipcode) VALUES
  ('James', 'main street', '123')
, ('James', 'main street', '123')
, ('James', 'void street', '456')
, ('Alice', 'union square', '123')
;

SELECT * FROM thenames;

-- add a surrogate key
ALTER TABLE thenames
    ADD COLUMN seq serial NOT NULL PRIMARY KEY;

SELECT * FROM thenames;

DELETE FROM thenames del
WHERE EXISTS (SELECT * FROM thenames x
              WHERE x.name = del.name
                AND x.address = del.address
                AND x.zipcode = del.zipcode
                AND x.seq < del.seq
             );

-- add the unique constraint, so that new duplicates cannot be created in the future
ALTER TABLE thenames
    ADD UNIQUE (name, address, zipcode);

SELECT * FROM thenames;

2 Comments

Netezza doesn't support primary or unique key constraints
No, it doesn't.
0

From the documentation on deleting duplicate rows:

A frequent question in IRC is how to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID. This query does that for all rows of tablename having the same column1, column2, and column3.

DELETE FROM tablename
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS rnum
                   FROM tablename) t
             WHERE t.rnum > 1);

Sometimes a timestamp field is used instead of an ID field.

Comments

0

For smaller tables, we can use the rowid pseudo column to delete duplicate rows.

You can use the query below:

DELETE FROM table1 t1
WHERE t1.rowid > (SELECT min(t2.rowid)
                  FROM table1 t2
                  WHERE t1.column = t2.column);

Comments
