PostgreSQL Removing duplicates

Question

I am working on a Postgres query to remove duplicates from a table. The table below is generated dynamically, and I want to write a select query that removes a record if its value in the first column is duplicated.

The table looks something like this

1st col   2nd col
4         62
6         34
5         26
5         12

I want to write a select query that removes either row 3 or row 4.

@Hack-R I can take the count, but how can I remove the row? Sorry if this is a stupid question.
– Uasthana, Commented Oct 8, 2016 at 4:41

Zegarek · Accepted Answer · 2022-11-30 10:08:18Z

10

There is no need for an intermediate table:

delete from df1 where ctid not in (select min(ctid) from df1 group by first_column);

If you are deleting many rows from a large table, the approach with an intermediate table is probably faster.
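
As a rough illustration of the delete above, here is a sketch that loads the four rows from the question into df1 and runs the statement inside a transaction, so the result can be checked before it becomes permanent. The column names first_column and second_column and the sample values are assumptions pieced together from the question and the query above, not a confirmed schema. Note that ctid describes the physical location of a row, so which of the duplicates survives is effectively arbitrary.

```sql
-- Hypothetical sample data mirroring the table in the question.
create table df1 (first_column int, second_column int);
insert into df1 values (4, 62), (6, 34), (5, 26), (5, 12);

begin;

-- Keep one physical row per first_column value and delete the others.
delete from df1
where ctid not in (select min(ctid) from df1 group by first_column);

-- Inspect what is left: there should be exactly one row per first_column.
select * from df1 order by first_column;

commit;  -- or "rollback;" if the remaining rows are not the ones you expected
```

Running the delete inside begin/commit is only a safety habit for trying this out; the statement itself does not need it.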

If you just want to get unique values for one column, you can use:

select distinct on (first_column) * from the_table order by first_column;

Or simply

select first_column, min(second_column) from the_table group by first_column;
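
One detail worth spelling out: distinct on keeps the first row of each group as determined by the order by, so adding a second sort key decides which duplicate is returned. The sketch below assumes a table the_table with columns first_column and second_column, filled with the values from the question; the table definition is an assumption for illustration only.

```sql
-- Hypothetical sample data mirroring the table in the question.
create table the_table (first_column int, second_column int);
insert into the_table values (4, 62), (6, 34), (5, 26), (5, 12);

-- Per first_column value, return the row with the highest second_column.
select distinct on (first_column) *
from the_table
order by first_column, second_column desc;

-- Result:
--  first_column | second_column
--  4            | 62
--  5            | 26
--  6            | 34
```

Without the second sort key, which of the two rows with first_column = 5 comes back is not guaranteed. The group by variant with min(second_column) always returns 12 for that group.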

edited Nov 30, 2022 at 10:08

Zegarek

answered Oct 8, 2016 at 6:51

user330315

4 Comments

user330315 Over a year ago

@Uasthana: hmm, you said "to remove duplicates from a table".

Alien Life Form Over a year ago

This will delete rows 1 and 2 and 4... I think he just wants to delete row 4.

Mariano Anaya Over a year ago

Shouldn't it be having count(*) >= 1? As it is now, it will also delete non-duplicated records (those with only one instance).

poshest Over a year ago

Yes, it should be having count(*) >= 1 @MarianoAnaya. Better is to just remove the having altogether. I nearly deleted rows that I needed with it. Please remove the having clause, @a_horse.

Hack-R · Accepted Answer · 2016-10-08 04:58:13Z

 select count(first) as cnt, first, min(second) as second from df1 group by first having count(first) = 1;

if you want to keep one of the rows (sorry, I initially missed it if you wanted that):

 select first, min(second) from df1 group by first

Where the table's name is df1 and the columns are named first and second.

You can actually leave off the count(first) as cnt if you want.

At the risk of stating the obvious, once you know how to select the data you want (or don't want), deleting the records in any of a dozen ways is simple.

If you want to replace the table or make a new table you can just use create table as for the deletion:

 create table tmp as
   select count(first) as cnt, first, min(second) as second
   from df1 group by first having count(first) = 1;
 drop table df1;
 create table df1 as select * from tmp;

or using DELETE FROM:

DELETE FROM df1 WHERE first NOT IN (SELECT first FROM tmp);

You could also use select into, etc, etc.

See the comment above: "how can I remove the row?". The question also says "remove duplicates from a table".

wildplasser · Accepted Answer · 2016-10-08 12:27:46Z

if you want to SELECT unique rows:

SELECT *
FROM ztable u
WHERE NOT EXISTS (          -- There is no other record
    SELECT *
    FROM ztable x
    WHERE x.id = u.id       -- with the same id
      AND x.ctid < u.ctid   -- but with a different (lower) "internal" rowid
);                          -- so u.* must be unique

if you want to SELECT the other rows, which were suppressed in the previous query:

SELECT *
FROM ztable nu
WHERE EXISTS (              -- another record exists
    SELECT *
    FROM ztable x
    WHERE x.id = nu.id      -- with the same id
      AND x.ctid < nu.ctid  -- but with a different (lower) "internal" rowid
);

if you want to DELETE records, making the table unique (but keeping one record per id):

DELETE FROM ztable d
WHERE EXISTS (              -- another record exists
    SELECT *
    FROM ztable x
    WHERE x.id = d.id       -- with the same id
      AND x.ctid < d.ctid   -- but with a different (lower) "internal" rowid
);

Just out of curiosity: if rows need to be preserved based on further conditions, rather than deleted arbitrarily (which I believe is what ctid does), then the ctid approach shouldn't be used, right? I mean that it's not stable in the long run.
ctid is used as a last resort, if no other columns are available to discriminate between the various candidates for deletion or selection. (Other DBMSs have similar pseudo-columns, with different names.) In this particular case second_col could have been used, keeping only the lowest (or highest).
There is nothing wrong with my answer(s). There might be something wrong with (the way you pose) your question, though.
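
The suggestion in the comments, to pick the surviving row by a real column and fall back on ctid only for exact ties, could be sketched roughly as follows. It assumes ztable has a column named second_col next to id; that column name comes from the comment, not from a confirmed schema. The sketch keeps the row with the lowest second_col per id.

```sql
-- Assumed schema: ztable(id, second_col). Keeps the lowest second_col per id.
DELETE FROM ztable d
WHERE EXISTS (
    SELECT 1
    FROM ztable x
    WHERE x.id = d.id
      AND (
            x.second_col < d.second_col       -- a "better" row exists for this id
         OR (x.second_col = d.second_col      -- exact tie on the data column:
             AND x.ctid < d.ctid)             -- fall back on ctid as last resort
      )
);
```

Flipping the first comparison to > would keep the highest second_col instead. Rows where second_col is NULL never satisfy either comparison, so they are left untouched by this sketch.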

Uasthana · Accepted Answer · 2016-10-08 06:33:13Z

So basically I did this

 create temp table t1 as select first, min(second) as second from df1 group by first;
 select * from df1 inner join t1 on t1.first = df1.first and t1.second = df1.second;

It's a satisfactory answer. Thanks for your help, @Hack-R.
