More explanation of the concepts used and how the query works.

edited Apr 9, 2021 at 15:39

Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:

    DELETE FROM dups
    USING (
        SELECT
            ctid,
            (
                ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])
            ) AS is_duplicate
        FROM dups
    ) dups_find_duplicates
    WHERE dups.ctid = dups_find_duplicates.ctid
      AND dups_find_duplicates.is_duplicate

The subquery flags every row as a duplicate or not. A row is a duplicate if its ctid differs from that of the "first" row in its "partition", that is, the group of rows sharing the same "key columns".

In other words, "first" is defined as:

min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])

Then, all rows where is_duplicate is true are deleted by their ctid.
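To make the two steps concrete, here is a minimal Python sketch that mimics what the query does on a small in-memory list. It is only an illustration of the logic, not how PostgreSQL executes it. The sample rows, the payload values, and the function names `find_duplicates` and `delete_duplicates` are all made up for this example, and the ctid is modelled as a (block, offset) tuple.

```python
# Hypothetical rows: (ctid, key_column1, key_column2, payload).
# The ctid is modelled as a (block, offset) tuple; tuples compare
# block first, then offset, which is how tid values are ordered.
rows = [
    ((0, 1), "a", 1, "first a/1"),
    ((0, 2), "a", 1, "second a/1"),
    ((0, 3), "b", 2, "only b/2"),
    ((1, 1), "a", 1, "third a/1"),
]


def find_duplicates(rows):
    """Mimic: ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2)."""
    # First pass: smallest ctid seen in each partition of key columns.
    min_ctid = {}
    for ctid, k1, k2, _payload in rows:
        key = (k1, k2)
        if key not in min_ctid or ctid < min_ctid[key]:
            min_ctid[key] = ctid
    # Second pass: a row is a duplicate if it is not the "first" of its partition.
    return {ctid: ctid != min_ctid[(k1, k2)] for ctid, k1, k2, _payload in rows}


def delete_duplicates(rows):
    """Mimic the DELETE ... USING join on ctid: drop rows flagged is_duplicate."""
    is_duplicate = find_duplicates(rows)
    return [row for row in rows if not is_duplicate[row[0]]]


survivors = delete_duplicates(rows)
for row in survivors:
    print(row)
```

Only the row with the smallest ctid in each partition survives; which physical row that is depends on where the rows happen to be stored, so the choice of "first" is arbitrary from a logical point of view.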

From the documentation, ctid represents (emphasis mine):

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.

answered Jun 23, 2020 at 13:44

LeoRochael
