More explanation of the concepts used and how the query works.
LeoRochael

Here is a solution using PARTITION BY and the virtual ctid column, which works like a primary key, at least within a single session:

DELETE FROM dups
USING (
    SELECT ctid,
           (ctid != min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])) AS is_duplicate
    FROM dups
) dups_find_duplicates
WHERE dups.ctid = dups_find_duplicates.ctid
  AND dups_find_duplicates.is_duplicate;

The subquery flags every row as a duplicate or not. A row is a duplicate if it shares its "key columns" with the "first" row in its "partition" (the set of rows sharing the same keys) but has a different ctid from that row.

In other words, "first" is defined as:

  • min(ctid) OVER (PARTITION BY key_column1, key_column2 [...])

Then, all rows where is_duplicate is true are deleted by their ctid.
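
To see the marking-and-deleting logic run end to end without a PostgreSQL server, here is a rough sketch using Python's built-in sqlite3 module. It is an analogy, not the query above: SQLite has no ctid and no DELETE ... USING, so this swaps in SQLite's rowid and a WHERE rowid IN (...) subquery. The payload column, the sample rows, and the delete_duplicates helper are made up for illustration, and it assumes an SQLite build with window functions (3.25 or later).

```python
import sqlite3


def delete_duplicates(conn):
    """Keep only the row with the smallest rowid in each key partition.

    Assumption: rowid stands in for PostgreSQL's ctid, and the
    DELETE ... USING join is replaced by a WHERE rowid IN (...) subquery.
    """
    conn.execute(
        """
        DELETE FROM dups
        WHERE rowid IN (
            SELECT rid FROM (
                SELECT rowid AS rid,
                       rowid != min(rowid) OVER (
                           PARTITION BY key_column1, key_column2
                       ) AS is_duplicate
                FROM dups
            )
            WHERE is_duplicate
        )
        """
    )


# Hypothetical sample data: three rows share the key ('a', 1).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dups (key_column1 TEXT, key_column2 INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO dups VALUES (?, ?, ?)",
    [
        ("a", 1, "first"),
        ("a", 1, "second"),
        ("b", 2, "only"),
        ("a", 1, "third"),
        ("b", 3, "other"),
    ],
)

delete_duplicates(conn)

remaining = conn.execute(
    "SELECT key_column1, key_column2, payload FROM dups ORDER BY rowid"
).fetchall()
print(remaining)
```

Only the earliest-inserted row of each key combination survives, which mirrors how min(ctid) picks the "first" row in the PostgreSQL version.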

From the documentation, ctid represents (emphasis mine):

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. A primary key should be used to identify logical rows.
