Remove duplicates from table based on multiple criteria and persist to other table

Question

I have a taccounts table with columns like account_id (PK), login_name, password, email, last_login. I now have to remove duplicate entries according to new business logic: two accounts are duplicates if they have either the same email or the same (login_name, password) pair. The account with the latest login must be preserved.

Here are my attempts (some email values are null and blank)

    DELETE FROM taccounts
    WHERE  email IS NOT NULL
    AND    char_length(trim(both ' ' from email)) > 0
    AND    last_login NOT IN (
       SELECT MAX(last_login)
       FROM   taccounts
       WHERE  email IS NOT NULL
       AND    char_length(trim(both ' ' from email)) > 0
       GROUP  BY lower(trim(both ' ' from email))
       );

Similarly for login_name and password

    DELETE FROM taccounts
    WHERE  last_login NOT IN (
       SELECT MAX(last_login)
       FROM   taccounts
       GROUP  BY login_name, password
       );

Is there any better way or any way to combine these two separate queries?

Also, some other tables have account_id as a foreign key. How do I propagate this change to those tables? I am using PostgreSQL 9.2.1.

EDIT: Some of the email values are null and some of them are blank (''). So, if two accounts have different login_name & password and their emails are null or blank, they must be considered two different accounts.

What is the roughly estimated percentage of duplicates according to the new rules? Is account_id the primary key? What version of PostgreSQL? Can you afford to lock the table for some time (no concurrent access)?
– Erwin Brandstetter, Commented Mar 30, 2013 at 11:04
@ErwinBrandstetter There are about 4500 rows and about 430 of them are unique. Yes, account_id is the primary key. Postgre ver 1.16.0. Locking is actually not required as I am working on some migration thing.
– Anupam, Commented Mar 30, 2013 at 11:07
Please run SELECT version() in your database. 1.16 is probably the version of pgAdmin which you seem to be using. And edit your question with the additional (essential) information.
– Erwin Brandstetter, Commented Mar 30, 2013 at 11:10
@ErwinBrandstetter I have edited the question with PK and version info.
– Anupam, Commented Mar 30, 2013 at 11:15
More detail: can last_login happen to be the same for a pair of dupes? And can login_name and password also be NULL? And how to deal with that?
– Erwin Brandstetter, Commented Mar 30, 2013 at 12:32

Erwin Brandstetter · Accepted Answer · 2021-05-15 23:07:00Z

If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:

1. SELECT surviving rows into a temporary table.
2. Reroute FK references to survivors.
3. DELETE all rows from the base table.
4. Re-INSERT survivors.

1a. Distill surviving rows

    CREATE TEMP TABLE tmp AS
    SELECT DISTINCT ON (login_name, password) *
    FROM  (
       SELECT DISTINCT ON (email) *
       FROM   taccounts
       ORDER  BY email, last_login DESC
       ) sub
    ORDER  BY login_name, password, last_login DESC;

About DISTINCT ON:

Select first row in each GROUP BY group?

To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".
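As a minimal sketch of what a single DISTINCT ON step does, the following uses a hypothetical table demo_accounts with made-up rows (neither is part of the question). It keeps one row per email, and the ORDER BY decides that the kept row is the one with the latest last_login.

```sql
-- Hypothetical demo table and data, for illustration only.
CREATE TEMP TABLE demo_accounts (
    account_id int PRIMARY KEY,
    email      text,
    last_login timestamp
);

INSERT INTO demo_accounts VALUES
    (1, 'a@example.com', '2013-03-01 10:00'),
    (2, 'a@example.com', '2013-03-05 10:00'),  -- latest login for a@example.com
    (3, 'b@example.com', '2013-03-02 10:00');

-- DISTINCT ON keeps the first row of each email group in ORDER BY order;
-- sorting by last_login DESC makes that first row the latest login.
SELECT DISTINCT ON (email) *
FROM   demo_accounts
ORDER  BY email, last_login DESC;
```

With this data the query keeps account_id 2 and 3. Note that DISTINCT ON treats NULL values as equal, so rows with a NULL email would end up in a single group.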

Inspect results and test for plausibility.

SELECT * FROM tmp;

Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.

1b. Alternative query for updated definition of "duplicates"

    SELECT *
    FROM   taccounts t
    WHERE  NOT EXISTS (
       SELECT 1
       FROM   taccounts t1
       WHERE (
               NULLIF(t1.email, '') = t.email
            OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
       AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
       );

This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.
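The reason blank and NULL values never match can be seen in isolation. This is a small sketch with literal values only; the column aliases are made up for illustration.

```sql
-- NULLIF(x, '') turns an empty string into NULL and leaves other values alone.
-- A comparison with NULL yields NULL (not true), so it never counts as a match.
SELECT NULLIF(''::text, '')                  AS empty_becomes_null,  -- NULL
       NULLIF('a@example.com'::text, '')     AS value_kept,          -- a@example.com
       (NULLIF(''::text, '') = '')           AS blank_vs_blank,      -- NULL
       (NULLIF(NULL::text, '') = NULL::text) AS null_vs_null;        -- NULL
```

Because the left side of the comparison can never be an empty string, a row whose own value is '' finds no partner either.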

The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. The one with the bigger account_id is chosen in this case - which is unique, since it is the PK.
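To check the behavior on a small scale, here is a sketch that runs the same query shape against a hypothetical table demo_dupes. The table name and all rows are assumptions for illustration; the rows cover an email match, a login_name/password match with a tie on last_login, and unrelated accounts with NULL or blank emails.

```sql
-- Hypothetical demo table and data, for illustration only.
CREATE TEMP TABLE demo_dupes (
    account_id int PRIMARY KEY,
    login_name text,
    password   text,
    email      text,
    last_login timestamp
);

INSERT INTO demo_dupes VALUES
    (1, 'ann',  'pw1', 'ann@example.com', '2013-03-01'),
    (2, 'ann2', 'pw2', 'ann@example.com', '2013-03-05'),  -- same email as 1, later login
    (3, 'bob',  'pw3', NULL,              '2013-03-02'),
    (4, 'bob',  'pw3', '',                '2013-03-02'),  -- same login/password and same last_login as 3
    (5, 'cat',  'pw5', NULL,              '2013-03-03'),  -- NULL email, no partner
    (6, 'dan',  'pw6', '',                '2013-03-04');  -- blank email, no partner

-- A row survives if no duplicate with a greater (last_login, account_id) exists.
SELECT t.*
FROM   demo_dupes t
WHERE  NOT EXISTS (
   SELECT 1
   FROM   demo_dupes t1
   WHERE (
           NULLIF(t1.email, '') = t.email
        OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
   AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
   )
ORDER  BY t.account_id;
```

With this data the survivors are account_id 2, 4, 5 and 6: row 1 loses to row 2 on last_login, and row 3 loses to row 4 on account_id because their last_login values are equal. If last_login itself can be NULL, the row comparison yields NULL for that pair, so that case would need separate handling.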

2a. How to identify all incoming FKs

    SELECT c.confrelid::regclass::text AS referenced_table
         , c.conname                   AS fk_name
         , pg_get_constraintdef(c.oid) AS fk_definition
    FROM   pg_attribute  a
    JOIN   pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
    WHERE  c.confrelid = 'taccounts'::regclass  -- (schema-qualified) table name
    AND    c.contype   = 'f'
    ORDER  BY 1, contype DESC;

Only building on the first column of the foreign key. More about that:

Find the referenced table name using table, field and schema name

Or inspect the Dependents tab in the right-hand window of the object browser of pgAdmin after selecting the table taccounts.
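For the rerouting step it can also help to see the referencing table and column by name. The following is a sketch of one possible variant, not the query from this answer: it reads pg_constraint.conrelid (the table that owns the FK) and lists every column in conkey, so multi-column foreign keys produce one row per column. The tables demo_fk_parent and demo_fk_child are hypothetical and exist only to make the example runnable.

```sql
-- Hypothetical demo tables, for illustration only.
CREATE TEMP TABLE demo_fk_parent (account_id int PRIMARY KEY);
CREATE TEMP TABLE demo_fk_child (
    child_id   int PRIMARY KEY,
    account_id int REFERENCES demo_fk_parent (account_id)
);

-- One row per referencing column of each FK pointing at demo_fk_parent.
SELECT c.conrelid::regclass::text AS referencing_table
     , a.attname                  AS referencing_column
     , c.conname                  AS fk_name
FROM   pg_constraint c
JOIN   pg_attribute  a ON a.attrelid = c.conrelid
                      AND a.attnum   = ANY (c.conkey)
WHERE  c.confrelid = 'demo_fk_parent'::regclass  -- replace with the real table name
AND    c.contype   = 'f'
ORDER  BY 1, 2;
```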

2b. Reroute to new primary

If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:

    -- Reroute references from dupes that share an email with a survivor
    UPDATE referencing_tbl r
    SET    referencing_column = tmp.reference_column
    FROM   tmp
    JOIN   taccounts t1 USING (email)
    WHERE  r.referencing_column = t1.reference_column
    AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;

    -- Reroute references from dupes that share (login_name, password) with a survivor
    UPDATE referencing_tbl r
    SET    referencing_column = tmp.reference_column
    FROM   tmp
    JOIN   taccounts t2 USING (login_name, password)
    WHERE  r.referencing_column = t2.reference_column
    AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;
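A small worked sketch of the email-based reroute may make the join easier to follow. All three tables (demo_all standing in for taccounts, demo_survivors for tmp, demo_refs for a referencing table) and their rows are hypothetical.

```sql
-- Hypothetical demo tables and data, for illustration only.
CREATE TEMP TABLE demo_all       (account_id int PRIMARY KEY, email text);
CREATE TEMP TABLE demo_survivors (account_id int PRIMARY KEY, email text);
CREATE TEMP TABLE demo_refs      (ref_id int PRIMARY KEY, account_id int);

INSERT INTO demo_all       VALUES (1, 'a@example.com'), (2, 'a@example.com'), (3, 'b@example.com');
INSERT INTO demo_survivors VALUES (2, 'a@example.com'), (3, 'b@example.com');
INSERT INTO demo_refs      VALUES (10, 1), (11, 2), (12, 3);

-- For every referencing row, find the account it points to (a), then the
-- survivor with the same email (s), and repoint the row if they differ.
UPDATE demo_refs r
SET    account_id = s.account_id
FROM   demo_survivors s
JOIN   demo_all a USING (email)
WHERE  r.account_id = a.account_id
AND    r.account_id IS DISTINCT FROM s.account_id;

SELECT * FROM demo_refs ORDER BY ref_id;
```

After the update, ref_id 10 points to account 2 instead of the duplicate account 1; the other two rows are untouched. Since USING (email) is an equality join, accounts with a NULL email are never matched by this statement.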

3. & 4. Go in for the kill

Now, dupes are not referenced any more. Go in for the kill.

    ALTER TABLE taccounts DISABLE TRIGGER ALL;
    DELETE FROM taccounts;
    VACUUM taccounts;
    INSERT INTO taccounts SELECT * FROM tmp;
    ALTER TABLE taccounts ENABLE TRIGGER ALL;

Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.
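Since referential integrity is not checked while triggers are disabled, it may be worth verifying it by hand afterwards. The following is a sketch of such a check, using hypothetical tables demo_parent (standing in for taccounts) and demo_child (standing in for a referencing table); it lists referencing rows whose target no longer exists.

```sql
-- Hypothetical demo tables and data, for illustration only.
CREATE TEMP TABLE demo_parent (account_id int PRIMARY KEY);
CREATE TEMP TABLE demo_child  (child_id int PRIMARY KEY, account_id int);

INSERT INTO demo_parent VALUES (2), (4);
INSERT INTO demo_child  VALUES (10, 2), (11, 4), (12, 3);  -- 3 has no parent row

-- Orphan check: referencing rows that point to a missing account.
SELECT c.*
FROM   demo_child c
WHERE  c.account_id IS NOT NULL
AND    NOT EXISTS (
   SELECT 1
   FROM   demo_parent p
   WHERE  p.account_id = c.account_id
   );
```

Here the check returns child_id 12. Against the real tables, an empty result would suggest that all incoming references were rerouted before the delete.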

Why not TRUNCATE TABLE taccounts; instead of DELETE + VACUUM? Also, some of the references will be broken with DISABLE TRIGGER ALL.
Interesting. Pardon me if I am wrong, but won't this collapse all accounts with blank or null emails into a single account? If two accounts have different login_name and password and the email is blank or null, they should be different accounts.
@IgorRomanchenko: since there are only 4500 rows, the performance benefit from TRUNCATE is marginal or nonexistent, since DELETE is generally faster for small tables. Also, TRUNCATE is more invasive.
Thanks for such a great well referenced answer. I wish I could give more than 1 vote

Answer · 2021-05-15 22:46:10Z

In addition to Erwin's excellent answer, it can often be useful to create an intermediate link table that relates the old keys to the new ones.

    DROP SCHEMA IF EXISTS tmp CASCADE;
    CREATE SCHEMA tmp;
    SET search_path = tmp;

    CREATE TABLE taccounts (
       account_id SERIAL PRIMARY KEY
     , login_name varchar
     , email      varchar
     , last_login TIMESTAMP
    );

    -- create some fake data
    INSERT INTO taccounts (last_login)
    SELECT gs
    FROM   generate_series('2013-03-30 14:00:00', '2013-03-30 15:00:00', '1min'::interval) gs;

    UPDATE taccounts
    SET    login_name = 'User_' || (account_id % 10)::text
         , email      = 'Joe' || (account_id % 9)::text || '@somedomain.tld';

    SELECT * FROM taccounts;

    --
    -- Create (temp) table linking old id <--> new id
    -- After inspection this table can be used as a source for the FK updates
    -- and for the final delete.
    --
    CREATE TABLE update_ids AS
    WITH pairs AS (
       SELECT one.account_id AS old_id
            , two.account_id AS new_id
       FROM   taccounts one
       JOIN   taccounts two ON two.last_login > one.last_login
                           AND (two.email = one.email OR two.login_name = one.login_name)
       )
    SELECT old_id, new_id
    FROM   pairs pp
    WHERE  NOT EXISTS (
       SELECT *
       FROM   pairs nx
       WHERE  nx.old_id = pp.old_id
       AND    nx.new_id > pp.new_id
       );

    SELECT * FROM update_ids;

    UPDATE other_table_with_fk_to_taccounts dst
    SET    account_id = ids.new_id
    FROM   update_ids ids
    WHERE  dst.account_id = ids.old_id;

    DELETE FROM taccounts del
    WHERE  EXISTS (
       SELECT *
       FROM   update_ids ex
       WHERE  ex.old_id = del.account_id
       );

    SELECT * FROM taccounts;
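When inspecting a link table like update_ids, one thing that may be worth looking for is chains, where the replacement for one row is itself marked for replacement. The sketch below uses a hypothetical table demo_update_ids with made-up pairs; it is an assumption about how such a check could look, not part of the method above.

```sql
-- Hypothetical link table and data, for illustration only.
CREATE TEMP TABLE demo_update_ids (old_id int PRIMARY KEY, new_id int);

INSERT INTO demo_update_ids VALUES
    (1, 2),   -- 1 is replaced by 2 ...
    (2, 3),   -- ... but 2 is itself replaced by 3
    (5, 6);

-- Pairs whose new_id is scheduled for deletion as another pair's old_id.
SELECT a.old_id, a.new_id
FROM   demo_update_ids a
WHERE  EXISTS (
   SELECT 1
   FROM   demo_update_ids b
   WHERE  b.old_id = a.new_id
   );
```

Here the check reports the pair (1, 2). If such rows show up in the real data, a referencing row rerouted from 1 to 2 would end up pointing at an account that the final DELETE removes, so the mapping would need to be resolved to the end of the chain first.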

Yet another way to accomplish the same is to add a column with a pointer to the preferred key to the table itself and use that for your updates and deletes.

    ALTER TABLE taccounts
       ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id);

    -- find the *better* records for each record.
    UPDATE taccounts dst
    SET    better_id = src.account_id
    FROM   taccounts src
    WHERE  src.login_name = dst.login_name
    AND    src.last_login > dst.last_login
    AND    src.email IS NOT NULL
    AND    NOT EXISTS (
       SELECT *
       FROM   taccounts nx
       WHERE  nx.login_name = dst.login_name
       AND    nx.email IS NOT NULL
       AND    nx.last_login > src.last_login
       );

    -- Find records that *do* have an email address
    UPDATE taccounts dst
    SET    better_id = src.account_id
    FROM   taccounts src
    WHERE  src.login_name = dst.login_name
    AND    src.email IS NOT NULL
    AND    dst.email IS NULL
    AND    NOT EXISTS (
       SELECT *
       FROM   taccounts nx
       WHERE  nx.login_name = dst.login_name
       AND    nx.email IS NOT NULL
       AND    nx.last_login > src.last_login
       );

    SELECT * FROM taccounts ORDER BY account_id;

    -- better_id lives in taccounts itself, so that is the source for the FK update
    UPDATE other_table_with_fk_to_taccounts dst
    SET    account_id = src.better_id
    FROM   taccounts src
    WHERE  dst.account_id = src.account_id
    AND    src.better_id IS NOT NULL;

    DELETE FROM taccounts del
    WHERE  EXISTS (
       SELECT *
       FROM   taccounts ex
       WHERE  ex.account_id = del.better_id
       );

    SELECT * FROM taccounts ORDER BY account_id;

Great. One problem though. Maybe I was not clear in my question: if the email matches, then I don't have to worry about matching login_name and password. Similarly, if login_name and password match, then the email can be null or blank or whatever.
You could add an extra update statement for the case where email matches, but name differs. (there is a semantic problem here: a row could have more than one replacement candidate) The advantage of these methods is that you can inspect the result before you press the big red button... The disadvantage is that the real data can change wrt the temp table.
