2

I have a taccounts table with columns like account_id(PK), login_name, password, last_login. Now I have to remove some duplicate entries according to a new business logic. So, duplicate accounts will be with either same email or same (login_name & password). The account with the latest login must be preserved.

Here are my attempts (some email values are null and blank)

DELETE FROM taccounts WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN ( SELECT MAX(last_login) FROM taccounts WHERE email is not null and char_length(trim(both ' ' from email))>0 GROUP BY lower(trim(both ' ' from email))) 

Similarly for login_name and password

DELETE FROM taccounts WHERE last_login NOT IN ( SELECT MAX(last_login) FROM taccounts GROUP BY login_name, password) 

Is there any better way or any way to combine these two separate queries?

Also some other table have account_id as foreign key. How to update this change for those tables?` I am using PostgreSQL 9.2.1

EDIT: Some of the email values are null and some of them are blank(''). So, If two accounts have different login_name & password and their emails are null or blank, then they must be considered as two different accounts.

8
  • What is the roughly estimated percentage of duplicates according to the new rules? Is account_id the primary key? What version of PostgreSQL? Can you afford to lock the table for some time (no concurrent access)? Commented Mar 30, 2013 at 11:04
  • @ErwinBrandstetter There are about 4500 rows and about 430 of them are unique. Yes account_id is the primary key. Postgre ver 1.16.0. Locking is actually not required as I am working on some migration thing Commented Mar 30, 2013 at 11:07
  • Please run SELECT version() in your database. 1.16 is probably the version of pgAdmin which you seem to be using. And edit your question with the additional (essential) information. Commented Mar 30, 2013 at 11:10
  • @ErwinBrandstetter I have edited the question with Pk and version info Commented Mar 30, 2013 at 11:15
  • More detail: can last_login happen to be the same for a pair of dupes? And can login_name and password also be NULL. And how to deal with that? Commented Mar 30, 2013 at 12:32

2 Answers 2

1

If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:

  1. SELECT surviving rows into a temporary table.
  2. Reroute FK references to survivors
  3. DELETE all rows from the base table.
  4. Re-INSERT survivors.

1a. Distill surviving rows

CREATE TEMP TABLE tmp AS SELECT DISTINCT ON (login_name, password) * FROM ( SELECT DISTINCT ON (email) * FROM taccounts ORDER BY email, last_login DESC ) sub ORDER BY login_name, password, last_login DESC; 

About DISTINCT ON:

To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".

Inspect results and test for plausibility.

SELECT * FROM tmp; 

Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.

1b. Alternative query for updated definition of "duplicates"

SELECT * FROM taccounts t WHERE NOT EXISTS ( SELECT FROM taccounts t1 WHERE ( NULLIF(t1.email, '') = t.email OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password)) AND (t1.last_login, t1.account_id) > (t.last_login, t.account_id) ); 

This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.

The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. The one with the bigger account_id is chosen in this case - which is unique, since it is the PK.

2a. How to identify all incoming FKs

SELECT c.confrelid::regclass::text AS referenced_table , c.conname AS fk_name , pg_get_constraintdef(c.oid) AS fk_definition FROM pg_attribute a JOIN pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum) WHERE c.confrelid = 'taccounts'::regclass -- (schema-qualified) table name AND c.contype = 'f' ORDER BY 1, contype DESC; 

Only building on the first column of the foreign key. More about that:

Or inspect the Dependents rider in the right hand window of the object browser of pgAdmin after selecting the table taccounts.

2b. Reroute to new primary

If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:

UPDATE referencing_tbl r SET referencing_column = tmp.reference_column FROM tmp JOIN taccounts t1 USING (email) WHERE r.referencing_column = t1.referencing_column AND referencing_column IS DISTINCT FROM tmp.reference_column; UPDATE referencing_tbl r SET referencing_column = tmp.reference_column FROM tmp JOIN taccounts t2 USING (login_name, password) WHERE r.referencing_column = t1.referencing_column AND referencing_column IS DISTINCT FROM tmp.reference_column; 

3. & 4. Go in for the kill

Now, dupes are not referenced any more. Go in for the kill.

ALTER TABLE taccounts DISABLE TRIGGER ALL; DELETE FROM taccounts; VACUUM taccounts; INSERT INTO taccounts SELECT * FROM tmp; ALTER TABLE taccounts ENABLE TRIGGER ALL; 

Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.

Sign up to request clarification or add additional context in comments.

9 Comments

Why not TRUNCATE TABLE taccounts; instead of DELETE + VACUUM ?
+ some of the references will be broken with DISABLE TRIGGER ALL
Interesting. Pardon me if I am wrong but isn't this will eliminate accounts with blank or null emails with a single account? If two accounts have different login_name and password and email is blank or null they should be different accounts
@IgorRomanchenko: since there are only 4500 rows, the performance benefit from TRUNCATE is marginal or non-existing, since DELETE is generally faster for small tables. Also, TRUNCATE is more invasive.
Thanks for such a great well referenced answer. I wish I could give more than 1 vote
|
1

In addition to Erwin's excellent answer, it can often be useful to create in intermediate link-table that relates the old keys with the new ones.

DROP SCHEMA tmp CASCADE; CREATE SCHEMA tmp ; SET search_path=tmp; CREATE TABLE taccounts ( account_id SERIAL PRIMARY KEY , login_name varchar , email varchar , last_login TIMESTAMP ); -- create some fake data INSERT INTO taccounts(last_login) SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs ; UPDATE taccounts SET login_name = 'User_' || (account_id %10)::text , email = 'Joe' || (account_id %9)::text || '@somedomain.tld' ; SELECT * FROM taccounts; -- -- Create (temp) table linking old id <--> new id -- After inspection this table can be used as a source for the FK updates -- and for the final delete. -- CREATE TABLE update_ids AS WITH pairs AS ( SELECT one.account_id AS old_id , two.account_id AS new_id FROM taccounts one JOIN taccounts two ON two.last_login > one.last_login AND ( two.email = one.email OR two.login_name = one.login_name) ) SELECT old_id,new_id FROM pairs pp WHERE NOT EXISTS ( SELECT * FROM pairs nx WHERE nx.old_id = pp.old_id AND nx.new_id > pp.new_id ) ; SELECT * FROM update_ids ; UPDATE other_table_with_fk_to_taccounts dst SET account_id. = ids.new_id FROM update_ids ids WHERE account_id. = ids.old_id ; DELETE FROM taccounts del WHERE EXISTS ( SELECT * FROM update_ids ex WHERE ex.old_id = del.account_id ); SELECT * FROM taccounts; 

Yet another way to accomplish the same is to add a column with a pointer to the preferred key to the table itself and use that for your updates and deletes.

ALTER TABLE taccounts ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id) ; -- find the *better* records for each record. UPDATE taccounts dst SET better_id = src.account_id FROM taccounts src WHERE src.login_name = dst.login_name AND src.last_login > dst.last_login AND src.email IS NOT NULL AND NOT EXISTS ( SELECT * FROM taccounts nx WHERE nx.login_name = dst.login_name AND nx.email IS NOT NULL AND nx.last_login > src.last_login ); -- Find records that *do* have an email address UPDATE taccounts dst SET better_id = src.account_id FROM taccounts src WHERE src.login_name = dst.login_name AND src.email IS NOT NULL AND dst.email IS NULL AND NOT EXISTS ( SELECT * FROM taccounts nx WHERE nx.login_name = dst.login_name AND nx.email IS NOT NULL AND nx.last_login > src.last_login ); SELECT * FROM taccounts ORDER BY account_id; UPDATE other_table_with_fk_to_taccounts dst SET account_id = src.better_id FROM update_ids src WHERE dst.account_id = src.account_id AND src.better_id IS NOT NULL ; DELETE FROM taccounts del WHERE EXISTS ( SELECT * FROM taccounts ex WHERE ex.account_id = del.better_id ); SELECT * FROM taccounts ORDER BY account_id; 

2 Comments

Great..One problem though. May be I was not clear in my question, if email matches then I dont have to worry about matching the login_name ,password. Similarly if login_name and password matches then email can be null or blank or whatever
You could add an extra update statement for the case where email matches, but name differs. (there is a semantic problem here: a row could have more than one replacement candidate) The advantage of these methods is that you can inspect the result before you press the big red button... The disadvantage is that the real data can change wrt the temp table.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.