22

I have two similar tables in Postgres with just one 32-byte latin field (simple md5 hash). Both tables have ~30,000,000 rows. Tables have little difference (10-1000 rows are different)

Is it possible with Postgres to find a difference between these tables, the result should be 10-1000 rows I described above.

This is not a real task, I just want to know about how PostgreSQL deals with JOIN-like logic.

1

4 Answers 4

41

EXISTS seems like the best option.

tbl1 is the table with surplus rows in this example:

SELECT * FROM tbl1 WHERE NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col); 

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

SELECT * FROM tbl1 FULL OUTER JOIN tbl2 USING (col) WHERE tbl2.col IS NULL OR tbl1.col IS NULL; 

Overview over basic techniques in a later post:

Aside: The data type uuid is efficient for md5 hashes:

Sign up to request clarification or add additional context in comments.

Comments

7

To augment existing answers I use the row() function for the join condition. This allows you to compare entire rows. E.g. my typical query to see the symmetric difference looks like this

select * from tbl1 full outer join tbl2 on row(tbl1) = row(tbl2) where tbl1.col is null or tbl2.col is null 

1 Comment

.col can be omitted to test against all columns of table when using ON row() = row()
2

If you want to find the difference without knowing which table has more rows than other, you can try this option that get all rows present in either tables:

SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B) UNION SELECT * FROM B WHERE NOT EXISTS (SELECT * FROM A) 

Comments

-2

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an inclusive join:

DELETE FROM table1 where ID IN ( SELECT id FROM table1 LEFT OUTER JOIN table2 on table1.hashfield = table2.hashfield WHERE table2.hashfield IS NULL) 

And then do the same the other way around for the other table.

1 Comment

Note that NOT IN is different in principal from NOT EXISTS. NULL handling is different, which makes NOT IN more expensive.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.