Find difference between two big tables in PostgreSQL

Question

I have two similar tables in Postgres, each with just one 32-byte Latin field (a simple md5 hash). Both tables have ~30,000,000 rows. The tables differ only slightly (10-1000 rows are different).

Is it possible to find the difference between these tables with Postgres? The result should be the 10-1000 rows described above.

This is not a real task; I just want to know how PostgreSQL deals with JOIN-like logic.

Have a look at "How to compare two tables in postgres", and at "How can I speed up a diff between tables?" for speeding up the diff.
– static, Commented Mar 11, 2013 at 2:49

Erwin Brandstetter · Accepted Answer · 2024-04-09 16:01:28Z

EXISTS seems like the best option.

tbl1 is the table with surplus rows in this example:

SELECT * FROM tbl1 WHERE NOT EXISTS (SELECT FROM tbl2 WHERE tbl2.col = tbl1.col);
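
To try the anti-join above without a running server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table and column names follow the answer; the sample values are made up, and the subquery uses SELECT 1 because SQLite does not accept the empty select list that PostgreSQL allows.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption made so the
# example runs anywhere); the sample "hashes" are invented placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (col TEXT PRIMARY KEY);
    CREATE TABLE tbl2 (col TEXT PRIMARY KEY);
""")
conn.executemany("INSERT INTO tbl1 VALUES (?)", [("aaa",), ("bbb",), ("ccc",)])
conn.executemany("INSERT INTO tbl2 VALUES (?)", [("aaa",), ("ccc",)])

# Anti-join: keep only the rows of tbl1 that have no partner in tbl2.
surplus = conn.execute("""
    SELECT col FROM tbl1
    WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col)
""").fetchall()

print(surplus)  # [('bbb',)]
```

On 30,000,000 rows the plan and timing depend on indexes and settings, so this only shows the logic, not the performance.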

If you don't know which table has surplus rows or both have, you can either repeat the above query after switching table names, or:

SELECT * FROM tbl1 FULL OUTER JOIN tbl2 USING (col) WHERE tbl2.col IS NULL OR tbl1.col IS NULL;
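
The first option (repeating the NOT EXISTS query with the table names switched) can be combined into one result with UNION ALL. A small sketch, again using Python's sqlite3 as a stand-in for PostgreSQL; the extra source column and the sample values are assumptions added for illustration.

```python
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (col TEXT PRIMARY KEY);
    CREATE TABLE tbl2 (col TEXT PRIMARY KEY);
    INSERT INTO tbl1 VALUES ('aaa'), ('bbb'), ('ccc');
    INSERT INTO tbl2 VALUES ('aaa'), ('ccc'), ('ddd');
""")

# Two anti-joins, one per direction. The literal 'tbl1' / 'tbl2' column
# (hypothetical name: source) records which table each surplus row is from.
diff = conn.execute("""
    SELECT 'tbl1' AS source, col FROM tbl1
    WHERE NOT EXISTS (SELECT 1 FROM tbl2 WHERE tbl2.col = tbl1.col)
    UNION ALL
    SELECT 'tbl2' AS source, col FROM tbl2
    WHERE NOT EXISTS (SELECT 1 FROM tbl1 WHERE tbl1.col = tbl2.col)
    ORDER BY source, col
""").fetchall()

print(diff)  # [('tbl1', 'bbb'), ('tbl2', 'ddd')]
```

UNION ALL is enough here because the two branches cannot return the same row; plain UNION would add an unnecessary deduplication step.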

Overview of basic techniques in a later post:

Select rows which are not present in other table

Aside: The data type uuid is an efficient way to store md5 hashes.
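
A quick way to see why this fits: an md5 digest is 128 bits, which is exactly the size of a UUID, so the 32-character hex text shrinks to 16 bytes. The sketch below only illustrates the sizes in Python; the input string is an arbitrary example, and the storage behaviour inside PostgreSQL is not measured here.

```python
import hashlib
import uuid

# Arbitrary example input; any bytes would do.
digest_hex = hashlib.md5(b"example").hexdigest()  # 32 hex characters of text
as_uuid = uuid.UUID(hex=digest_hex)               # the same 128 bits as a UUID

# 32 characters as text versus 16 bytes as a UUID value.
print(len(digest_hex), len(as_uuid.bytes))  # 32 16

# The conversion is lossless: the hex form round-trips unchanged.
print(as_uuid.hex == digest_hex)  # True
```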

ThomasH · Accepted Answer · 2018-03-20 10:20:06Z

To augment the existing answers: I use the row() function for the join condition, which lets you compare entire rows. For example, my typical query to see the symmetric difference looks like this:

select * from tbl1 full outer join tbl2 on row(tbl1) = row(tbl2) where tbl1.col is null or tbl2.col is null

When joining ON row() = row(), the .col suffix in the WHERE clause can be omitted (tbl1 IS NULL OR tbl2 IS NULL) to test all columns of the table instead of a single one.

testing_22 · Accepted Answer · 2021-11-10 22:26:19Z

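If you want to find the difference without knowing which table has more rows than the other, you can try this option, which gets all rows present in only one of the two tables. The subqueries must be correlated on the compared column (called col here, as in the other answers); without that condition NOT EXISTS only checks whether the other table is empty:

SELECT * FROM A WHERE NOT EXISTS (SELECT FROM B WHERE B.col = A.col) UNION ALL SELECT * FROM B WHERE NOT EXISTS (SELECT FROM A WHERE A.col = B.col);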

0xCAFEBABE · Accepted Answer · 2013-03-11 07:45:08Z

In my experience, NOT IN with a subquery takes a very long time. I'd do it with an outer join that keeps only the rows without a match:

DELETE FROM table1 WHERE id IN (SELECT table1.id FROM table1 LEFT OUTER JOIN table2 ON table1.hashfield = table2.hashfield WHERE table2.hashfield IS NULL);

And then do the same the other way around for the other table.

Note that NOT IN is different in principle from NOT EXISTS. NULL handling is different, which makes NOT IN more expensive.
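
The NULL difference is easy to reproduce. A minimal sketch, using Python's sqlite3 as a stand-in for PostgreSQL (both follow the standard SQL three-valued logic here); the table and column names come from the answer, the sample rows are made up. A single NULL in table2 makes NOT IN return nothing, while NOT EXISTS still finds the missing row.

```python
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL; sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (hashfield TEXT);
    CREATE TABLE table2 (hashfield TEXT);
    INSERT INTO table1 VALUES ('aaa'), ('bbb');
    INSERT INTO table2 VALUES ('aaa'), (NULL);
""")

# 'bbb' NOT IN ('aaa', NULL) evaluates to NULL (unknown), not TRUE,
# so the row is filtered out and the result is empty.
not_in = conn.execute("""
    SELECT hashfield FROM table1
    WHERE hashfield NOT IN (SELECT hashfield FROM table2)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; the NULL row in
# table2 never matches, so 'bbb' is reported as missing.
not_exists = conn.execute("""
    SELECT hashfield FROM table1
    WHERE NOT EXISTS (
        SELECT 1 FROM table2 WHERE table2.hashfield = table1.hashfield
    )
""").fetchall()

print(not_in)      # []
print(not_exists)  # [('bbb',)]
```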
