
I am working on doing a diff between two tables in PostgreSQL, and it takes a long time, as each table is ~13GB. My current queries are:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB; 

and

SELECT * FROM tableB EXCEPT SELECT * FROM tableA; 

When I do a diff on the two (unindexed) tables it takes 1:40 hours (1 hour and 40 minutes). In order to get both the new and the removed rows I need to run the query twice, bringing the total time to 3:20 hours.

I ran PostgreSQL's EXPLAIN on the query to see what it was doing. It looks like it sorts the first table, then the second, and then compares them. That made me think that if I indexed the tables they would be pre-sorted, and the diff query would be much faster.
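For reference, this is the sort of thing I ran (EXPLAIN only shows the plan; EXPLAIN ANALYZE actually executes the query and reports real timings):

EXPLAIN SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
-- or, to execute the query and get actual timings:
EXPLAIN ANALYZE SELECT * FROM tableA EXCEPT SELECT * FROM tableB;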

Indexing each table took 45 minutes. Once indexed, each diff took 1:35 hours. Why do the indexes only shave 5 minutes off each diff? I would have assumed the saving would be more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).

Since one of these tables will not change much, it will only need to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index plus 2 × 1:35 for the diffs, giving a total of 3:55 hours, almost 4 hours.

What am I doing wrong here? I can't see why, with the indexes, my net diff time is larger than without them.

This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries

EDIT: Here is the schema for the two tables; they are identical except for the table name.

CREATE TABLE bulk.blue (
    "partA" text NOT NULL,
    "type"  text NOT NULL,
    "partB" text NOT NULL
) WITH ( OIDS=FALSE );
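The indexes I added are along these lines, one per column (the index names here are just illustrative):

CREATE INDEX blue_parta_idx ON bulk.blue ("partA");
CREATE INDEX blue_type_idx  ON bulk.blue ("type");
CREATE INDEX blue_partb_idx ON bulk.blue ("partB");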
  • Reporting the EXPLAIN ANALYZE output would be very helpful and should always be done when you are doing comparative analysis after making changes. For example, it would have shown that you still were not using the indexes after you added them. BTW, clustering on the index would make it work stupendously faster once you used the correct query.

4 Answers


In the statements above you are not using the indexes.

You could do something like:

SELECT * FROM tableA a FULL OUTER JOIN tableB b ON a.someID = b.someID 

You could then extend the same statement to show which rows are missing from either table:

SELECT * FROM tableA a FULL OUTER JOIN tableB b ON a.someID = b.someID WHERE a.someID IS NULL OR b.someID IS NULL 

This should give you the rows that are missing from either table A or table B.
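Adapted to the actual schema from the question (table names borrowed from the comments below; note that the mixed-case identifiers need double quotes in PostgreSQL), a sketch would look like:

SELECT *
FROM bulk."blueNet" b
FULL OUTER JOIN bulk."redNet" r
    ON b."partA" = r."partA" AND b."partB" = r."partB"
WHERE b."type" IS NULL OR r."type" IS NULL;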


11 Comments

someID could be any field, but it should be indexed.
I'm trying to use your example, but with two conditions in the join (I have 3 columns; two combined are unique, the third is not).
Here is the actual query and error: SELECT * From bulk."redNet" r full outer join bulk."blueNet" b on (r.partA=b.partA) and (r.partB=b.partB) where ISNULL(r.type) or ISNULL(b.type); ERROR: column r.parta does not exist
I found my two problems. "type" had to be in quotes, and there is no isnull() PostgreSQL function; it needs to be: (expression IS NULL). It's running now; I'll time it and see how long it takes.
Like @lanrat already commented: there is no ISNULL() function in PostgreSQL (or standard SQL). Must be col IS NULL.

Confirm your indexes are being used (they likely are not with such a generic EXCEPT statement). Also, since you are not joining against specified column(s), that lack of an explicit join will likely not make for an optimized query:

http://www.postgresql.org/docs/9.0/static/indexes-examine.html

This will help you view the EXPLAIN ANALYZE output more clearly:

http://explain.depesz.com

Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away, for example:
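A minimal example, using the table from the question:

ANALYZE bulk.blue;   -- refresh planner statistics so the new index is costed correctly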

1 Comment

Thanks for that second link, it's quite helpful.

The queries as specified require a comparison of every column of the tables.

For example, if tableA and tableB each have five columns, then the query has to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, . . . tableA.col5 to tableB.col5.

If there are just a few columns that uniquely identify a record, instead of all the columns in the table, then joining the tables on those specific columns will improve your performance.

The above statement assumes that a primary key has not been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
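For example, if (as the comments above suggest) "partA" and "partB" together uniquely identify a row, comparing only those columns keeps the sort narrow; this is a sketch, not something tested against the poster's data:

SELECT "partA", "partB" FROM tableA
EXCEPT
SELECT "partA", "partB" FROM tableB;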


  • What kind of index did you apply? Indexes are only useful to improve WHERE conditions. If you're doing a SELECT *, you're grabbing all the fields, and the index is probably not doing anything but taking up space and adding a little more processing behind the scenes for the db engine to compare the query to the index cache.

  • Instead of SELECT *, you can try selecting your unique fields and creating an index on those unique fields

  • You can also use an OUTER JOIN to show results from both tables that did not match on the unique fields
  • You may want to consider clustering your tables (see the sketch below)
  • What version of Postgres are you running?
  • When was the last time you vacuumed?

Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.
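A rough sketch combining the indexing and clustering suggestions above (the index name is illustrative, and note that CLUSTER rewrites the table under an exclusive lock):

CREATE INDEX blue_part_idx ON bulk.blue ("partA", "partB");  -- index the fields that should be unique
CLUSTER bulk.blue USING blue_part_idx;                       -- physically order the table by that index
ANALYZE bulk.blue;                                           -- refresh planner statistics afterwards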

2 Comments

There are three columns, and all three are indexed. If all the columns are indexed, are you saying "Select *" does not use the index while "Select col1, col2, col3" will use the indexes?
You'll have to look at the query plan to be sure, but yes, that's what I was saying. Postgres is an advanced database, though, so it wouldn't surprise me if it did lookups on those three columns. It'd be more helpful to dump the query plan output into pastebin.com
