
I am working on doing a diff between two tables in PostgreSQL, and it takes a long time, as each table is ~13GB. My current queries are:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB; 

and

SELECT * FROM tableB EXCEPT SELECT * FROM tableA; 

When I do a diff on the two (unindexed) tables it takes 1:40 hours (1 hour and 40 minutes). In order to get both the new and the removed rows I need to run the query twice, bringing the total time to 3:20 hours.

I ran PostgreSQL's EXPLAIN on the query to see what it was doing. It looks like it sorts the first table, then the second, and then compares them. That made me think that if I indexed the tables they would be pre-sorted, and the diff query would be much faster.
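For reference, this is the sort of thing I ran (EXPLAIN only shows the plan; EXPLAIN ANALYZE actually executes the query and reports real timings):

EXPLAIN SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
-- or, to execute the query and get actual timings:
EXPLAIN ANALYZE SELECT * FROM tableA EXCEPT SELECT * FROM tableB;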

Indexing each table took 45 minutes. Once indexed, each diff took 1:35 hours. Why do the indexes only shave 5 minutes off each diff? I would have assumed the saving would be more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).

Since one of these tables will not change much, it will only need to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index plus 2 × 1:35 for the diffs, giving a total of 3:55 hours, almost 4 hours.

What am I doing wrong here? I can't see why, with the indexes, my net diff time is larger than without them.

This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries

EDIT: Here is the schema for the two tables; they are identical except for the table name.

CREATE TABLE bulk.blue (
    "partA" text NOT NULL,
    "type"  text NOT NULL,
    "partB" text NOT NULL
) WITH ( OIDS=FALSE );
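The indexes I added are along these lines, one per column (the index names here are just illustrative):

CREATE INDEX blue_parta_idx ON bulk.blue ("partA");
CREATE INDEX blue_type_idx  ON bulk.blue ("type");
CREATE INDEX blue_partb_idx ON bulk.blue ("partB");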
  • Reporting the EXPLAIN ANALYZE output would be very helpful and should always be done when you are doing comparative analysis after making changes. For example, it would have shown that you still were not using the indexes after you added them. BTW, clustering on the index would make it work stupendously faster once you used the correct query.

4 Answers


In the statements above you are not using the indexes.

You could do something like:

SELECT * FROM tableA a FULL OUTER JOIN tableB b ON a.someID = b.someID 

You could then extend the same statement to show which rows are missing from either table:

SELECT * FROM tableA a FULL OUTER JOIN tableB b ON a.someID = b.someID WHERE a.someID IS NULL OR b.someID IS NULL 

This should give you the rows that are missing from either table A or table B.
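Adapted to the actual schema from the question (table names borrowed from the comments below; note that the mixed-case identifiers need double quotes in PostgreSQL), a sketch would look like:

SELECT *
FROM bulk."blueNet" b
FULL OUTER JOIN bulk."redNet" r
    ON b."partA" = r."partA" AND b."partB" = r."partB"
WHERE b."type" IS NULL OR r."type" IS NULL;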


11 Comments

someID could be any field, but it should be indexed.
I'm trying to use your example, but with two conditions in the join (I have 3 columns; two combined are unique, the third is not).
Here is the actual query and error: SELECT * From bulk."redNet" r full outer join bulk."blueNet" b on (r.partA=b.partA) and (r.partB=b.partB) where ISNULL(r.type) or ISNULL(b.type); ERROR: column r.parta does not exist
I found my two problems. "type" had to be in quotes, and there is no isnull() PostgreSQL function; it needs to be: (expression IS NULL). It's running now; I'll time it and see how long it takes.
Like @lanrat already commented: there is no ISNULL() function in PostgreSQL (or standard SQL). Must be col IS NULL.

Confirm your indexes are being used (they likely are not with such a generic EXCEPT statement). Also, since you are not joining against specified column(s), that lack of an explicit join will likely not make for an optimized query:

http://www.postgresql.org/docs/9.0/static/indexes-examine.html

This will help you view the EXPLAIN ANALYZE output more clearly:

http://explain.depesz.com

Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away, for example:
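A minimal example, using the table from the question:

ANALYZE bulk.blue;   -- refresh planner statistics so the new index is costed correctly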

1 Comment

Thanks for that second link, it's quite helpful.

The queries as specified require a comparison of every column of the tables.

For example, if tableA and tableB each have five columns, then the query has to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, . . . tableA.col5 to tableB.col5.

If there are just a few columns that uniquely identify a record, instead of all the columns in the table, then joining the tables on those specific columns will improve your performance.

The above statement assumes that a primary key has not been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
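For example, if (as the comments above suggest) "partA" and "partB" together uniquely identify a row, comparing only those columns keeps the sort narrow; this is a sketch, not something tested against the poster's data:

SELECT "partA", "partB" FROM tableA
EXCEPT
SELECT "partA", "partB" FROM tableB;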


  • What kind of index did you apply? Indexes are only useful to improve WHERE conditions. If you're doing a SELECT *, you're grabbing all the fields, and the index is probably not doing anything but taking up space and adding a little more processing behind the scenes for the db engine to compare the query to the index cache.

  • Instead of SELECT *, you can try selecting your unique fields and creating an index on those unique fields

  • You can also use an OUTER JOIN to show results from both tables that did not match on the unique fields
  • You may want to consider clustering your tables (see the sketch below)
  • What version of Postgres are you running?
  • When was the last time you vacuumed?

Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.
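A rough sketch combining the indexing and clustering suggestions above (the index name is illustrative, and note that CLUSTER rewrites the table under an exclusive lock):

CREATE INDEX blue_part_idx ON bulk.blue ("partA", "partB");  -- index the fields that should be unique
CLUSTER bulk.blue USING blue_part_idx;                       -- physically order the table by that index
ANALYZE bulk.blue;                                           -- refresh planner statistics afterwards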

2 Comments

There are three columns, and all three are indexed. If all the columns are indexed, are you saying "Select *" does not use the index while "Select col1, col2, col3" will use the indexes?
You'll have to look at the query plan to be sure, but yes, that's what I was saying. Postgres is an advanced database, though, so it wouldn't surprise me if it did lookups on those three columns. It'd be more helpful to dump the query plan output into pastebin.com
