I am diffing two tables in PostgreSQL, and it takes a long time because each table is ~13 GB. My current queries are:
SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
and
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;
Diffing the two (unindexed) tables takes 1:40 hours (1 hour and 40 minutes). To get both the new and the removed rows I need to run the query twice, once in each direction, bringing the total time to 3:20 hours.
I ran PostgreSQL's EXPLAIN on the query to see what it was doing. It looks like it sorts the first table, then the second, and then compares them. That made me think that if I indexed the tables they would be presorted and the diff query would be much faster.
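For reference, the plan inspection described above can be sketched like this. It is only a minimal example using the placeholder names tableA and tableB from the queries above; the plan it prints depends on the PostgreSQL version, settings, and data.

```sql
-- Plan only: prints the estimated plan without executing the query.
EXPLAIN
SELECT * FROM tableA
EXCEPT
SELECT * FROM tableB;
```

In the output, the things to look for are the sort steps and whether each table is read with a sequential scan or an index scan.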
Indexing each table took 45 minutes. Once indexed, each diff took 1:35 hours. Why do the indexes shave only 5 minutes off each diff? I would have expected the time to drop by more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).
Since one of these tables will not change much, it only needs to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index plus 2 x 1:35 for the diffs, giving a total of 3:55 hours, almost 4 hours.
What am I doing wrong here? I can't see why my net diff time is larger with the index than without it.
This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries
EDIT: Here is the schema for the two tables; they are identical except for the table name.
CREATE TABLE bulk.blue (
    "partA" text NOT NULL,
    "type" text NOT NULL,
    "partB" text NOT NULL
) WITH ( OIDS=FALSE );
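Given this schema, one possible way to avoid running the diff twice is a single FULL OUTER JOIN that reports rows missing from either side. This is only a sketch under assumptions: tableA and tableB are the placeholder names from the queries above, both are assumed to have exactly these three NOT NULL columns, and the `side` column is a label made up for this example. Unlike EXCEPT, it does not remove duplicate rows, so results can differ if a table contains duplicates. It has not been timed on tables of this size.

```sql
-- Rows present in only one of the two tables, found in one pass.
-- Relies on "partA" being NOT NULL: a NULL on one side means "no match".
SELECT
    COALESCE(a."partA", b."partA") AS "partA",
    COALESCE(a."type",  b."type")  AS "type",
    COALESCE(a."partB", b."partB") AS "partB",
    CASE WHEN b."partA" IS NULL THEN 'only_in_A'
         ELSE 'only_in_B'
    END AS side                      -- hypothetical label column
FROM tableA AS a
FULL OUTER JOIN tableB AS b
    ON  a."partA" = b."partA"
    AND a."type"  = b."type"
    AND a."partB" = b."partB"
WHERE a."partA" IS NULL              -- row exists only in tableB
   OR b."partA" IS NULL;             -- row exists only in tableA
```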
EXPLAIN ANALYZE would be very helpful and should always be run when you are doing comparative analysis after making changes. For example, it would have shown that your query still was not using the indexes after you added them. BTW, clustering on the index would make it work stupendously faster once you use the correct query.
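A rough sketch of those two suggestions, under stated assumptions: the index name blue_all_cols_idx is hypothetical, bulk.blue is the table from the schema in the question, and tableB stands in for the second table, whose real name the question does not give. The comment does not say what the "correct query" is, so the original EXCEPT query is used here purely to show how to check the plan.

```sql
-- Hypothetical index covering all three columns of the table.
CREATE INDEX blue_all_cols_idx
    ON bulk.blue ("partA", "type", "partB");

-- Physically rewrite the table in index order.
-- CLUSTER holds an exclusive lock on the table while it runs.
CLUSTER bulk.blue USING blue_all_cols_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE bulk.blue;

-- Unlike plain EXPLAIN, this executes the query and reports the
-- actual plan, row counts, and timings for each step.
EXPLAIN ANALYZE
SELECT * FROM bulk.blue
EXCEPT
SELECT * FROM tableB;
```

Because EXPLAIN ANALYZE really runs the statement, on tables of this size it will take about as long as the diff itself. CLUSTER is a one-time reordering, so a table that is updated daily would need to be re-clustered to stay in index order.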