I have a query that is slow on PostgreSQL 9 + PostGIS (running on a Linux computer), despite my efforts to improve its speed.
I basically have a table that contains Polygons called polygones_zones. Actually these polygons come from MultiPolygons that represent land areas broken down into mere polygons in a previous step using the PostGIS function ST_Dump() - the reason for doing this was to improve the speed of the query. This table has about 330,000 rows.
On the other hand, I have a table with Points called sites. This table has about 290,000 rows but when running the query, I currently use only a ~4000 sample of them before using the spatial joint.
I want to know in which areas (the original MultiPolygons) the Points are contained (Postgis function ST_Contains() ).
Everything has a 2D geometry : the Polygons were created using ST_Force2D and on the table with the points, there is a constraint "CHECK (ST_NDims(geom) = 2)". All geometries have been controlled as valid.
I have a GIST index on the geometry columns of both tables and have run a VACUUM ANALYZE and a TABLE REINDEX on both of them just before running the query.
The table definitions :
CREATE TABLE polygones_zones ( pkid serial PRIMARY KEY, category varchar(50) NOT NULL, value varchar (255), origine varchar(255) NOT NULL, origine_id integer, geom geometry(Polygon, 2154) CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2) ); CREATE INDEX polygones_zones_geometry ON polygones_zones USING GIST(geom); CREATE INDEX ON polygones_zones (origine, origine_id) ; CREATE TABLE sites ( pkid serial PRIMARY KEY, origine varchar(20) NOT NULL, origine_id varchar(20) NOT NULL, status varchar(255) NOT NULL, geom geometry(Point, 2154), CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2) ); CREATE UNIQUE INDEX ON sites (origine, origine_id ) ; CREATE INDEX ON sites (status); CREATE INDEX sites_geometry ON sites USING GIST ( geom ); And now the query :
SELECT sites.pkid, zones.origine , zones.origine_id , FALSE, zones.category, zones.value FROM ( SELECT pkid, geom FROM sites WHERE status = 'selected' ) AS sites JOIN ( SELECT DISTINCT ON (origine_id , origine) category, value, geom, origine_id , origine FROM polygones_zones ) AS zones ON ST_Contains(zones.geom , sites.geom) ; The reason for the "DISTINCT ON (origine_id, origine)" is that when the MultiPolygons have been broken down into Polygons, the identifiers of the original MultiPolygons have been recorder in these two fields. Hence, if a multipolygon has been broken into two overlapping polygons that both contain on of the points in the table sites , without the DISTINCT, the JOIN would return two rows instead of one. Using the DISTINCT ON ensures that for a single original MultiPolygon, the JOIN will only return one row.
Running this query on the sample of Points that is about 1/100 of my data takes about 30 mn. That would mean the query on the whole data set should take about 50 hours ! What can I do to improve its speed ?
Edit : after John Powell's pertinent comment, here is the result of EXPLAIN :
"Nested Loop (cost=525286.40..547598.60 rows=24 width=40)" " Join Filter: ((polygones_zones.geom ~ sites.geom) AND _st_contains(polygones_zones.geom, sites.geom))" " -> Index Scan using sites_statut_idx on sites (cost=0.42..4.44 rows=1 width=36)" " Index Cond: ((statut)::text = 'selected'::text)" " -> Unique (cost=525285.98..527794.86 rows=72658 width=1599)" " -> Sort (cost=525285.98..526122.27 rows=334517 width=1599)" " Sort Key: polygones_zones.origine_id, polygones_zones.origine" " -> Seq Scan on polygones_zones (cost=0.00..30379.17 rows=334517 width=1599)"
CREATE INDEX sites_geometry ON sites_candidats USING GIST ( geom );