Improving the speed of a Postgis spatial query

Question

I have a query that is slow on PostgreSQL 9 + PostGIS (running on a Linux computer), despite my efforts to improve its speed.

I basically have a table that contains Polygons called polygones_zones. Actually these polygons come from MultiPolygons that represent land areas broken down into mere polygons in a previous step using the PostGIS function ST_Dump() - the reason for doing this was to improve the speed of the query. This table has about 330,000 rows.

On the other hand, I have a table with Points called sites. This table has about 290,000 rows but when running the query, I currently use only a ~4000 sample of them before using the spatial joint.

I want to know in which areas (the original MultiPolygons) the Points are contained (Postgis function ST_Contains() ).

Everything has a 2D geometry : the Polygons were created using ST_Force2D and on the table with the points, there is a constraint "CHECK (ST_NDims(geom) = 2)". All geometries have been controlled as valid.

I have a GIST index on the geometry columns of both tables and have run a VACUUM ANALYZE and a TABLE REINDEX on both of them just before running the query.

The table definitions :

CREATE TABLE polygones_zones ( pkid serial PRIMARY KEY, category varchar(50) NOT NULL, value varchar (255), origine varchar(255) NOT NULL, origine_id integer, geom geometry(Polygon, 2154) CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2) ); CREATE INDEX polygones_zones_geometry ON polygones_zones USING GIST(geom); CREATE INDEX ON polygones_zones (origine, origine_id) ; CREATE TABLE sites ( pkid serial PRIMARY KEY, origine varchar(20) NOT NULL, origine_id varchar(20) NOT NULL, status varchar(255) NOT NULL, geom geometry(Point, 2154), CONSTRAINT enforce_dims_geom CHECK (ST_NDims(geom) = 2) ); CREATE UNIQUE INDEX ON sites (origine, origine_id ) ; CREATE INDEX ON sites (status); CREATE INDEX sites_geometry ON sites USING GIST ( geom );

And now the query :

SELECT sites.pkid, zones.origine , zones.origine_id , FALSE, zones.category, zones.value FROM ( SELECT pkid, geom FROM sites WHERE status = 'selected' ) AS sites JOIN ( SELECT DISTINCT ON (origine_id , origine) category, value, geom, origine_id , origine FROM polygones_zones ) AS zones ON ST_Contains(zones.geom , sites.geom) ;

The reason for the "DISTINCT ON (origine_id, origine)" is that when the MultiPolygons have been broken down into Polygons, the identifiers of the original MultiPolygons have been recorder in these two fields. Hence, if a multipolygon has been broken into two overlapping polygons that both contain on of the points in the table sites , without the DISTINCT, the JOIN would return two rows instead of one. Using the DISTINCT ON ensures that for a single original MultiPolygon, the JOIN will only return one row.

Running this query on the sample of Points that is about 1/100 of my data takes about 30 mn. That would mean the query on the whole data set should take about 50 hours ! What can I do to improve its speed ?

Edit : after John Powell's pertinent comment, here is the result of EXPLAIN :

"Nested Loop (cost=525286.40..547598.60 rows=24 width=40)" " Join Filter: ((polygones_zones.geom ~ sites.geom) AND _st_contains(polygones_zones.geom, sites.geom))" " -> Index Scan using sites_statut_idx on sites (cost=0.42..4.44 rows=1 width=36)" " Index Cond: ((statut)::text = 'selected'::text)" " -> Unique (cost=525285.98..527794.86 rows=72658 width=1599)" " -> Sort (cost=525285.98..526122.27 rows=334517 width=1599)" " Sort Key: polygones_zones.origine_id, polygones_zones.origine" " -> Seq Scan on polygones_zones (cost=0.00..30379.17 rows=334517 width=1599)"

If this is not a typo you have indexed a wrong table CREATE INDEX sites_geometry ON sites_candidats USING GIST ( geom ); — user30184
– user30184, Commented May 16, 2018 at 9:04
It was a typo mistake. Thanks for pointing at it. I changed the post above. — Darth Kangooroo
– Darth Kangooroo, Commented May 16, 2018 at 9:57
I added the result of EXPLAIN. Obviously, it's the DISTINCT that consumes so much time. Running the query without the DISTINCT is achieved in less than 3s ! — Darth Kangooroo
– Darth Kangooroo, Commented May 16, 2018 at 10:01
Besides, I now realize it probably produces a wrong result : if a MultiPolygon is broken down into several separated Polygons, the DISTINCT will only return one of them (the first fetched) and ignore the others, resulting in actually discarding part of the area represented by the MultiPolygon. — Darth Kangooroo
– Darth Kangooroo, Commented May 16, 2018 at 10:08

Vince · Accepted Answer · 2018-05-16 10:34:00Z

Answer : after John Powell's comment, the EXPLAIN showed that it was the DISTINCT that was taking way too much time. It also made me notice that this DISTINCT was producing wrong results.

I have moved the DISTINCT clause so now the query runs in under 4s :

SELECT DISTINCT ON (origine_id , origine) sites.pkid, zones.origine , zones.origine_id , FALSE, zones.category, zones.value FROM ( SELECT pkid, geom FROM sites WHERE status = 'selected' ) AS sites JOIN polygones_zones AS zones ON ST_Contains(zones.geom , sites.geom)

;

Stack Exchange Network

Improving the speed of a Postgis spatial query

1 Answer 1

Hot Network Questions

Improving the speed of a Postgis spatial query

1 Answer 1

Related

Hot Network Questions