PostGIS Spatial Relationships query performance

Question

I have a very large dataset contains over 700 million points and a polygon dataset as a buffer zone.

My task is to extract all points inside the buffer zone and create a new table.

Below is my code. I test it with a small point dataset and it works fine.

create table schema1.result as select point.* from schema1.site as point, schema2.buffer as poly Where ST_Intersects(point.geo_loc,poly.wkb_geometry);

Unfortunately, the query lasted for 1 day and showed no signs to finish.

Is there any advice to optimise my code to speed up the query?

Update: The output of Explain

"Nested Loop (cost=0.41..17773703.88 rows=6789472 width=208)"

" -> Seq Scan on buffer poly (cost=0.00..18.50 rows=850 width=32)"

" -> Index Scan using idx_site on site point (cost=0.41..20902.23 rows=799 width=208)"

" Index Cond: (geo_loc && poly.wkb_geometry)"

" Filter: st_intersects(geo_loc, poly.wkb_geometry)"

"JIT:"

" Functions: 6"

" Options: Inlining true, Optimization true, Expressions true, Deforming true"

Do you have a GiST index on point.geo_loc and/or poly.wkb_geometry? Could you show the output of explain with the query? — Mike T
– Mike T, Commented Oct 28, 2020 at 23:11
Hi @MikeT, Thanks for your reply. I believe I have created index on both point.geo_loc and poly.wkb_geometry. Also, I have added the Explain output to the original post. — Scorpioooooon21
– Scorpioooooon21, Commented Oct 28, 2020 at 23:25
have you ANALYZEd the tables after creating the indices? Also, someone else may correct me, but you could add AND ST_DWITHIN(point.geo_loc,poly.wkb_geometry, 1) to the where and it may use the spatial indices more efficiently/at all. — Hugh_Kelley
– Hugh_Kelley, Commented Oct 28, 2020 at 23:55
What is the min, max and avg ST_NPoints of your polygons? How many are there? Do they all have a regular shape? — geozelot
– geozelot, Commented Oct 29, 2020 at 8:00
And most importantly, do they overlap? Since, if a point may be found in more than one poly, a JOIN will fetch it for each match; you'd need to add a DISTINCT which will be significantly slower. In that case, an EXISTS may indeed be the better choice. If, however, a point always only intersects with one polygon, a JOIN is likely the better plan. — geozelot
– geozelot, Commented Oct 29, 2020 at 8:25

Hugh_Kelley · Accepted Answer · 2020-10-30 13:11:04Z

6

I think what's going on here is that when you write:

from schema1.site as point, schema2.buffer as poly

PostgreSQL is doing a CROSS JOIN between the two tables. When multiple tables are listed in the FROM clause postgres uses a CROSS JOIN source that results in a table with a number of rows equal to the Cartesian product of the two tables source.

to avoid that you can use WHERE EXISTS as:

SELECT * FROM schema1.site AS point_table WHERE EXISTS( SELECT 1 FROM schema2.buffer AS poly_table where ST_Intersects(point_table.geom, poly_table.geom) );

This allows PostgreSQL to calculate only the spatial join and not have to build a temporary table of the full cross join which takes forever.

edited Oct 30, 2020 at 13:11

answered Oct 29, 2020 at 0:14

Hugh_Kelley

7792 gold badges7 silver badges22 bronze badges

Note that the assumption is to create a new table; PG should handle this elegantly without writing a temporary merge table first; also, there are no hash join nodes in the plan. The CROSS JOIN should get translated to an (usually much faster) INNER JOIN with the given WHERE filter, but if the polygons overlap, the result will have (costly) duplicates. This also invokes the subquery 700M times, run against the less effective polygon index. I do love myself an EXISTS whenever possible, but I suspect it is not going to beat a JOIN if OP can save the DISTINCT.

geozelot
– geozelot

2020-10-29 08:58:26 +00:00
Commented Oct 29, 2020 at 8:58
If no join qualifier is specified, the default is to compute an INNER join.

dr_jts
– dr_jts

2020-10-29 14:54:39 +00:00
Commented Oct 29, 2020 at 14:54
@dr_jts is my answer above not an improvement over the original method used in the question or are you pointing out that it's not as good as the answer you posted? Also, I'm not seeing anything on google for "postgres join qualifier" is there a specific definition of this term?

Hugh_Kelley
– Hugh_Kelley

2020-10-29 15:10:13 +00:00
Commented Oct 29, 2020 at 15:10
@Hugh_Kelley I was just pointing out that if no JOIN type is specified, then the default is INNER. Postgres reference is here: postgresql.org/docs/current/…. And actually I do suspect that using the point table as the driving table is going to be less efficient than the reverse.

dr_jts
– dr_jts

2020-10-29 18:47:17 +00:00
Commented Oct 29, 2020 at 18:47
1

@dr_jts, thanks, not intending to be argumentative, trying to understand what's going on with the query. I need to get a better conceptual understanding of the spatial join operations.

Hugh_Kelley
– Hugh_Kelley

2020-10-30 13:02:10 +00:00
Commented Oct 30, 2020 at 13:02

| Show 1 more comment

dr_jts · Accepted Answer · 2020-10-29 05:34:02Z

It might be faster to use the polygons as the driving table, using the point table index to filter down the large number of records. This allows PostGIS to optimize the ST_Intersects spatial predicate by preparing each polygon.

This can be forced using LATERAL:

SELECT pt.* FROM schema2.buffer AS poly JOIN LATERAL (SELECT * FROM schema1.site) AS pt ON ST_Intersects(poly.wkb_geometry, pt.geo_loc);

If the polygons have a large number of vertices, it can be faster to use ST_Subdivide to fragment them before doing the point query:

WITH poly AS ( SELECT ST_Subdivide(wkb_geometry) AS geom FROM schema2.buffer ) SELECT pt.* FROM poly JOIN LATERAL (SELECT * FROM schema1.site) AS pt ON ST_Intersects(poly.geom, pt.geo_loc);

+1 for subdivision, if applicable here. At least with PG>10, the planner should choose the more effective (larger) index no matter the written JOIN order. That's also what OPs EXPLAIN ANALYZE suggests. — geozelot
– geozelot, Commented Oct 29, 2020 at 9:02
@dr_jts Thanks for your advice. I also suppose that to use the polygons as the driving table can be a bit faster, but somehow this query took a bit longer than answer from Hugh_Kelley. But anyway, LATERAL is quite a novel join method for me to learn. — Scorpioooooon21
– Scorpioooooon21, Commented Nov 4, 2020 at 5:55
Oh, interesting. Relative performance might depend on the nature of the data. — dr_jts
– dr_jts, Commented Nov 4, 2020 at 17:07

Stack Exchange Network

PostGIS Spatial Relationships query performance

2 Answers 2

Hot Network Questions

PostGIS Spatial Relationships query performance

2 Answers 2

Related

Hot Network Questions