Finding nearest point for each distinct category in performant way using big PostGIS table?

Question

I have a table, etablissements with millions of rows, and a geom (st_point) column , spatially indexed (using gist). Those features may have a different category, activite_principale.

I want to find from a given, fixed point the closest companies for 5 or 6 different categories called "activite_principale" (1 closest company per category).

Here's what I did right now:

 (WITH closest_candidates AS ( SELECT ent.id, ent.name, ent.geom FROM geo_data.etablissements ent WHERE ent.activite_principale = '1071C' ORDER BY ent.geom <-> 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry LIMIT 10 ) SELECT id FROM closest_candidates ORDER BY ST_Distance( geom, 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry ) LIMIT 1) UNION ALL (WITH closest_candidates AS ( SELECT ent.id, ent.name, ent.geom FROM geo_data.etablissements ent WHERE ent.activite_principale = '4711D' ORDER BY ent.geom <-> 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry LIMIT 10 ) SELECT id FROM closest_candidates ORDER BY ST_Distance( geom, 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry ) LIMIT 1) -- UNION ALL -- [...] And so on...

I then clustered the etablissements table around the geom spatial index and ran VACUUM ANALYZE geo_data.etablissements;

Here's the result of EXPLAIN ANALYZE after clustering.

The planning is much shorter and the execution too but it's still slow (350-450ms).

I investigated compound index on geom & another text column, but that does not seem to be possible today ?

I use postgres 10 & postgis 2.4.

I don't know how to improve based on those explanations from the query planner. Can I do better performance ?

@PolyGeo I think the GIS community is more focused on those kind of performance issues based on KNN search ? — Pak
– Pak, Commented Apr 13, 2018 at 10:13
dude, what kind of machine are you on? you are running a KNN search on 'millions of rows' to find the 10 nearest points, then running a second KNN search to find the closest one of that set (this is bogus, just find the closest one in the first place!?)...and you are doing that 5 times with a cascading UNION in under half a second? if you don´t want to do this 1000 times in a row, where´s the problem? ,) — geozelot
– geozelot, Commented Apr 13, 2018 at 10:46
@ThingumaBob I invite you to read this article : workshops.boundlessgeo.com/postgis-intro/knn.html. I am not doing two knn searchs. KNN returns approximative results which is fine because it's fast, then filtering by ST_Distance is only to find the closest one in a proven way. I'm using heroku postgres and this is the "standard-0" plan for prototyping, I will upgrade in the future. Do you think the performance is already good ? — Pak
– Pak, Commented Apr 13, 2018 at 12:15
@Pak I see, you're on PostGIS 2.4 (<-> returns true distance from 2.5 on); still, if it's point-to-point, bbox comparison equals true distance (I'm sure that's noted in the article somewhere). I'm going to post a query that I'm curious to see the speed for on your system... — geozelot
– geozelot, Commented Apr 13, 2018 at 12:36
@PolyGeo if someone posted a similar question that dealt with ArcPy code, I don't think it would be treated the same way. The question is not seeking code review - it is seeking advice on the implementation of the logic as it relates to the size of the datasets involved. I don't to harp too much on this subject, as I do with others, but your profile/tags suggest you're more of an ArcGIS expert, so I wonder why would you come to this post to suggest how the answer should be discovered? — Inactivated Account
– Inactivated Account, Commented Apr 13, 2018 at 21:59

geozelot · Accepted Answer · 2018-04-13 12:49:01Z

1

I´m curious, what does your plan say if you run

SELECT DISTINCT ON (a.activite_principale) a.activite_principale, a.id FROM geo_data.etablissements AS a WHERE a.activite_principale IN ('1071C', '4711D', <3rd>, <4th>, <5th>) ORDER BY a.activite_principale, a.geom <-> 'SRID=32632;POINT(363982.8087 5623158.5124)'::geometry

(Fill in <3rd>, <4th>, <5th>)

answered Apr 13, 2018 at 12:49

geozelot

31.5k4 gold badges38 silver badges59 bronze badges

Around 1 s for 4 categories : gist.github.com/kofronpi/4b1fba2d4cdfbfbaa6880e69482274f2

Pak
– Pak

2018-04-13 13:01:36 +00:00
Commented Apr 13, 2018 at 13:01
@Pak doesn´t care much for your clustering maybe?...if I run your query and mine on the same test data (500k points, 4 categories, [name, id, geom] columns in select), with equal results, the plan for your query does include Index (only) scans on all four subqueries (contrary to Heap Scans in your plan; did you rund ANALYZE geo_data.etablissements prior to running the queries?), but still takes about double to triple the time on average than my query above...

geozelot
– geozelot

2018-04-13 13:50:17 +00:00
Commented Apr 13, 2018 at 13:50
I ran analyze after clustering. That's really interesting. Might be the clustering, did you try after clustering by geom ? I will look into it

Pak
– Pak

2018-04-13 14:46:23 +00:00
Commented Apr 13, 2018 at 14:46
@Pak also, try your query without the CTEs and by selecting the nearest neighbor directly (KNN + LIMIT 1) (btw, my table wasn't clustered)

geozelot
– geozelot

2018-04-13 20:56:56 +00:00
Commented Apr 13, 2018 at 20:56
it seems to be better ! I will benchmark it.

Pak
– Pak

2018-04-16 08:05:28 +00:00
Commented Apr 16, 2018 at 8:05

Add a comment |

Stack Exchange Network

Finding nearest point for each distinct category in performant way using big PostGIS table?

1 Answer 1

Hot Network Questions

Finding nearest point for each distinct category in performant way using big PostGIS table?

1 Answer 1

Related

Hot Network Questions