2

I have a table, etablissements with millions of rows, and a geom (st_point) column , spatially indexed (using gist). Those features may have a different category, activite_principale.

I want to find from a given, fixed point the closest companies for 5 or 6 different categories called "activite_principale" (1 closest company per category).

Here's what I did right now:

 (WITH closest_candidates AS ( SELECT ent.id, ent.name, ent.geom FROM geo_data.etablissements ent WHERE ent.activite_principale = '1071C' ORDER BY ent.geom <-> 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry LIMIT 10 ) SELECT id FROM closest_candidates ORDER BY ST_Distance( geom, 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry ) LIMIT 1) UNION ALL (WITH closest_candidates AS ( SELECT ent.id, ent.name, ent.geom FROM geo_data.etablissements ent WHERE ent.activite_principale = '4711D' ORDER BY ent.geom <-> 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry LIMIT 10 ) SELECT id FROM closest_candidates ORDER BY ST_Distance( geom, 'SRID=4326;POINT (5.4153978921979125 43.271437384501965)'::geometry ) LIMIT 1) -- UNION ALL -- [...] And so on... 

I then clustered the etablissements table around the geom spatial index and ran VACUUM ANALYZE geo_data.etablissements;

Here's the result of EXPLAIN ANALYZE after clustering.

The planning is much shorter and the execution too but it's still slow (350-450ms).

I investigated compound index on geom & another text column, but that does not seem to be possible today ?

I use postgres 10 & postgis 2.4.

I don't know how to improve based on those explanations from the query planner. Can I do better performance ?

16
  • 5
    @PolyGeo I think the GIS community is more focused on those kind of performance issues based on KNN search ? Commented Apr 13, 2018 at 10:13
  • 1
    dude, what kind of machine are you on? you are running a KNN search on 'millions of rows' to find the 10 nearest points, then running a second KNN search to find the closest one of that set (this is bogus, just find the closest one in the first place!?)...and you are doing that 5 times with a cascading UNION in under half a second? if you don´t want to do this 1000 times in a row, where´s the problem? ,) Commented Apr 13, 2018 at 10:46
  • 1
    @ThingumaBob I invite you to read this article : workshops.boundlessgeo.com/postgis-intro/knn.html. I am not doing two knn searchs. KNN returns approximative results which is fine because it's fast, then filtering by ST_Distance is only to find the closest one in a proven way. I'm using heroku postgres and this is the "standard-0" plan for prototyping, I will upgrade in the future. Do you think the performance is already good ? Commented Apr 13, 2018 at 12:15
  • 1
    @Pak I see, you're on PostGIS 2.4 (<-> returns true distance from 2.5 on); still, if it's point-to-point, bbox comparison equals true distance (I'm sure that's noted in the article somewhere). I'm going to post a query that I'm curious to see the speed for on your system... Commented Apr 13, 2018 at 12:36
  • 3
    @PolyGeo if someone posted a similar question that dealt with ArcPy code, I don't think it would be treated the same way. The question is not seeking code review - it is seeking advice on the implementation of the logic as it relates to the size of the datasets involved. I don't to harp too much on this subject, as I do with others, but your profile/tags suggest you're more of an ArcGIS expert, so I wonder why would you come to this post to suggest how the answer should be discovered? Commented Apr 13, 2018 at 21:59

1 Answer 1

1

I´m curious, what does your plan say if you run

SELECT DISTINCT ON (a.activite_principale) a.activite_principale, a.id FROM geo_data.etablissements AS a WHERE a.activite_principale IN ('1071C', '4711D', <3rd>, <4th>, <5th>) ORDER BY a.activite_principale, a.geom <-> 'SRID=32632;POINT(363982.8087 5623158.5124)'::geometry 

(Fill in <3rd>, <4th>, <5th>)

5
  • Around 1 s for 4 categories : gist.github.com/kofronpi/4b1fba2d4cdfbfbaa6880e69482274f2 Commented Apr 13, 2018 at 13:01
  • @Pak doesn´t care much for your clustering maybe?...if I run your query and mine on the same test data (500k points, 4 categories, [name, id, geom] columns in select), with equal results, the plan for your query does include Index (only) scans on all four subqueries (contrary to Heap Scans in your plan; did you rund ANALYZE geo_data.etablissements prior to running the queries?), but still takes about double to triple the time on average than my query above... Commented Apr 13, 2018 at 13:50
  • I ran analyze after clustering. That's really interesting. Might be the clustering, did you try after clustering by geom ? I will look into it Commented Apr 13, 2018 at 14:46
  • @Pak also, try your query without the CTEs and by selecting the nearest neighbor directly (KNN + LIMIT 1) (btw, my table wasn't clustered) Commented Apr 13, 2018 at 20:56
  • it seems to be better ! I will benchmark it. Commented Apr 16, 2018 at 8:05

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.