I have a pretty standard table with a geom column and a gist index on it. I would like to have a fast pagination on this table as it contains millions of rows.

The query involves a ST_Intersects:

SELECT id, geom
FROM products
WHERE ST_Intersects(geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326));

However, ST_Intersects imposes no ordering on the result, and the OFFSET clause adds too large a penalty.

Is there a way to achieve a fast and robust pagination with spatial geometries?

  • This is essentially the same question the poster made a few days ago. See gis.stackexchange.com/questions/374461/… Commented Sep 23, 2020 at 13:17
  • My previous question was about a weird performance w.r.t. the area of a geometry that I solved using EXPLAIN ANALYZE. Here I am asking more generally about implementing constant time pagination with spatial indexes (like keyset pagination). I can rewrite the previous one if you think that's better. Commented Sep 23, 2020 at 13:45
  • What is wrong with the standard solution (select id ... where id > lastMaxID order by id limit 100)? You would have to run the spatial query for each page anyway. Commented Sep 23, 2020 at 14:09
  • It uses an index scan on the id but the spatial index does not kick in, hence the query is more than 10x slower. Commented Sep 23, 2020 at 14:19
  • See this post. It suggests ordering by id+0, or adding OFFSET 0, to prevent the default index from being used. Commented Sep 23, 2020 at 14:22

1 Answer

The spatial index is of less concern here; you want to make sure you have an index on your id column!

Of course, you can order the result set of your query consistently by id (or by any other key), i.e.

SELECT id, geom
FROM products
WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
ORDER BY id;

and use LIMIT/OFFSET as you intended. Inconsistencies can only arise if that table gets rows inserted or updated at previous locations in your key column sequence during pagination requests.
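
A minimal sketch of such a paginated request, assuming a page size of 100 and a request for the third page (both values are illustrative and not from the question), and keeping the %s placeholder style used above:

```sql
-- Third page at 100 rows per page: skip the first 200 rows, return the next 100.
-- The page size and offset are illustrative assumptions.
SELECT id, geom
FROM products
WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
ORDER BY id
LIMIT 100 OFFSET 200;
```

Note that the server still has to produce and discard the 200 skipped rows before it returns the requested page.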


However:

As you noted yourself, this can be a performance bottleneck. The issue is less the LIMIT/OFFSET penalty on large tables than the fact that the costly spatial computation has to be rerun each time you request the next batch. (For a simple SELECT on a few million rows, you probably won't notice the query time growing with the offset that much, especially with clustered data and proper indexing.) This rules out keyset pagination, too: its concept is to utilize the index through a filter clause, but the main query still has to be run every time you request a batch.
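
For reference, keyset pagination combined with the spatial filter could be sketched as follows. The fifth placeholder, the largest id returned on the previous page, is an assumption about what the client keeps track of; for the first page it would be a value below every id in the table.

```sql
-- Keyset pagination sketch.
-- Placeholders 1-4: envelope bounds, as in the queries above.
-- Placeholder 5: largest id of the previous page (assumed to be tracked by the client).
SELECT id, geom
FROM products
WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
  AND id > %s
ORDER BY id
LIMIT 100;

-- Variant suggested in the comments on the question: ordering by the expression
-- id + 0 keeps the planner from using the id index to produce the ordering.
-- Whether that helps depends on the data; compare plans with EXPLAIN ANALYZE.
SELECT id, geom
FROM products
WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
  AND id > %s
ORDER BY id + 0
LIMIT 100;
```

Either way, the bounding-box filter is evaluated again on every request; only the rows before the last seen id are excluded.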

If you can control the transaction flow, a cursor would be the way to go, e.g.

BEGIN;

DECLARE batch CURSOR FOR
  SELECT id, geom
  FROM products
  WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
  ORDER BY id;

-- fetches the first 100 entries and moves the cursor to the 101st position in the result set
FETCH 100 FROM batch;

-- moves the cursor to the 201st position in the result set
MOVE RELATIVE + 100 FROM batch;

-- fetches 100 entries from position 201 to 300 and moves the cursor to the 301st position in the result set
FETCH 100 FROM batch;

CLOSE batch;
COMMIT;

This will return data from any position in consistent time over the result set.

But note that a cursor declared this way (without WITH HOLD) only lives within a single transaction; your client application needs auto-commit turned off, and you will need to manage the transactions yourself.
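
If keeping one transaction open across page requests is not practical, PostgreSQL also accepts WITH HOLD in the cursor declaration. The following is a sketch under the assumption that all page requests go through the same database session (for example, no connection pool handing out a different connection per request); it is a variation on the answer's approach, not part of it.

```sql
BEGIN;

-- WITH HOLD lets the cursor outlive the transaction that created it.
DECLARE batch CURSOR WITH HOLD FOR
  SELECT id, geom
  FROM products
  WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
  ORDER BY id;

-- At commit, the server materializes the cursor's result set for this session.
COMMIT;

-- Later requests on the same session can keep fetching outside a transaction block.
FETCH 100 FROM batch;
FETCH 100 FROM batch;

-- A held cursor stays open until it is closed explicitly or the session ends.
CLOSE batch;
```

The trade-off is that the whole result set is stored server-side at commit time, so this fits best when the filtered set is moderate in size.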

  • As for the issue that the spatial index doesn't kick in: as mentioned in the comments above, the planner likely chooses a seq scan on the (smallish) result set to avoid the index load. Using a CTE (MATERIALIZED for PG 12 and above) and then selecting from it via LIMIT/OFFSET as usual can solve this. Commented Sep 23, 2020 at 14:33
  • Thank you for your answer geozelot! As I was afraid, a solution similar to keyset pagination cannot be implemented. I was hoping there was a hackish way to tell the spatial index where to resume its search and not to re-test for intersection. I did not know about MATERIALIZED, thanks for the tip! Commented Sep 23, 2020 at 14:42
  • @ThomasCoquet Let me add: if you only get a fraction of a table with a few million rows, you can simply CREATE MATERIALIZED VIEW <name> AS (<your_query_with_a_row_number_over_column>); and add an index on it. You can refresh it whenever there is new data, and select from it using LIMIT/OFFSET with likely no noticeable overhead. You save the recomputation, and it is transaction agnostic. Commented Sep 23, 2020 at 14:43
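
The two suggestions in these comments could be sketched roughly as below. The names hits, products_in_bbox and products_in_bbox_rn_idx, as well as the envelope bounds in the view definition, are hypothetical and chosen only for illustration. The view uses literal bounds because a view definition cannot take query parameters.

```sql
-- 1) Materialized CTE (PostgreSQL 12 and above): the spatial filter is evaluated
--    as its own step, then the page is cut from its result.
WITH hits AS MATERIALIZED (
  SELECT id, geom
  FROM products
  WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
)
SELECT id, geom
FROM hits
ORDER BY id
LIMIT 100 OFFSET 200;

-- 2) Materialized view with a row number column (hypothetical name and bounds).
CREATE MATERIALIZED VIEW products_in_bbox AS
SELECT row_number() OVER (ORDER BY id) AS rn, id, geom
FROM products
WHERE geom && ST_MakeEnvelope(-10, 35, 30, 60, 4326);

CREATE UNIQUE INDEX products_in_bbox_rn_idx ON products_in_bbox (rn);

-- Third page of 100 rows, selected by row number range instead of OFFSET.
SELECT id, geom
FROM products_in_bbox
WHERE rn > 200 AND rn <= 300
ORDER BY rn;

-- Recompute the stored result whenever the underlying table has new data.
REFRESH MATERIALIZED VIEW products_in_bbox;
```

Selecting by a range on rn is a small variation on the LIMIT/OFFSET access mentioned in the comment; with the index in place, both should stay cheap on a pre-filtered set.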
