3

I have an events table with 30 million rows. The following query returns in 25 seconds:

SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" IS NULL
  AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') @> lower(time_range)
  AND ("status" IS NULL OR (status->>'pre_processed') IS NULL)

status is a jsonb column with an index on status->>'pre_processed', and time_range is of type TSTZRANGE. Here are the other indexes that were created on the events table:

CREATE INDEX events_time_range_idx ON events USING gist (time_range);
CREATE INDEX events_lower_time_range_index ON events (lower(time_range));
CREATE INDEX events_upper_time_range_index ON events (upper(time_range));
CREATE INDEX events_calendar_id_index ON events (calendar_id);

I'm definitely out of my comfort zone on this and am trying to reduce the query time. Here's the output of explain analyze:

HashAggregate  (cost=7486635.89..7486650.53 rows=1464 width=48) (actual time=26989.272..26989.306 rows=98 loops=1)
  Group Key: events.id, calendars.user_id
  ->  Nested Loop Left Join  (cost=0.42..7486628.57 rows=1464 width=48) (actual time=316.110..26988.941 rows=98 loops=1)
        ->  Seq Scan on events  (cost=0.00..7475629.43 rows=1464 width=50) (actual time=316.049..26985.344 rows=98 loops=1)
              Filter: ((deleted_at IS NULL) AND ((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange @> lower(time_range)))
              Rows Removed by Filter: 31592898
        ->  Index Scan using calendars_pkey on calendars  (cost=0.42..7.50 rows=1 width=48) (actual time=0.030..0.031 rows=1 loops=98)
              Index Cond: (events.calendar_id = (id)::text)
Planning time: 1.468 ms
Execution time: 26989.370 ms

And here is the explain analyze with the events.deleted_at part of the query removed:

HashAggregate  (cost=7487382.57..7487398.33 rows=1576 width=48) (actual time=23880.466..23880.503 rows=115 loops=1)
  Group Key: events.id, calendars.user_id
  ->  Nested Loop Left Join  (cost=0.42..7487374.69 rows=1576 width=48) (actual time=16.612..23880.114 rows=115 loops=1)
        ->  Seq Scan on events  (cost=0.00..7475629.43 rows=1576 width=50) (actual time=16.576..23876.844 rows=115 loops=1)
              Filter: (((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange @> lower(time_range)))
              Rows Removed by Filter: 31592881
        ->  Index Scan using calendars_pkey on calendars  (cost=0.42..7.44 rows=1 width=48) (actual time=0.022..0.023 rows=1 loops=115)
              Index Cond: (events.calendar_id = (id)::text)
Planning time: 0.372 ms
Execution time: 23880.571 ms

I added the index on the status column. Everything else was already there, and I'm unsure how to proceed. Any suggestions on how to get the query time down to a more manageable number?

9
  • 1
    The structures of the events and the calendars tables would be helpful. If you could post the explain analyze output instead of just the explain, that could help too. Commented Apr 22, 2016 at 0:48
  • @e4c5 Thanks. I added the explain analyze. I can add the structure later. I mentioned the fields I'm querying on are TSTZRANGE and JSONB. deleted_at is just a timestamp Commented Apr 22, 2016 at 0:59
  • Not sure you need the @> lower(time_range); wouldn't an "overlaps" do the same? where ... @> time_range - that might use the gist index on that column (see the sketch after these comments). Also, which of the conditions removes the most rows: the one on status, the one on time_range, or the one on deleted_at? Commented Apr 22, 2016 at 5:10
  • @a_horse_with_no_name removing the lower actually made the query slower. Status reduces the rows the most (many thousands). Commented Apr 22, 2016 at 20:41
  • Add an index to the calendar_id in the events table and make sure it's indexed in the calendar table. Commented Apr 25, 2016 at 18:53
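
To make the @> time_range suggestion from the comments concrete, a GiST-friendly variant might look roughly like this (a sketch only; as noted above it turned out to be slower in practice, and containment/overlap are not exactly equivalent to the original lower(time_range) check):

-- Sketch: filter on the range column itself so the gist index on time_range can be used.
-- tstzrange(...) @> time_range keeps only events wholly inside the ten-minute window;
-- time_range && tstzrange(...) would keep any event that merely overlaps it.
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" IS NULL
  AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') @> time_range
  AND ("status" IS NULL OR (status->>'pre_processed') IS NULL)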

2 Answers

5
+50

The B-tree index on lower(time_range) can only be used for conditions involving the <, <=, =, >= and > operators. The @> operator may rely on these internally, but as far as the planner is concerned, this range check operation is a black box, and so it can't make use of the index.

You will need to reformulate your condition in terms of the B-tree operators, i.e.:

lower(time_range) >= '2016-04-21T12:12:36-07:00' AND lower(time_range) < '2016-04-21T12:22:36-07:00' 
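
Plugged into the original query, the rewritten predicate would look roughly like this (a sketch; everything else is unchanged):

-- Both comparisons are plain B-tree operators on lower(time_range),
-- so the planner can now use events_lower_time_range_index.
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" IS NULL
  AND lower(time_range) >= '2016-04-21T12:12:36-07:00'
  AND lower(time_range) <  '2016-04-21T12:22:36-07:00'
  AND ("status" IS NULL OR (status->>'pre_processed') IS NULL)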

4 Comments

You're the man. 68ms with that little change. I suspected it was something to do with how I was querying on the time range but I'm pretty out of my element here. It's now using an index scan Index Scan using events_lower_time_range_index on events (cost=0.57..2177.94 rows=5 width=50) (actual time=0.019..0.186 rows=98 loops=1). It says I can award you the bounty in 2 hours. Thanks!
The first time I change the time range, it takes considerably and variably longer (10s, 30s, 16s) but all the subsequent queries on that time range are <1s. Are there other tweaks that can help avoid this or is it just how the Postgres internals work?
It's probably due to caching; the first query has to hit the disk, but subsequent queries find all of the data in RAM. Increasing the cache size (by raising shared_buffers and/or adding more RAM) might help somewhat, but there's no silver bullet.
Other than that, you could try to reduce the amount of data which the query needs to consider, by creating either a partial index or a table partition based on your query's status and deleted_at constraints.
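
A partial index along those lines might look roughly like this (a sketch only; the index name is invented, and it assumes the query keeps filtering on deleted_at and status exactly as above):

-- Hypothetical partial index: only rows that can ever satisfy the query's
-- deleted_at/status filters are indexed, keeping the index small.
CREATE INDEX events_active_lower_time_range_idx
    ON events (lower(time_range))
    WHERE deleted_at IS NULL
      AND (status IS NULL OR status->>'pre_processed' IS NULL);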
0

So add an index for events.deleted_at to get rid of the nasty sequential scan. What does it look like after that?
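
That suggestion would look roughly like this (a sketch only; the index name is invented):

-- A plain B-tree on deleted_at lets the planner consider an index scan for the
-- deleted_at IS NULL filter instead of a full sequential scan, assuming that
-- filter is selective enough to be worth it.
CREATE INDEX events_deleted_at_idx ON events (deleted_at);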

1 Comment

It takes a while to add an index to this table, so in the meantime I just removed the WHERE "events"."deleted_at" is null clause, and the query still returned in the same amount of time. I posted the explain analyze output of that. Looks like the sequential scan is still happening.
