1

I have 2 tables: mailings and form_responses. The form_responses table has a foreign key with the mailings table, so a mailing has one-to-many form_responses. What I want to do is calculate mailing stats: per mailing get total form_responses and when the last form_response was received. I want to return the first 20 mailings that have last received a form_response. But when it has to query 50k+ rows, the query gets really slow.

mailings (~100k rows)

 Column | Type | Modifiers --------------+-----------------------------+--------------------------------------------------- id | integer | pkey project_id | integer | not null deleted | boolean | active | boolean | created_at | timestamp with time zone | Indexes: "mailings_pkey" PRIMARY KEY, btree (id) "mailing_on_project_id_where_not_deleted_and_active" btree (project_id) (WHERE deleted IS NOT true AND active IS TRUE) + individual indexes on almost all columns 

form_responses (~6m rows)

 Column | Type | Modifiers --------------+-----------------------------+--------------------------------------------------- id | integer | pkey project_id | integer | not null mailing_id | integer | deleted | boolean | is_unanswered| boolean | created_at | timestamp with time zone | Indexes: "form_responses_pkey" PRIMARY KEY, btree (id) "form_responses_on_pid_cr_time_where_not_del_not_is_unans" btree (project_id, created_time) (WHERE deleted IS NOT true AND is_uanswered IS NOT TRUE) + individual indexes on almost all columns 

I want to get the mailing stats of a project (id=1), within a date range. I only want the stats of mailings that are not deleted and active and only count non deleted responses.

So this gives me the following query

EXPLAIN (ANALYZE, BUFFERS) SELECT count(DISTINCT form_responses.id) AS total, max(form_responses.created_time) AS form_response_last_received, (SELECT mailings.name FROM mailings WHERE id = form_responses.mailing_id) AS name, form_responses.mailing_id AS mailing_id FROM form_responses JOIN mailings ON mailings.id = form_responses.mailing_id WHERE form_responses.deleted IS NOT true AND form_responses.is_unanswered IS NOT true AND form_responses.created_time >= '2017-08-02 00:00:00' AND form_responses.created_time <= '2018-01-29 00:00:00' AND form_responses.project_id = 1 AND mailings.deleted IS NOT true AND mailings.project_id = 1 AND mailings.active IS true GROUP BY form_responses.mailing_id ORDER BY form_response_last_received DESC LIMIT 20 OFFSET 0 

This gives me the following output:

Limit (cost=308997.69..308997.74 rows=20 width=16) (actual time=7762.554..7762.565 rows=20 loops=1) Buffers: shared hit=74535 read=13513 dirtied=1 -> Sort (cost=308997.69..308998.19 rows=197 width=16) (actual time=7762.552..7762.557 rows=20 loops=1) Sort Key: (max(form_responses.created_time)) Sort Method: top-N heapsort Memory: 26kB Buffers: shared hit=74535 read=13513 dirtied=1 -> GroupAggregate (cost=307039.18..308992.45 rows=197 width=16) (actual time=7673.523..7762.278 rows=686 loops=1) Group Key: form_responses.mailing_id Buffers: shared hit=74535 read=13513 dirtied=1 -> Sort (cost=307039.18..307111.58 rows=28961 width=16) (actual time=7673.277..7702.783 rows=137542 loops=1) Sort Key: form_responses.mailing_id Sort Method: quicksort Memory: 12592kB Buffers: shared hit=71790 read=13512 dirtied=1 -> Hash Join (cost=7376.42..304892.90 rows=28961 width=16) (actual time=44.576..7609.818 rows=137542 loops=1) Hash Cond: (form_responses.mailing_id = mailings.id) Buffers: shared hit=71790 read=13512 dirtied=1 -> Bitmap Heap Scan on form_responses (cost=3815.27..299862.55 rows=134811 width=16) (actual time=41.439..7499.305 rows=137543 loops=1) Recheck Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone) AND (deleted IS NOT TRUE) AND (is_unanswered IS NOT TRUE)) Heap Blocks: exact=84014 Buffers: shared hit=71032 read=13512 -> Bitmap Index Scan on form_responses_on_pid_cr_time_where_not_del_not_is_unans (cost=0.00..3781.57 rows=134811 width=0) (actual time=23.892..23.892 rows=137543 loops=1) Index Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone)) Buffers: shared hit=530 -> Hash (cost=3286.82..3286.82 rows=21946 width=4) (actual time=3.093..3.093 rows=2866 loops=1) Buckets: 4096 Batches: 1 Memory Usage: 101kB Buffers: shared hit=758 dirtied=1 -> Bitmap Heap Scan on mailings (cost=874.50..3286.82 rows=21946 width=4) (actual time=0.403..2.145 rows=2866 loops=1) Recheck Cond: ((project_id = 1) AND (deleted IS NOT TRUE) AND (active IS TRUE)) Heap Blocks: exact=732 Buffers: shared hit=758 dirtied=1 -> Bitmap Index Scan on mailing_on_project_id_where_not_deleted_and_active (cost=0.00..869.01 rows=21946 width=0) (actual time=0.315..0.315 rows=2866 loops=1) Index Cond: (project_id = 1) Buffers: shared hit=26 SubPlan 1 -> Index Scan using mailings_pkey on mailings mailings_1 (cost=0.42..8.44 rows=1 width=26) (actual time=0.003..0.003 rows=1 loops=686) Index Cond: (id = form_responses.mailing_id) Buffers: shared hit=2745 read=1 Planning time: 0.865 ms Execution time: 7764.951 ms 

In this example it queries ~135k rows, this can go up to almost ~600k. As you can see 99% of the time is spend on the Bitmap heap scan. I run a db.t2.large RDS database at AWS with 100gb (SSD) of storage, which runs postgres 9.4.7

  • Is there anyway to avoid doing the bitmap heap scan?
  • Or a way to improve the speed of this?
  • Can I create a covering index, so it doesn't have to read the form responses table itself?
  • Should I cluster the form_responses table on project_id and/or mailing_id?
  • Should I upgrade my hardware?
  • Am I asking too much of my Database?

1 Answer 1

0

You can avoid the bitmap scan by set enable_bitmapscan to off. Then you can see what it chooses instead, and how that choice performs. This is generally an exploratory tool, it is usually not a good idea to set it this way globally in a production system. But you can set just in one session, so there is no problem with doing the test on a production system.

If you run the same query a second time with the same parameters, is it much faster? If so you can improve the performance by pre-warming the form_responses table into memory (for example, using the pg_prewarm extension), assuming that your memory is large enough to hold it all. You can also try increasing effective_io_concurrency, if RDS will let you.

If you make an index just like mailing_on_project_id_where_not_deleted_and_active but also including mailing_id and id at the end of the column list, then it could act like a covering index and you might get it to use an index-only scan. You would have to keep your table well-vacuumed for this to be maximally effective. The default level of vacuuming might not be aggressive enough, depending on the pattern in which your data is updated.

PostgreSQL clusters tables based on an index, not based on columns. Clustering on mailing_on_project_id_where_not_deleted_and_active would likely help this particular query.

Based on the answers and results of the above, you might want to upgrade your hardware. If form_responses doesn't fit in RAM, more RAM would help. If that is not feasible and it needs to be read from disk, then upgrading from HDD to SSD would help (unless RDS is already on SSD, in which case you shouldn't have a problem in the first place unless it is getting throttled).

This query might benefit in some parallel query implementations introduced in PostgreSQL 9.6 and improved in 10, but I don't know if RDS supports those.

1
  • Thanks, I've got it working! The covering index did the trick. I added all columns I needed in a new index. I didn't use a WHERE clause for the index because this doesn't work for postgres<9.6 Thanks for your help! Commented Feb 6, 2018 at 16:46

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.