0

I have the following tables:

The main lead table with close to 500M rows:

create table lead ( id integer, client_id integer, insert_date integer (a transformed date that looks like 20201231) ) create index lead_id_index on lead (id); create index lead_insert_date_index on lead (insert_date) include (id, client_id); create index lead_client_id_index on lead (client_id) include (id, insert_date); 

And then the other tables

create table last_activity_with_client ( lead_id integer, last_activity timestamp, last_modified timestamp, client_id integer ); create index last_activity_with_client_client_id_index on last_activity_with_client (client_id) include (lead_id, last_activity); create index last_activity_with_client_last_activity_index on last_activity_with_client (last_activity desc); create index last_activity_with_client_lead_id_client_id_index on last_activity_with_client (lead_id, client_id); create table lead_last_response_time ( lead_id integer, last_response_time timestamp, last_modified timestamp ); create index lead_last_response_time_last_response_time_index on lead_last_response_time (last_response_time desc); create index lead_last_response_time_lead_id_index on lead_last_response_time (lead_id); create table lead_last_response_time ( lead_id integer, last_response_time timestamp, last_modified timestamp ); create index lead_last_response_time_last_response_time_index on lead_last_response_time (last_response_time desc); create index lead_last_response_time_lead_id_index on lead_last_response_time (lead_id); create table date_dimensions ( key integer, (a transformed date that looks like 20201231) date date, description varchar(256), day smallint, month smallint, quarter char(2), year smallint past_30 boolean ); create index date_dimensions_key_index on date_dimensions (key); 

I try running the following query on different client_id and it is always slowed down by the bitmap index scan on client_id in the lead_table

EXPLAIN ANALYZE with TempResult AS ( select DISTINCT lead.id AS lead_id, last_activity_join.last_activity, lead_last_response_time.last_response_time from lead left join (select * from last_activity_with_client where client_id = 13189) last_activity_join on lead.id = last_activity_join.lead_id left join lead_last_response_time lead_last_response_time on lead.id = lead_last_response_time.lead_id join date_dimensions date_dimensions on lead.insert_date = date_dimensions.key where (date_dimensions.past_30 = true) and (lead.client_id in (13189)) ), TempCount AS ( select COUNT(*) as total_rows fromt TempResult ) select * from TempResult, TempCount order by last_response_time desc NULLS LAST limit 25 offset 1; 

A few results: explain analyze result 2

As you can see, it's using the index but it's quite slow. Always more than 50 seconds. What can I do to make this query run faster? I have some freedom to change the query and the tables too.

10
  • You don't use TempCount, so you can start by just getting rid of that. Commented Sep 16, 2020 at 21:20
  • Your result has to be for client_id in (13189) (or some other particular client_id) or you're doing that for testing purposes? Commented Sep 16, 2020 at 21:24
  • @GordonLinoff edited the query Commented Sep 16, 2020 at 21:43
  • @StefanDzalev All of the queries are always filtered on the client_id. It's not just for testing purposes. Commented Sep 16, 2020 at 21:43
  • 1
    Could you edit the question and qualify all columns with the table name? It is very hard to read the query otherwise. Commented Sep 17, 2020 at 2:48

2 Answers 2

1
create index lead_client_id_index on lead (client_id) include (id, insert_date); 

For efficient usage in this query, this should instead be on lead (client_id, insert_date, id). Using the INCLUDE just makes the index less useful, without accomplishing anything. I think that the only good reasons to use INCLUDE is if the index is unique on a subset of columns, or if the column being INCLUDEd is of a type which doesn't support btree operations.

But even the existing index does seem surprisingly slow. I wonder if there something wrong with it, like fragmentation, or maybe it is sitting on a damaged part of the disk and reads have to retried repeatedly before succeeding.

Sign up to request clarification or add additional context in comments.

1 Comment

I created this index and ran VACUUM. For date ranges smaller than 30 days it gives under 1s. But for larger time ranges it deteriorates quickly and reaches 15+ seconds.
0
Try this: EXPLAIN ANALYZE with TempResult AS ( select DISTINCT lead.id AS lead_id, last_activity, last_response_time from ( select key from date_dimensions where past_30 = true ) date_dimensions join (select id, insert_date from lead where client_id = 13189 ) lead on lead.insert_date = date_dimensions.key left join ( select lead_id, last_activity from last_activity_with_client where client_id = 13189 ) last_activity_join on lead.id = last_activity_join.lead_id left join lead_last_response_time lead_last_response_time on lead.id = lead_last_response_time.lead_id ), TempCount AS ( select COUNT(*) as total_rows from TempResult ) select * from TempResult, TempCount order by last_response_time desc NULLS LAST limit 25 offset 1; 

or this:

 EXPLAIN ANALYZE with TempResult AS ( select DISTINCT lead.id AS lead_id, last_activity, last_response_time from date_dimensions date_dimensions join (select id, insert_date from lead where client_id = 13189 ) lead on lead.insert_date = date_dimensions.key left join ( select lead_id, last_activity from last_activity_with_client where client_id = 13189 ) last_activity_join on lead.id = last_activity_join.lead_id left join lead_last_response_time lead_last_response_time on lead.id = lead_last_response_time.lead_id where date_dimensions.past_30 = true ), TempCount AS ( select COUNT(*) as total_rows from TempResult ) select * from TempResult, TempCount order by last_response_time desc NULLS LAST limit 25 offset 1; 

3 Comments

Thanks! I've tried this approach but a simple SELECT * FROM lead where client_id=12345 itself takes a long time even with index usage. Sometimes it's bitmap index scan and sometimes it's parallel seq scan depending on how much data the client holds. But in all cases, it take way too long.
I suggest you check whether there is fragmentation on the client_id index. If it is fragmented, you will have to reorganize it, which will make your query faster. stackoverflow.com/questions/52444912/…
Thanks. I followed your post and ran VACUUM FULL but I still see the same response times.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.