
I have a query which is taking more than 15 mins to execute in Redshift. This query is being triggered using AWS Lambda, which has a timeout of 15 mins. So I wanted to check whether there is a way to optimize the query so it returns results faster.

Here is my SQL query:

    insert into test.qa_locked
    select
        '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481',
        'ABC-013505',
        'ABC-013505-2-2020',
        user_id,
        cast(TIMEOFDAY() as timestamp)
    from (
        select user_id
        from ( select contact_id from test.qa_locked )
        where contact_cnt <= 1
    )

Here is the plan:

XN Subquery Scan "*SELECT*" (cost=1000028198481.69..1000028198481.75 rows=1 width=218)
-> XN Subquery Scan derived_table1 (cost=1000028198481.69..1000028198481.73 rows=1 width=210)
-> XN Window (cost=1000028198481.69..1000028198481.71 rows=1 width=56)
-> XN Sort (cost=1000028198481.69..1000028198481.70 rows=1 width=56)
-> XN Network (cost=1645148.05..28198481.68 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=1645148.05..28198481.68 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.76..28091814.71 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.09..7491814.01 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.68..6805146.91 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.16..6438479.71 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.65..6071812.51 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.29..6071812.13 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.96..6071811.77 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_NONE (cost=1645144.50..5598477.96 rows=1 width=56)
-> XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.47..5598477.91 rows=1 width=84)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=1645142.59..5078476.00 rows=1 width=84)
-> XN Hash NOT IN Join DS_BCAST_INNER (cost=1645142.57..4065142.63 rows=1 width=600)
-> XN Hash Left Join DS_DIST_BOTH (cost=1201145.21..3221145.24 rows=1 width=1116)
-> XN Seq Scan on contacts xa (cost=1201145.21..1201145.21 rows=1 width=640)
-> XN Hash (cost=0.00..0.00 rows=1 width=556)
-> XN Seq Scan on accounts ya (cost=0.00..0.00 rows=1 width=556)
-> XN Hash (cost=443997.35..443997.35 rows=1 width=32)
-> XN Subquery Scan "IN_subquery" (cost=23989.76..443997.35 rows=1 width=32)
-> XN Unique (cost=23989.76..443997.34 rows=1 width=516)
-> XN Nested Loop DS_BCAST_INNER (cost=23989.76..443997.34 rows=1 width=516)
-> XN Seq Scan on accounts con (cost=0.00..0.00 rows=1 width=516)
-> XN Hash NOT IN Join DS_DIST_OUTER (cost=23989.76..83997.32 rows=1 width=26)
-> XN Seq Scan on campaign_exclusion_list cam (cost=0.00..7.53 rows=1 width=26)
-> XN Hash (cost=23989.75..23989.75 rows=1 width=32)
-> XN Subquery Scan "IN_subquery" (cost=0.00..23989.75 rows=1 width=32)
-> XN Unique (cost=0.00..23989.74 rows=1 width=18)
-> XN Seq Scan on campaign_inclusion_list (cost=0.00..23989.74 rows=1 width=18)
-> XN Hash (cost=0.01..0.01 rows=1 width=516)
-> XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=516)
-> XN Unique (cost=0.00..0.00 rows=1 width=516)
-> XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=516)
-> XN Hash (cost=1.88..1.88 rows=1 width=210)
-> XN Seq Scan on bh_email_open_clicks (cost=0.00..1.88 rows=1 width=210)
-> XN Hash (cost=0.01..0.01 rows=1 width=210)
-> XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=210)
-> XN Unique (cost=0.00..0.00 rows=1 width=28)
-> XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=28)
-> XN Hash (cost=0.45..0.45 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.45 rows=1 width=210)
-> XN Hash (cost=0.32..0.32 rows=1 width=402)
-> XN Subquery Scan "IN_subquery" (cost=0.30..0.32 rows=1 width=402)
-> XN HashAggregate (cost=0.30..0.31 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.30 rows=1 width=402)
-> XN Hash (cost=0.35..0.35 rows=1 width=402)
-> XN Subquery Scan "IN_subquery" (cost=0.33..0.35 rows=1 width=402)
-> XN HashAggregate (cost=0.33..0.34 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.33 rows=1 width=402)
-> XN Hash (cost=0.50..0.50 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
-> XN Hash (cost=0.50..0.50 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
-> XN Hash (cost=0.40..0.40 rows=1 width=402)
-> XN Seq Scan on campaign_extraction_history (cost=0.00..0.40 rows=1 width=402)
-> XN Hash (cost=0.30..0.30 rows=30 width=402)
-> XN Seq Scan on ce_locked_records_tb (cost=0.00..0.30 rows=30 width=402)
-> XN Hash (cost=0.27..0.27 rows=1 width=210)
-> XN Subquery Scan "IN_subquery" (cost=0.26..0.27 rows=1 width=210)
-> XN HashAggregate (cost=0.26..0.26 rows=1 width=210)
-> XN Seq Scan on bh_leads (cost=0.00..0.25 rows=1 width=210)

Please suggest if there are any ways to optimize this query.

  • Was this query written by hand, or is it generated by some BI tool? The first thing to notice is that it is very complex, with all those sub-selects. It would be great to get rid of them. The other thing is that it has 26 NOT IN operators, which are notoriously bad for efficiency. A NOT IN requires selecting a whole lot of data, then checking that the desired value isn't in each returned row. That makes things very slow in any database. They also seem to be responsible for most of the cost calculations. There are 31 sub-queries (SELECT) in that query, too. Commented May 9, 2020 at 2:03
  • @JohnRotenstein the query is generated using a stored procedure. All the code is written by hand. What would be a more efficient alternative to NOT IN? Commented May 9, 2020 at 4:22
  • A NOT IN can often be replaced by a LEFT OUTER JOIN. Then, confirm that a joined field is NULL. There is quite a bit of discussion on the Internet about this, eg: SQL performance on LEFT OUTER JOIN vs NOT EXISTS and Consider using NOT EXISTS instead of NOT IN with a subquery - Redgate Software and NOT IN vs. NOT EXISTS vs. OUTER APPLY vs. OUTER JOIN. Commented May 9, 2020 at 4:34
  • @JohnRotenstein are you able to rewrite my query to make it fast? Commented May 9, 2020 at 4:35
  • The EXPLAIN plan shows a cost figure. You should concentrate on reducing or removing the high costs. In addition, the high costs might be caused by the DS_DIST_INNER and DS_DIST_BOTH activities. These can typically be avoided by having tables share the same DISTKEY or by replicating tables on all nodes. See: Evaluating the query plan - Amazon Redshift Commented May 9, 2020 at 11:19

1 Answer


This feels like a query that has been added to time and time again, with a lot of code repetition and many unnecessary table scans.

Understand that my main experience is with MSSQL and not Redshift, but for the most part the same principles will apply.

    ( lower(xa.primary_function) in (
          select lower(param_val)
          from ce_campaign_spec_tb
          where job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
            and param = 'primary_function'
            and relation_id = 4
      )
      and lower(xa.role) in (
          select lower(param_val)
          from ce_campaign_spec_tb
          where job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
            and param = 'role'
            and relation_id = 4
      )
      and lower(xa.title) in (
          select lower(title)
          from contacts con
          inner join ce_campaign_spec_tb camp
            on lower(con.title) ilike '%' || trim( both ' ' from camp.param_val ) || '%'
          where job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
            and param = 'title'
            and relation_id = 4
      )
    )

Without knowing what this does, you seem to be repeating this block of code five times, with the only change being the relation_id. You start with id 4, then 2, then 1, then 3 and then 5, but other than the id nothing seems to change. There might be subtle differences, but as written you scan the tables five separate times instead of once with a single predicate. Depending on the size of the tables, this can be a fair amount of data you are scanning.
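One way to avoid the repeated scans is to read all five relation_ids in a single pass and do the per-id matching afterwards. A rough, untested sketch using the table and column names from the question (the title match via ILIKE would still need its own handling, and I am assuming the spec table holds one row per relation_id and param):

    -- Sketch: scan ce_campaign_spec_tb once for all five relation_ids
    -- instead of five near-identical subqueries.
    select relation_id, param, lower(param_val) as param_val
    from ce_campaign_spec_tb
    where job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
      and param in ('primary_function', 'role', 'title')
      and relation_id between 1 and 5

You could materialize this once (for example in a temp table) and join contacts against it, keeping the per-relation_id AND logic in the join conditions.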

A few rows further:

    and xa.contact_id not in (
        select contact_id
        from bh_leads
        where (CURRENT_DATE - creation_date :: date) <= 60
          and UPPER(LOB) = 'ABC'
          and agency_id = '1002'
    )
    and xa.contact_id not in (
        select contact_id
        from bh_leads
        where (CURRENT_DATE - creation_date :: date) <= 60
          and UPPER(LOB) = 'ABC'
          and sponsor_id = '8306'
    )

Again, two table scans for virtually the same data; the only difference is that one checks sponsor_id for a value and the other agency_id. This could have been done in a single statement instead of two.
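Since excluding rows found in either subquery is the same as excluding rows found in their union, the two scans can be collapsed into one by combining the differing predicates with OR. An untested sketch with the names from the question:

    -- One pass over bh_leads covering both exclusions
    and xa.contact_id not in (
        select contact_id
        from bh_leads
        where (CURRENT_DATE - creation_date :: date) <= 60
          and UPPER(LOB) = 'ABC'
          and ( agency_id = '1002' or sponsor_id = '8306' )
    )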

Further down:

and email_id not in ( select distinct email_id from contacts where is_email_suppressed = 1 ) 

Earlier you referenced contacts (xa) and put this as a predicate in the where clause:

and xa.is_email_suppressed = 0 

Without knowing the exact schema of the tables in question I cannot be sure, but they seem to do largely the same thing.

Also, see the Redshift documentation here: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

It seems you can create temporary tables for the duration of a single session. Most of the subqueries could be prepared up front so you can join on the result set. If you first prepare, for example, a temporary result set for the campaign_extraction_history table with valid results, you can replace the following predicates with a single left join:

    AND contact_id NOT IN (
        select contact_id
        from campaign_extraction_history
        where sf_oms_campaign_id = 'ABC-013505-2-2020'
          and sf_campaign_id = 'ABC-013505'
          and (CURRENT_DATE - creation_date :: date) < 1
          and channel = 'BOTH'
          and ( UPPER(STATUS) = 'EXTRACTED' OR UPPER(STATUS) = 'LAUNCHED' OR UPPER(STATUS) = 'CONFIRMED' )
    )
    AND contact_id NOT IN (
        select contact_id
        from campaign_extraction_history
        where creation_date :: date = CURRENT_DATE
          and channel = 'BOTH'
          and ( UPPER(STATUS) = 'EXTRACTED' OR UPPER(STATUS) = 'LAUNCHED' OR UPPER(STATUS) = 'CONFIRMED' )
        group by contact_id
        having count(*) > 10
    )
    AND contact_id NOT IN (
        select contact_id
        from campaign_extraction_history
        where sf_campaign_id = 'ABC-013505'
          and channel = 'BOTH'
          and ( UPPER(STATUS) = 'EXTRACTED' OR UPPER(STATUS) = 'LAUNCHED' OR UPPER(STATUS) = 'CONFIRMED' )
        group by contact_id
        having count(*) >= 3
    )
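For example, an untested sketch (the temp table name is made up, and I am assuming the three predicates above are the complete exclusion logic): materialize the excluded contact_ids once, then anti-join with a left join instead of three NOT IN subqueries.

    -- Build the set of excluded contacts in one place.
    -- union (not union all) deduplicates, matching NOT IN semantics.
    create temp table excluded_contacts as
    select contact_id
    from campaign_extraction_history
    where sf_oms_campaign_id = 'ABC-013505-2-2020'
      and sf_campaign_id = 'ABC-013505'
      and (CURRENT_DATE - creation_date :: date) < 1
      and channel = 'BOTH'
      and UPPER(STATUS) in ('EXTRACTED', 'LAUNCHED', 'CONFIRMED')
    union
    select contact_id
    from campaign_extraction_history
    where creation_date :: date = CURRENT_DATE
      and channel = 'BOTH'
      and UPPER(STATUS) in ('EXTRACTED', 'LAUNCHED', 'CONFIRMED')
    group by contact_id
    having count(*) > 10
    union
    select contact_id
    from campaign_extraction_history
    where sf_campaign_id = 'ABC-013505'
      and channel = 'BOTH'
      and UPPER(STATUS) in ('EXTRACTED', 'LAUNCHED', 'CONFIRMED')
    group by contact_id
    having count(*) >= 3;

Then, in the main query, replace the three NOT IN predicates with the LEFT JOIN / IS NULL pattern mentioned in the comments:

    left join excluded_contacts ex
      on ex.contact_id = xa.contact_id
    ...
    where ex.contact_id is null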

There are probably more places where you can combine queries and read data from tables just a single time. For example, you exclude many email_id values, but in various places across different statements and subqueries. These can most likely be done in a single statement.

Maybe the best way to improve performance is to ask yourself what the query is trying to do and to exclude, and then rewrite the entire query from scratch. This might be a fair amount of work, but could end up being faster in the long run.


4 Comments

For your first point - I agree that I have repeated the same block 5 times, but that is because I want the conditions for the same relation_id combined with AND. For example, relation_id = 1 should have its conditions ANDed together; then I take the next one, separated by OR. How do I modify this?
Replace the 'and relation_id = 4' with 'and relation_id between 1 and 5', or substitute 1 and 5 with a min and max value from another query. If the valid IDs aren't always a 1 increment, you can also use CTEs or temp tables to build up the valid IDs (or any other subselection) for use within the transaction.
I need to compare values of primary_function, role, function for same relation_id. It cannot be between 1 and 5.
If you use relation_id between 1 and 5 you get 5 results; then you find the single row where all columns match (primary_function, role and function).
