
Our Redshift queries are taking too long to execute. Some queries keep running indefinitely or get aborted after some time.

I have very limited knowledge of Redshift, and I am finding it difficult to understand the query plan well enough to optimise the query.

Below is one of the queries we run, along with its query plan. The query takes 20 seconds to execute.

Query

SELECT date_trunc('day', ti) AS date,
       COUNT(DISTINCT deviceID) AS count
FROM live_events
WHERE brandID = 3927
  AND ti >= '2017-08-02T00:00:00+00:00'
  AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY 1

Primary key
brandID

Interleaved Sort Keys
We have set the following columns as interleaved sort keys:
brandID, ti, event_name

Query Plan

(The query plan was attached as screenshots; it is not reproduced here.)

5 Comments
  • You mentioned "queries are taking too much time to execute" and gave an example of 20 seconds, but what are you aiming for (i.e. what would be an acceptable time for this query)? Also, what is the distribution key of the live_events table? Commented Sep 10, 2017 at 23:12
  • Also, how many nodes and what node types are you using? Commented Sep 11, 2017 at 1:02
  • @Nathan I am expecting it to take less than a second. As mentioned in the question, we have set "brandID" as the primary key and brandID, ti, event_name as interleaved sort keys. No other keys have been defined. Commented Sep 11, 2017 at 4:25
  • @JohnRotenstein We are using a single node of type dc1.large Commented Sep 11, 2017 at 4:27
  • Primary Key and Distribution Key are two different properties in Redshift - the primary key is really just a query hint, but the distribution key defines how the data is physically distributed across the Redshift nodes and is critical to performance. Commented Sep 11, 2017 at 20:53

3 Answers


You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.

Here are some ways you could improve performance:

More nodes

Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.

SORTKEY

For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.

For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.

Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of (brandID, ti) or (ti, brandID). It will be much more efficient.

A SORTKEY is typically a date column, since dates are often found in WHERE clauses and the table stays sorted automatically if data is always appended in time order.

The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
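For illustration, here is a minimal sketch of the table recreated with a compound sort key. The column list and types are assumptions (the full DDL was not shared); only the columns mentioned in the question are included:

-- Hypothetical DDL: column types are assumed.
CREATE TABLE live_events_sorted (
    brandID    BIGINT,
    deviceID   VARCHAR(64),
    event_name VARCHAR(128),
    ti         TIMESTAMP
)
COMPOUND SORTKEY (brandID, ti);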

DISTKEY

The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.

Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (It is also the default distribution style if none is specified.) Alternatively, set the DISTKEY to a field not shown -- but certainly don't use brandID as the DISTKEY, otherwise only one slice will participate in the query shown.
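As a sketch, the distribution style is declared alongside the sort key in the DDL. This repeats the hypothetical table from above with an explicit DISTSTYLE EVEN:

-- Hypothetical DDL: column types are assumed.
CREATE TABLE live_events_new (
    brandID    BIGINT,
    deviceID   VARCHAR(64),
    event_name VARCHAR(128),
    ti         TIMESTAMP
)
DISTSTYLE EVEN                    -- rows are spread round-robin across slices
COMPOUND SORTKEY (brandID, ti);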

VACUUM

VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
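For example (ANALYZE is not mentioned above, but it is the usual companion step to refresh the planner's statistics):

VACUUM FULL live_events;   -- re-sorts rows and reclaims space from deleted rows
ANALYZE live_events;       -- refreshes table statistics for the query planner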

Experiment!

Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform best. Then, test again in 3 months to see whether your queries or data have changed enough to make other settings more efficient.


4 Comments

Thanks John. We will try your suggestions. To change the sorting of the table, we will have to recreate it. Can you suggest the best way to migrate the 125 million rows into a new table, and how long the migration will take?
Yes. The best way is to create a new table with your preferred DISTKEY and SORTKEY, then do an INSERT INTO new-table SELECT * FROM old-table to copy the data across. You can then perform tests to compare the speed without impacting your original table (see the sketch after these comments).
How much time will it take to copy this amount of data?
It depends upon the size of your table. It will need to sort the data, so that will take some processing. The only way to know is to try it!
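A sketch of the deep-copy approach described in these comments; live_events_new is a hypothetical table created beforehand with the desired DISTSTYLE and SORTKEY:

-- Copy all rows into the new table (rows are sorted on insert):
INSERT INTO live_events_new SELECT * FROM live_events;

-- After verifying the data and comparing query speeds, swap the names:
ALTER TABLE live_events RENAME TO live_events_old;
ALTER TABLE live_events_new RENAME TO live_events;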

Sometimes the issue can be caused by locks acquired by other processes. You can refer to: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
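In short, the linked article suggests finding the session that holds the lock and, if it is stale, terminating it. A minimal sketch using Redshift's system tables (the pid value is a placeholder):

-- List current table locks and the sessions holding them:
SELECT table_id, lock_owner_pid, lock_status
FROM stv_locks;

-- If a stale session is blocking your query, terminate it by its pid:
SELECT pg_terminate_backend(12345);  -- 12345 is a placeholder pid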

1 Comment

Please don't put only a link in your answers. These can go bad and then your possibly useful answer will be lost. So add the relevant context/steps to your answer and then link to the source where you got it from.

I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.

-- This date operation is expensive
date_trunc('day', ti) AS date

If you have the luxury, store the date in the format you need in an additional column.
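A sketch of that approach, where event_date is a hypothetical pre-computed column. Ideally it is populated at load time; the one-off UPDATE below is itself expensive on a large table:

ALTER TABLE live_events ADD COLUMN event_date DATE;
UPDATE live_events SET event_date = TRUNC(ti);  -- TRUNC(timestamp) returns the date

-- The original query can then group on the stored column:
SELECT event_date, COUNT(DISTINCT deviceID) AS count
FROM live_events
WHERE brandID = 3927
  AND ti >= '2017-08-02T00:00:00+00:00'
  AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY event_date;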

