Joe's answer is a great, neat solution, but I feel one should always consider how the data is distributed and sorted when working in Redshift: it can have a dramatic impact on performance.
Building on Joe's answer, I will first materialise the sample events. In practice the events would already be in a table.
    drop table if exists public.temporary_events;
    create table public.temporary_events AS
    select ts::timestamp as ts
          ,lead
    from (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
    ;
Now run explain:
    explain
    WITH time_dimension AS (
        SELECT dtm
              ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
        FROM
            /* Create a series of timestamps, 1 per second, working backwards from NOW(). */
            /* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
            (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
             FROM
                 /* Generate a number sequence of 100,000 values from a large internal table */
                 (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
    SELECT dtm_half_hour
          ,COUNT(CASE WHEN lead THEN 1 END)
    FROM time_dimension td
    LEFT JOIN public.temporary_events e ON td.dtm = e.ts
    WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
    GROUP BY 1
    -- ORDER BY 1  (left out just to simplify the job a little)
The output is:
    XN HashAggregate (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
      ->  XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
            Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
            Inner Dist Key: e."ts"
            Hash Cond: ("outer"."?column2?" = "inner"."ts")
            ->  XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
                  Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
                  ->  XN Limit (cost=0.00..1.95 rows=130 width=0)
                        ->  XN Window (cost=0.00..1.95 rows=130 width=0)
                              ->  XN Network (cost=0.00..1.30 rows=130 width=0)
                                    Send to slice 0
                                    ->  XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
            ->  XN Hash (cost=0.04..0.04 rows=4 width=9)
                  ->  XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Kablamo!
As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large, or your SQL logic complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic into your code.
Three areas we can look at:
- The join. Make the join between both sets of data work on the same data type. Here we join a stored timestamp to a value that is computed on the fly from an interval expression.
- Data distribution. Materialise and distribute both tables by timestamp.
- Data sorting. If the events table is sorted by this timestamp and the time dimension is sorted by both timestamps, then you may be able to complete the entire query using a merge join, with no data redistribution and without sending the data to the leader node for aggregation. (A quick way to inspect the current distribution and sort keys is sketched just below.)
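As an aside, if you want to check how an existing table is currently distributed and sorted before restructuring it, one option is the svv_table_info system view. This is only an illustrative query; it assumes you have access to that view and that the tables live in the public schema:

    -- Distribution style, first sort key column and row skew for the tables used in this answer
    SELECT "table", diststyle, sortkey1, skew_rows
    FROM svv_table_info
    WHERE "schema" = 'public'
      AND "table" IN ('temporary_events', 'temporary_time_dimension');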
Observe:
    drop table if exists public.temporary_time_dimension;
    create table public.temporary_time_dimension
    distkey(dtm) sortkey(dtm, dtm_half_hour) AS
    (SELECT dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
     FROM
         /* Create a series of timestamps, 1 per second, working backwards from NOW(). */
         /* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM
              /* Generate a number sequence of 100,000 values from a large internal table */
              (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
    ;

    drop table if exists public.temporary_events;
    create table public.temporary_events
    distkey(ts) sortkey(ts) AS
    select ts::timestamp as ts
          ,lead
    from (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
    ;

    explain
    SELECT dtm_half_hour
          ,COUNT(CASE WHEN lead THEN 1 END)
    FROM public.temporary_time_dimension td
    LEFT JOIN public.temporary_events e ON td.dtm = e.ts
    WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
    GROUP BY 1
    -- ORDER BY dtm_half_hour
This then gives:
    XN HashAggregate (cost=1512.67..1512.68 rows=1 width=9)
      ->  XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
            Merge Cond: ("outer".dtm = "inner"."ts")
            ->  XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
                  Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
            ->  XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Important caveats:
- I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with the sort!
- I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
- I think you will likely want to have a time dimension created with DISTSTYLE ALL and sorted. This will ensure that your joins do not generate network traffic (a sketch of that variant follows below).
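For completeness, here is a minimal sketch of that DISTSTYLE ALL variant, reusing the column names and the ROW_NUMBER() trick from above. Treat it as an assumption about how you might build it rather than a prescription, and size the generated range to whatever span your events actually cover:

    /* Same time dimension as before, but copied in full to every node (DISTSTYLE ALL)
       so that joins against it never need to redistribute rows over the network. */
    drop table if exists public.temporary_time_dimension_all;
    create table public.temporary_time_dimension_all
    diststyle all
    sortkey(dtm, dtm_half_hour) AS
    (SELECT dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
     FROM (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
           FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn);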
A final note on generate_series(): you might be tempted to build the time dimension with the generate_series() function instead, but it is on Redshift's list of unsupported PostgreSQL features. A query such as `select * from generate_series(0, (24*60), 30) n;` does run, because it executes entirely on the leader node. As soon as the query also references a Redshift table, however, the compute nodes cannot evaluate generate_series() and the query errors out. So generate_series() cannot be used to populate a table that is joined against Redshift data; you have to build such a table another way, for example with multiple SELECTs UNIONed together or the ROW_NUMBER() trick used above.
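To make that behaviour concrete, here is a small illustrative sketch against the temporary_events table created earlier in this answer. It simply reflects the leader-node-only behaviour described above; the exact error text, and whether the second query fails at all, may vary by Redshift version:

    -- Runs: no Redshift table is referenced, so the whole query executes on the leader node
    SELECT * FROM generate_series(0, (24*60), 30) n;

    -- Fails on clusters where compute nodes cannot evaluate generate_series(),
    -- because the scan of public.temporary_events runs on the compute nodes
    SELECT n, e.ts
    FROM generate_series(0, (24*60), 30) n
    CROSS JOIN public.temporary_events e;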