Joe's answer is a great, neat solution, but I feel one should always consider how the data is distributed and sorted when working in Redshift: it can have a dramatic impact on performance.
Building on Joe's answer, I will first materialise the sample events. In practice the events would already be in a table.
    drop table if exists public.temporary_events;
    create table public.temporary_events AS
    select ts::timestamp as ts
          ,lead
    from (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
    ;
Now run explain:
    explain
    WITH time_dimension AS (
        SELECT dtm
              ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
        FROM
            /* Create a series of timestamps, 1 per second, working backwards from NOW(). */
            /* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
            (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
             FROM
                 /* Generate a number sequence of 100,000 values from a large internal table */
                 (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
    SELECT dtm_half_hour
          ,COUNT(CASE WHEN lead THEN 1 END)
    FROM time_dimension td
    LEFT JOIN public.temporary_events e ON td.dtm = e.ts
    WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
    GROUP BY 1
    -- ORDER BY 1  (left out just to simplify the job a little)
The output is:
    XN HashAggregate (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
      ->  XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
            Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
            Inner Dist Key: e."ts"
            Hash Cond: ("outer"."?column2?" = "inner"."ts")
            ->  XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
                  Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
                  ->  XN Limit (cost=0.00..1.95 rows=130 width=0)
                        ->  XN Window (cost=0.00..1.95 rows=130 width=0)
                              ->  XN Network (cost=0.00..1.30 rows=130 width=0)
                                    Send to slice 0
                                    ->  XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
            ->  XN Hash (cost=0.04..0.04 rows=4 width=9)
                  ->  XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Kablamo!
As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large, or your SQL logic complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic into your code.
Three areas we can look at:
- The join. Make the join between both sets of data work on the same data type. Here we join a stored timestamp to a value that is computed on the fly from an interval expression.
- Data distribution. Materialise and distribute both tables by timestamp.
- Data sorting. If the events table is sorted by this timestamp and the time dimension is sorted by both timestamps, then you may be able to complete the entire query using a merge join, with no data redistribution and without sending the data to the leader node for aggregation. (A quick way to inspect the current distribution and sort keys is sketched just below.)
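As an aside, if you want to check how an existing table is currently distributed and sorted before restructuring it, one option is the svv_table_info system view. This is only an illustrative query; it assumes you have access to that view and that the tables live in the public schema:

    -- Distribution style, first sort key column and row skew for the tables used in this answer
    SELECT "table", diststyle, sortkey1, skew_rows
    FROM svv_table_info
    WHERE "schema" = 'public'
      AND "table" IN ('temporary_events', 'temporary_time_dimension');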
Observe:
    drop table if exists public.temporary_time_dimension;
    create table public.temporary_time_dimension
    distkey(dtm) sortkey(dtm, dtm_half_hour) AS
    (SELECT dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
     FROM
         /* Create a series of timestamps, 1 per second, working backwards from NOW(). */
         /* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM
              /* Generate a number sequence of 100,000 values from a large internal table */
              (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
    ;

    drop table if exists public.temporary_events;
    create table public.temporary_events
    distkey(ts) sortkey(ts) AS
    select ts::timestamp as ts
          ,lead
    from (          SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
          UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
    ;

    explain
    SELECT dtm_half_hour
          ,COUNT(CASE WHEN lead THEN 1 END)
    FROM public.temporary_time_dimension td
    LEFT JOIN public.temporary_events e ON td.dtm = e.ts
    WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
    GROUP BY 1
    -- ORDER BY dtm_half_hour
This then gives:
    XN HashAggregate (cost=1512.67..1512.68 rows=1 width=9)
      ->  XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
            Merge Cond: ("outer".dtm = "inner"."ts")
            ->  XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
                  Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
            ->  XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Important caveats:
- I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with the sort!
- I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
- I think you will likely want to have a time dimension created with DISTSTYLE ALL and sorted. This will ensure that your joins do not generate network traffic (a sketch of that variant follows below).
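For completeness, here is a minimal sketch of that DISTSTYLE ALL variant, reusing the column names and the ROW_NUMBER() trick from above. Treat it as an assumption about how you might build it rather than a prescription, and size the generated range to whatever span your events actually cover:

    /* Same time dimension as before, but copied in full to every node (DISTSTYLE ALL)
       so that joins against it never need to redistribute rows over the network. */
    drop table if exists public.temporary_time_dimension_all;
    create table public.temporary_time_dimension_all
    diststyle all
    sortkey(dtm, dtm_half_hour) AS
    (SELECT dtm::timestamp as dtm
           ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
     FROM (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
           FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn);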
A final note on generate_series(): you might be tempted to build the time dimension with the generate_series() function instead, but it is on Redshift's list of unsupported PostgreSQL features. A query such as `select * from generate_series(0, (24*60), 30) n;` does run, because it executes entirely on the leader node. As soon as the query also references a Redshift table, however, the compute nodes cannot evaluate generate_series() and the query errors out. So generate_series() cannot be used to populate a table that is joined against Redshift data; you have to build such a table another way, for example with multiple SELECTs UNIONed together or the ROW_NUMBER() trick used above.
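To make that behaviour concrete, here is a small illustrative sketch against the temporary_events table created earlier in this answer. It simply reflects the leader-node-only behaviour described above; the exact error text, and whether the second query fails at all, may vary by Redshift version:

    -- Runs: no Redshift table is referenced, so the whole query executes on the leader node
    SELECT * FROM generate_series(0, (24*60), 30) n;

    -- Fails on clusters where compute nodes cannot evaluate generate_series(),
    -- because the scan of public.temporary_events runs on the compute nodes
    SELECT n, e.ts
    FROM generate_series(0, (24*60), 30) n
    CROSS JOIN public.temporary_events e;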