
In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the result table. One way to do this is to GROUP BY a CASE expression that returns a single value, e.g., NULL, for all unimportant groups, coalescing them into one row. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and close to PostgreSQL 8.0 in terms of functionality.
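In sketch form (table and column names here are hypothetical, not from the question), the same CASE expression appears in both the select list and the GROUP BY clause:

SELECT CASE WHEN category IN ('keep1', 'keep2') THEN category END AS category,
       COUNT(*) AS cnt
FROM some_table
GROUP BY CASE WHEN category IN ('keep1', 'keep2') THEN category END;
-- every row whose category falls outside the keep-list lands in a single NULL group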

As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to aggregate such that one row is emitted for every (type, url) pair whose visit count meets or exceeds a certain threshold, and one (type, NULL) row is emitted for all (type, url) pairs whose visit counts are under that threshold. The remaining columns in the result table would carry SUM/COUNT aggregates based on this grouping.
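For concreteness, a minimal hypothetical schema for the table t used in the queries below (the real table carries 50+ other columns, elided here):

CREATE TABLE t (
    type VARCHAR(16),
    url  VARCHAR(2048)
    -- < 50+ other columns elided >
);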

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
| A    | http://popular.com   |                       |
| A    | http://popular.com   |                       |
| A    | < 9997 more times>   |                       |
| A    | http://popular.com   |                       |
| A    | http://small-one.com |                       |
| B    | http://tiny.com      |                       |
| B    | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+

should produce the following result table with a threshold of 10,000

+------+--------------------+-------------+--------------------------+
| type | url                | visit_count | < SUM/COUNT aggregates > |
+------+--------------------+-------------+--------------------------+
| A    | http://popular.com | 10000       |                          |
| A    |                    | 1           |                          |
| B    |                    | 2           |                          |
+------+--------------------+-------------+--------------------------+

Summary:

Amazon Redshift has certain subquery correlation limitations one needs to tiptoe around. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, replicating the expression in both the result column and the outer GROUP BY clause.

WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT type,
       (CASE WHEN cnt >= 10000 THEN url END) AS url,
       SUM(cnt) AS cnt
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END)

Further testing indicated that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one per CASE branch. In this particular case, on a sample data set of approximately 200M rows, this approach consistently performed about 30% faster. That result is schema- and data-specific, however.

WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT * FROM temp_counts
WHERE cnt >= 10000
UNION ALL
SELECT type, NULL AS url, SUM(cnt) AS cnt
FROM temp_counts
WHERE cnt < 10000
GROUP BY type

This suggests two general patterns for implementing and optimizing arbitrary disjoint grouping and summarization in Amazon Redshift. If performance is important to you, benchmark both.

2 Answers


You would do this with two aggregations:

SELECT type,
       (CASE WHEN cnt > XXX THEN url END) AS url,
       SUM(cnt) AS visit_cnt
FROM (SELECT type, url, COUNT(*) AS cnt
      FROM t
      GROUP BY type, url
     ) t
GROUP BY type, (CASE WHEN cnt > XXX THEN url END)
ORDER BY type, SUM(cnt) DESC;

9 Comments

I played with this path at the beginning also but it generates the following error: An error occurred when executing the SQL command: select type, (case when cnt >=10000 then url end) as url, sum(cnt) as visit_cnt from (select type, url, ... ERROR: column "t.cnt" must appear in the GROUP BY clause or be used in an aggregate function [SQL State=42803]
I tracked down the error to this specific feature--references in a GROUP BY column to the results of a correlated subquery--not being supported by Amazon Redshift and updated the question with the info.
@Sim . . . Try the version with the expression in the group by (I just modified it).
Hmm, if I externalize the nested SELECT to a WITH clause, the example above (with double aggregation) runs consistently about 30% slower than a UNION of two SELECTs: one WHERE cnt > XXX and the other of the form SELECT type, NULL as url, SUM(cnt) from temp_counts WHERE cnt < XXX GROUP BY 1. Does this strike you as an intuitive outcome? The final ordering should not matter much: the result has fewer than 10,000 rows.
@Sim . . . Postgres materializes CTEs. So the aggregation is run only once. This probably significantly speeds up the union query. I would suggest doing a union all instead of union to be even faster.
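(An aside illustrating the comment's point, using a trivial made-up query: UNION deduplicates the combined result, which costs an extra sort/unique step, while UNION ALL simply concatenates. The two branches of the unrolled query are disjoint by construction, cnt >= threshold vs. cnt < threshold, so UNION ALL is safe here.)

SELECT 1 AS x UNION     SELECT 1 AS x;  -- one row: duplicates removed
SELECT 1 AS x UNION ALL SELECT 1 AS x;  -- two rows: no dedup step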
  1. First, group on type, url.
  2. Then group a second time on type, CASE WHEN visit_count < 10000 THEN NULL ELSE url END (see the sketch below).

I've used SQL Server syntax; I hope it works for Postgres, too.
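The answer's original query is not shown above. A minimal sketch of the two grouping steps it describes, written against the question's table and following the accepted answer's pattern of repeating the CASE expression in the outer GROUP BY (the visit_count alias is an assumption):

SELECT type,
       CASE WHEN visit_count < 10000 THEN NULL ELSE url END AS url,
       SUM(visit_count) AS visit_count
FROM (SELECT type, url, COUNT(*) AS visit_count  -- step 1: group on (type, url)
      FROM t
      GROUP BY type, url
     ) counts
GROUP BY type, CASE WHEN visit_count < 10000 THEN NULL ELSE url END;  -- step 2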

3 Comments

Alas, that approach generates a SQL error. See my comments to Gordon's answer above. Looks like we need a different approach.
That is a strange error because Gordon's query does not use subqueries at all. It uses a derived table. There is no decorrelation necessary. Did you modify his query before running it? Maybe you added a subquery.
Gordon's updated query works. Thanks for your answer and follow-up.
