
In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the result table. One way to do this is to GROUP BY a CASE expression that returns a single value, e.g., NULL, for all unimportant groups, coalescing them into one row. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and close to PostgreSQL 8.0 in terms of functionality.
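In sketch form (table and column names here are hypothetical, not from the question), the same CASE expression appears in both the select list and the GROUP BY clause:

SELECT CASE WHEN category IN ('keep1', 'keep2') THEN category END AS category,
       COUNT(*) AS cnt
FROM some_table
GROUP BY CASE WHEN category IN ('keep1', 'keep2') THEN category END;
-- every row whose category falls outside the keep-list lands in a single NULL group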

As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to aggregate such that one row is emitted for every (type, url) pair whose visit count meets or exceeds a certain threshold, and one (type, NULL) row is emitted for all (type, url) pairs whose visit counts are under that threshold. The remaining columns in the result table would carry SUM/COUNT aggregates based on this grouping.
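For concreteness, a minimal hypothetical schema for the table t used in the queries below (the real table carries 50+ other columns, elided here):

CREATE TABLE t (
    type VARCHAR(16),
    url  VARCHAR(2048)
    -- < 50+ other columns elided >
);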

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
| A    | http://popular.com   |                       |
| A    | http://popular.com   |                       |
| A    | < 9997 more times>   |                       |
| A    | http://popular.com   |                       |
| A    | http://small-one.com |                       |
| B    | http://tiny.com      |                       |
| B    | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+

should produce the following result table with a threshold of 10,000

+------+--------------------+-------------+--------------------------+
| type | url                | visit_count | < SUM/COUNT aggregates > |
+------+--------------------+-------------+--------------------------+
| A    | http://popular.com | 10000       |                          |
| A    |                    | 1           |                          |
| B    |                    | 2           |                          |
+------+--------------------+-------------+--------------------------+

Summary:

Amazon Redshift has certain subquery correlation limitations one needs to tiptoe around. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, replicating the expression in both the result column and the outer GROUP BY clause.

WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT type,
       (CASE WHEN cnt >= 10000 THEN url END) AS url,
       SUM(cnt) AS cnt
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 10000 THEN url END)

Further testing indicated that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one per CASE branch. In this particular case, on a sample data set of approximately 200M rows, this approach consistently performed about 30% faster. That result is schema- and data-specific, however.

WITH temp_counts AS (
    SELECT type, url, COUNT(*) AS cnt
    FROM t
    GROUP BY type, url
)
SELECT * FROM temp_counts
WHERE cnt >= 10000
UNION ALL
SELECT type, NULL AS url, SUM(cnt) AS cnt
FROM temp_counts
WHERE cnt < 10000
GROUP BY type

This suggests two general patterns for implementing and optimizing arbitrary disjoint grouping and summarization in Amazon Redshift. If performance is important to you, benchmark both.

2 Answers


You would do this with two aggregations:

SELECT type,
       (CASE WHEN cnt > XXX THEN url END) AS url,
       SUM(cnt) AS visit_cnt
FROM (SELECT type, url, COUNT(*) AS cnt
      FROM t
      GROUP BY type, url
     ) t
GROUP BY type, (CASE WHEN cnt > XXX THEN url END)
ORDER BY type, SUM(cnt) DESC;

9 Comments

I played with this path at the beginning also but it generates the following error: An error occurred when executing the SQL command: select type, (case when cnt >=10000 then url end) as url, sum(cnt) as visit_cnt from (select type, url, ... ERROR: column "t.cnt" must appear in the GROUP BY clause or be used in an aggregate function [SQL State=42803]
I tracked down the error to this specific feature--references in a GROUP BY column to the results of a correlated subquery--not being supported by Amazon Redshift and updated the question with the info.
@Sim . . . Try the version with the expression in the group by (I just modified it).
Hmm, if I externalize the nested SELECT to a WITH clause, the example above (with double aggregation) runs consistently about 30% slower than a UNION of two SELECTs: one WHERE cnt > XXX and the other of the form SELECT type, NULL as url, SUM(cnt) from temp_counts WHERE cnt < XXX GROUP BY 1. Does this strike you as an intuitive outcome? The final ordering should not matter much: the result has fewer than 10,000 rows.
@Sim . . . Postgres materializes CTEs. So the aggregation is run only once. This probably significantly speeds up the union query. I would suggest doing a union all instead of union to be even faster.
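(An aside illustrating the comment's point, using a trivial made-up query: UNION deduplicates the combined result, which costs an extra sort/unique step, while UNION ALL simply concatenates. The two branches of the unrolled query are disjoint by construction, cnt >= threshold vs. cnt < threshold, so UNION ALL is safe here.)

SELECT 1 AS x UNION     SELECT 1 AS x;  -- one row: duplicates removed
SELECT 1 AS x UNION ALL SELECT 1 AS x;  -- two rows: no dedup step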
  1. First, group on type, url.
  2. Then group a second time on type, CASE WHEN visit_count < 10000 THEN NULL ELSE url END (see the sketch below).

I've used SQL Server syntax; I hope it works for Postgres, too.
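The answer's original query is not shown above. A minimal sketch of the two grouping steps it describes, written against the question's table and following the accepted answer's pattern of repeating the CASE expression in the outer GROUP BY (the visit_count alias is an assumption):

SELECT type,
       CASE WHEN visit_count < 10000 THEN NULL ELSE url END AS url,
       SUM(visit_count) AS visit_count
FROM (SELECT type, url, COUNT(*) AS visit_count  -- step 1: group on (type, url)
      FROM t
      GROUP BY type, url
     ) counts
GROUP BY type, CASE WHEN visit_count < 10000 THEN NULL ELSE url END;  -- step 2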

3 Comments

Alas, that approach generates a SQL error. See my comments to Gordon's answer above. Looks like we need a different approach.
That is a strange error because Gordon's query does not use subqueries at all. It uses a derived table. There is no decorrelation necessary. Did you modify his query before running it? Maybe you added a subquery.
Gordon's updated query works. Thanks for your answer and follow-up.
