Correlated subqueries in Snowflake doesn't work

Question

I'm trying to run the following query in Snowflake but it fails with Unsupported subquery type cannot be evaluated. The query is valid in other SQL engines such as Postgresql and Presto so it looks like Snowflake doesn't support this type of query.

SELECT first_action.date, DATEDIFF('day', first_action.date, returning_action.date) - 1 as diff, APPROXIMATE_SIMILARITY(select MINHASH_COMBINE(value) from (select first_action.user_id_set as value union all select returning_action.user_id_set)) _set FROM (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as first_action JOIN (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as returning_action ON (first_action.date < returning_action.date AND dateadd(day, 14, first_action.date) >= returning_action.date) group by 1,2

The query is a typical cohort query that uses MinHash. We calculate the MinHash for each day, join the next 14 days and merge the results and finally calculate the final result.

Since MinHash doesn't have a linear MINHASH_COMBINE function, we had to use a subquery with UNION all in order to make it work but that also didn't work out. :/

We're stuck right now since we don't really know any workaround. Any help is appreciated!

Nilesh Ingle · Accepted Answer · 2019-04-06 19:12:23Z

Not sure if this will work, tried separating the first_action and returning_action using WITH statement:

WITH first_action as ( SELECT TRY_CAST(_time AS DATE) as date, MINHASH(100, _user) as user_id_set FROM events GROUP BY 1 ), returning_action as ( SELECT TRY_CAST(_time AS DATE) as date, MINHASH(100, _user) as user_id_set FROM events GROUP BY 1 ), SELECT first_action.date, DATEDIFF('day', fa.date, ra.date) - 1 as diff, APPROXIMATE_SIMILARITY( SELECT MINHASH_COMBINE(value) FROM ( SELECT fa.user_id_set AS value FROM first_action fa UNION ALL SELECT ra.user_id_set AS value FROM returning_action ra ) ) _set FROM first_action fa JOIN returning_action ra ON (fa.date < ra.date AND DATEADD(day, 14, fa.date) >= ra.date) GROUP BY 1,2

Simeon Pilgrim · Accepted Answer · 2019-04-17 00:40:48Z

So the major trick is that all these MINHASH_ functions window functions so you need to build a grouping key on the data.

So using this as my example data:

CREATE TABLE events(_user number, _time timestamp_ntz); INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'), (1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'), (2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'), (2,'2019-03-15');

The first set is to get the 14 days of data for the COMBINE

WITH actions AS ( SELECT _time::date as date ,dateadd(day, 14, date) as date14 ,minhash(100, _user) as user_id_set FROM events GROUP BY 1 ) SELECT fa.date ,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date) ,MINHASH_COMBINE(ra.user_id_set) AS sets FROM actions AS fa JOIN actions AS ra ON (fa.date <= ra.date AND fa.date14 > ra.date) GROUP BY 1 ORDER BY 1;

this is similar to your code, but here I include on the RA the same day as FA. so I can group by FA.date, but have FA's data included. On the date range I was not sure if you are want 14 days later or 14 days worth of date. I assume the later, thus changed the end range termination.

Now we have the combined next 14 days of data for each day, we want to get the pairs of (in my code I do not put a max days of compare but just include all pairs). Now again the APPROXIMATE_SIMILARITY is a window function, so I build an Array which I will rip straight apart again, thus pivoting the data, which is what you're trying to do via the union all (this can bee seen in the pairs and unrolled CTE's)

WITH actions AS ( SELECT _time::date AS date ,dateadd(day, 14, date) as date14 ,minhash(100, _user) AS user_id_set FROM events GROUP BY 1 ), combined AS ( SELECT fa.date ,MINHASH_COMBINE(ra.user_id_set) AS sets FROM actions AS fa JOIN actions AS ra ON fa.date <= ra.date AND fa.date14 > ra.date GROUP BY 1 ), pairs AS ( SELECT fa.date ,DATEDIFF('day', fa.date, ra.date) AS diff ,ARRAY_CONSTRUCT(fa.sets,ra.sets) AS comp_set FROM combined AS fa JOIN combined AS ra ON fa.date < ra.date ), unrolled AS ( SELECT date ,diff ,f.value AS sets FROM pairs p, LATERAL FLATTEN(input => p.comp_set) f ) SELECT date ,diff ,APPROXIMATE_SIMILARITY(sets) FROM unrolled GROUP BY 1,2 ORDER BY 1,2;

and thus you get results for all days

DATE DIFF APPROXIMATE_SIMILARITY(SETS) 2019-03-01 4 1 2019-03-01 9 1 2019-03-01 10 1 2019-03-01 13 1 2019-03-01 14 1 2019-03-01 15 0.51 2019-03-05 5 1 2019-03-05 6 1 2019-03-05 9 1 2019-03-05 10 1 2019-03-05 11 0.51 2019-03-10 1 1 2019-03-10 4 1 2019-03-10 5 1 2019-03-10 6 0.51 2019-03-11 3 1 2019-03-11 4 1 2019-03-11 5 0.51 2019-03-14 1 1 2019-03-14 2 0.51 2019-03-15 1 0.51

Collectives™ on Stack Overflow

Correlated subqueries in Snowflake doesn't work

2 Answers 2

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Linked

Related