(Athena/Presto) Window function (lag) is not reading the previous lag row

Question

I have two tables, credit_memos and credit_memo_items, and want to show each event individually, so I will do an insert (here I'm doing a union, which is logically the same).

I know the key for both tables and am doing a union with them while sharing the same partition key credit_memo_dd that I'll be using for the window.

I'm trying to use lag to fill up the rest of the null rows as shown.

The code for the second column looks as follows:

coalesce(invoice_nk, lag(invoice_nk, 1, '') over (partition by cm_dd order by event_at)) as invoice_nk

For some reason the third row shown in the picture does not pick up the previous lagged row and fill in a value for its own row. I presume this is because Athena does not actually execute the lag function recursively: it takes the previous row in each partition literally and doesn't wait for the previous rows to be evaluated.

Is there a way to grab values recursively as I'm trying to do?

Thanks.

Update:

Here's the full query:

    with cme as (
        with cm as (
            select -- change_log
                row_number() over (partition by id order by updated_at desc) r
                , cast(cm.invoice_id as varchar(99)) as invoice_nk
                , cast(cm.id as varchar(99)) as cm_dd
                , cast(cm.created_at as timestamp) as cm_created_at
                , cast(cm.updated_at as timestamp) as cm_updated_at
            from "postgres-replica-parquet".credit_memos cm
        )
        select
            case when r = 1 then cm_created_at else cm_updated_at end as cm_begin_at
            , null as cmi_begin_at
            , invoice_nk
            , cm_dd
            , cm_created_at
            , cm_updated_at
        from cm
    )
    , cmi as (
        with ci as (
            select -- ci.change_log
                row_number() over (partition by id order by updated_at desc) r
                , cast(ci.credit_memo_id as varchar(99)) as cm_id
                , cast(ci.created_at as timestamp) as cmi_created_at
                , cast(ci.updated_at as timestamp) as cmi_updated_at
            from "postgres-replica-parquet".credit_memo_items ci
        )
        select -- row 1 is the snapshot when cdc first started
            null as cm_begin_at
            , case when r = 1 then cmi_created_at else cmi_updated_at end as cmi_begin_at
            , null as invoice_nk
            , cm_id as cm_dd
            , null as cm_created_at
            , null as cm_updated_at
        from ci
    )
    , combined as (
        with all as (
            select * from cme
            union
            select * from cmi
        )
        select
            greatest(
                coalesce(cm_begin_at, cast(date_parse('1970-01-01', '%Y-%m-%d') as timestamp))
                , coalesce(cmi_begin_at, cast(date_parse('1970-01-01', '%Y-%m-%d') as timestamp))
            ) as event_at
            , *
        from all
    )
    select
        event_at
        , cm_begin_at as event_credit_memo_last_at
        , cmi_begin_at as event_credit_memo_item_last_at
        , coalesce(invoice_nk, lag(invoice_nk, 1, '') over (partition by cm_dd order by event_at)) as invoice_nk
        , cm_dd as credit_memo_dd
    from combined

And a snapshot of the full result:

Rows 3 and 12 should have the invoice_nk based on the credit_memo

cm is the subquery to grab the credit memos

cmi is the subquery to grab the credit memo items

combined is just to precompute the order by key in the window functions

Note that both are pulled from a database that has the full change log (CDC).

The end goal is to make one row for each event, where I'm defining an event for the final table as any time either a credit memo or a credit memo item undergoes a change. I would like to be able to snapshot the state of a credit memo combined with its credit memo items by windowing at event_at for reporting.

Providing table schemas, sample data, and the full query will get you better help. Your window function works correctly in SQL Server (all I have), but there may be more complexities in your query that are causing your issue, or it may be an Athena issue. Grasping at straws, but maybe try order by coalesce(cm_date, cmi_date) instead of event_at.
– Bob Klimes, Commented Mar 11, 2022 at 15:51
LAG is a set-based windowing function; you are trying to use it like an iterative do-while loop. I may work up a solution for you that doesn't involve LAG, greatest or row_number. But a question to you: do you anticipate that any dates are legitimately pre-1970? It appears that you are using Jan 1, 1970 as a "magic value". Why not just leave those as NULL? That is, does any of your downstream logic depend on having a defined value there? Edit: just noticed this is a retro thread despite it being on the front page of dba.
– mpag, Commented Nov 13, 2024 at 15:45
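
The set-based point in the comment above can be sketched as follows. This is a minimal illustration, not the commenter's promised solution: it uses Python's sqlite3 module (assuming a bundled SQLite of 3.25 or later, which is when window functions arrived) rather than Athena, and the table contents, integer event_at values, and the names grp and invoice_nk_filled are hypothetical. A running COUNT of non-NULL values labels each stretch of rows that should share a value, and MAX then spreads that value across the stretch, with no row-by-row iteration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combined (cm_dd TEXT, event_at INTEGER, invoice_nk TEXT)")
conn.executemany(
    "INSERT INTO combined VALUES (?, ?, ?)",
    [
        ("cm-1", 1, "inv-1"),  # credit memo event carrying the invoice key
        ("cm-1", 2, None),     # item event, no invoice key
        ("cm-1", 3, None),     # item event, no invoice key
        ("cm-1", 4, "inv-2"),  # hypothetical later change to the credit memo
        ("cm-1", 5, None),     # item event, no invoice key
    ],
)

query = """
WITH grouped AS (
    SELECT *,
           -- COUNT skips NULLs, so the running count only increases on rows
           -- that carry a value; rows sharing a count form one fill group.
           COUNT(invoice_nk) OVER (PARTITION BY cm_dd ORDER BY event_at) AS grp
    FROM combined
)
SELECT event_at,
       -- Each group holds exactly one non-NULL value, so MAX returns it.
       MAX(invoice_nk) OVER (PARTITION BY cm_dd, grp) AS invoice_nk_filled
FROM grouped
ORDER BY cm_dd, event_at
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

Unlike a single LAG, this carries the most recent non-NULL value forward over any number of NULL rows, and it also handles the case where the value changes partway through a partition.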

Avarkx · Accepted Answer · 2022-03-15 01:16:37Z

The short answer is that the LAG window function is working exactly as intended in your query. You should probably be using the FIRST_VALUE window function instead for the task you appear to be trying to accomplish, but more on that later.

Screenshots are not typically considered sample data, but I've gone ahead and fabricated some schema elements and sample data from what you did provide. Nobody who isn't as bored as I currently am will do this, so in the future it would help you to do a similar thing as below yourself.

    IF ( OBJECT_ID( 'tempdb.dbo.#t_CreditMemo', 'U' ) IS NULL )
    BEGIN
        --DROP TABLE #t_CreditMemo;
        CREATE TABLE #t_CreditMemo
        (
            CreditMemo_PK   BIGINT IDENTITY( 1, 1 ) NOT NULL,
            InvoiceID       UNIQUEIDENTIFIER,
            CreditMemoID    UNIQUEIDENTIFIER,
            DateCreated     DATETIME,
            DateUpdated     DATETIME
        );

        INSERT INTO #t_CreditMemo ( InvoiceID, CreditMemoID, DateCreated, DateUpdated )
        VALUES
            ( '08797CA7-F843-4964-8D8F-46E41A2F232B', '15F6FEBB-0B43-4CD0-880D-8F703A2C0692', '2022-01-28 15:13:33.757', '2022-01-28 15:13:33.757' ),
            ( '0D9A6C3D-987E-42A3-B59E-55551822F650', '06A31D99-9075-493C-8532-A5FCE090F167', '2020-08-28 16:42:56.627', '2020-08-28 16:42:56.627' ),
            ( 'DFDF1A21-2D0A-4826-9C76-78680C4CA0D9', '20AE5822-5846-4EE6-A512-A61EB427BC72', '2020-06-25 21:15:05.273', '2020-06-25 21:15:05.273' ),
            ( '36C39B5A-D18C-4DD6-9B5E-88DC203C5E7B', '1FDCFEDD-0A3A-40A7-A2AF-F435F0486134', '2021-02-12 20:26:04.780', '2021-02-12 20:26:04.780' ),
            ( '45CED7BB-63F4-4A9D-A5EB-9BFD18332E32', '04A450EE-8C3D-4573-9274-C63F82EFFEC8', '2021-06-16 21:07:19.770', '2021-06-16 21:07:19.770' ),
            ( '70452702-0592-4262-BF08-C8C88B4BC27B', '20BDCDEE-93E9-47F4-AE5B-B8D98D404680', '2020-05-18 20:15:19.000', '2020-05-18 20:15:19.000' );
    END;

    IF ( OBJECT_ID( 'tempdb.dbo.#t_CreditMemoItem', 'U' ) IS NULL )
    BEGIN
        --DROP TABLE #t_CreditMemoItem;
        CREATE TABLE #t_CreditMemoItem
        (
            CreditMemo_PK       BIGINT IDENTITY( 1, 1 ) NOT NULL,
            CreditMemoID        UNIQUEIDENTIFIER,
            CreditMemoItemID    UNIQUEIDENTIFIER,
            DateCreated         DATETIME,
            DateUpdated         DATETIME
        );

        INSERT INTO #t_CreditMemoItem ( CreditMemoID, CreditMemoItemID, DateCreated, DateUpdated )
        VALUES
            ( '15F6FEBB-0B43-4CD0-880D-8F703A2C0692', NEWID(), '2022-01-28 15:13:33.767', '2022-01-28 15:13:33.767' ),
            ( '06A31D99-9075-493C-8532-A5FCE090F167', NEWID(), '2020-08-28 16:42:56.633', '2020-08-28 16:42:56.633' ),
            ( '20AE5822-5846-4EE6-A512-A61EB427BC72', NEWID(), '2020-06-25 21:15:05.280', '2020-06-25 21:15:05.280' ),
            ( '20AE5822-5846-4EE6-A512-A61EB427BC72', NEWID(), '2020-06-25 21:15:05.283', '2020-06-25 21:15:05.283' ),
            ( '04A450EE-8C3D-4573-9274-C63F82EFFEC8', NEWID(), '2021-06-16 21:07:19.773', '2021-06-16 21:07:19.773' ),
            ( '04A450EE-8C3D-4573-9274-C63F82EFFEC8', NEWID(), '2021-06-16 21:07:19.777', '2021-06-16 21:07:19.777' ),
            ( '1FDCFEDD-0A3A-40A7-A2AF-F435F0486134', NEWID(), '2021-02-12 20:26:04.787', '2021-02-12 20:26:04.787' );
    END;

Then you can review the differences in the output the window functions provide:

    ;WITH cte_CreditMemo AS
    (
        SELECT  Ordinal = ROW_NUMBER() OVER ( PARTITION BY CreditMemoID ORDER BY DateUpdated DESC ),
                InvoiceID,
                CreditMemoID,
                DateCreated,
                DateUpdated
        FROM    #t_CreditMemo
    ),
    cte_CreditMemoEvent AS
    (
        SELECT  DateCreditMemoBegin = CASE WHEN Ordinal = 1 THEN DateCreated ELSE DateUpdated END,
                DateCreditMemoItemBegin = NULL,
                InvoiceID,
                CreditMemoID,
                DateCreated,
                DateUpdated
        FROM    cte_CreditMemo
    ),
    cte_CreditMemoItem AS
    (
        SELECT  Ordinal = ROW_NUMBER() OVER ( PARTITION BY CreditMemoItemID ORDER BY DateUpdated DESC ),
                CreditMemoID,
                CreditMemoItemID,
                DateCreated,
                DateUpdated
        FROM    #t_CreditMemoItem
    ),
    cte_CreditMemoItemEvent AS
    (
        SELECT  DateCreditMemoBegin = NULL,
                DateCreditMemoItemBegin = CASE WHEN Ordinal = 1 THEN DateCreated ELSE DateUpdated END,
                InvoiceID = NULL,
                CreditMemoID,
                CreditMemoItemID,
                DateCreated,
                DateUpdated
        FROM    cte_CreditMemoItem
    ),
    cte_All AS
    (
        SELECT  DateCreditMemoBegin,
                DateCreditMemoItemBegin,
                InvoiceID,
                CreditMemoID,
                CreditMemoItemID = NULL
        FROM    cte_CreditMemoEvent
        UNION ALL
        SELECT  DateCreditMemoBegin,
                DateCreditMemoItemBegin,
                InvoiceID,
                CreditMemoID,
                CreditMemoItemID
        FROM    cte_CreditMemoItemEvent
    ),
    cte_Combined AS
    (
        SELECT  DateEvent = GREATEST( DateCreditMemoBegin, DateCreditMemoItemBegin ),
                DateCreditMemoBegin,
                DateCreditMemoItemBegin,
                InvoiceID,
                CreditMemoID,
                CreditMemoItemID
        FROM    cte_All
    )
    SELECT  DateEvent,
            DateCreditMemoBegin,
            DateCreditMemoItemBegin,
            InvoiceID = COALESCE( InvoiceID, LAG( InvoiceID, 1, NULL ) OVER ( PARTITION BY CreditMemoID ORDER BY DateEvent ) ),
            InvoiceID = FIRST_VALUE( InvoiceID ) OVER ( PARTITION BY CreditMemoID ORDER BY DateEvent ),
            CreditMemoID,
            CreditMemoItemID
    FROM    cte_Combined
    ORDER BY CreditMemoID, DateEvent;

As you can see from the results of the above query, the FIRST_VALUE window function fills the InvoiceID column the way you describe, whereas the LAG window function does exactly what its name says: it returns the value from the previous row, producing NULL whenever the previous row's value is NULL. LAG reads the source InvoiceID column, not the COALESCE result computed for the previous row, so a second consecutive NULL row has nothing to pick up. This is because database operations are generally intended to be set-based operations - there is no recursion happening here.
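
To make the difference concrete without SQL Server or Athena at hand, here is a minimal sketch using Python's sqlite3 module (assuming the bundled SQLite is 3.25 or later, which is required for window functions). The three rows, the invoice key inv-1, and the output column names via_lag and via_first_value are made up for illustration; the partition and ordering columns mirror the ones in the question.

```python
import sqlite3

# Hypothetical sample: one credit memo event followed by two item events.
rows = [
    ("cm-1", "2022-01-28 15:13:33.757", "inv-1"),  # credit memo event
    ("cm-1", "2022-01-28 15:13:33.767", None),     # item event
    ("cm-1", "2022-01-28 15:13:33.777", None),     # item event
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combined (cm_dd TEXT, event_at TEXT, invoice_nk TEXT)")
conn.executemany("INSERT INTO combined VALUES (?, ?, ?)", rows)

query = """
SELECT
    event_at,
    -- LAG sees the stored invoice_nk of the previous row, not its COALESCE result.
    COALESCE(invoice_nk,
             LAG(invoice_nk) OVER (PARTITION BY cm_dd ORDER BY event_at)) AS via_lag,
    -- FIRST_VALUE looks at the first row of the partition in event_at order.
    FIRST_VALUE(invoice_nk) OVER (PARTITION BY cm_dd ORDER BY event_at) AS via_first_value
FROM combined
ORDER BY cm_dd, event_at
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The third row comes back with NULL in via_lag and inv-1 in via_first_value, which matches the symptom described in the question. Note that FIRST_VALUE only helps here because the credit memo row sorts first within its partition.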

It also may be worth pointing out the solution you're working on appears to be massive overkill for the problem, which very likely has a much simpler solution.

    SELECT  DateEvent = GREATEST( cm.DateCreated, cm.DateUpdated, e.DateCreated, e.DateUpdated ),
            cm.InvoiceID,
            e.CreditMemoID,
            e.CreditMemoItemID
    FROM    (
                SELECT  CreditMemoID,
                        CreditMemoItemID = NULL,
                        DateCreated,
                        DateUpdated
                FROM    #t_CreditMemo
                UNION ALL
                SELECT  CreditMemoID,
                        CreditMemoItemID,
                        DateCreated,
                        DateUpdated
                FROM    #t_CreditMemoItem
            ) e
    INNER JOIN #t_CreditMemo cm
        ON  e.CreditMemoID = cm.CreditMemoID
    ORDER BY e.CreditMemoID, DateEvent;
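
The same join idea can be tried in a few lines with Python's sqlite3 module. This is only a sketch under simplifying assumptions: the table names, column names, and data are hypothetical, there is a single timestamp column instead of the created/updated pair, and each credit memo is assumed to have exactly one row (with a full change log, the join would need to pick the credit memo version in effect at each event).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE credit_memo (credit_memo_id TEXT, invoice_id TEXT, updated_at TEXT);
CREATE TABLE credit_memo_item (credit_memo_id TEXT, item_id TEXT, updated_at TEXT);
INSERT INTO credit_memo VALUES ('cm-1', 'inv-1', '2022-01-28 15:13:33.757');
INSERT INTO credit_memo_item VALUES
    ('cm-1', 'item-1', '2022-01-28 15:13:33.767'),
    ('cm-1', 'item-2', '2022-01-28 15:13:33.777');
""")

query = """
SELECT e.updated_at AS event_at, cm.invoice_id, e.credit_memo_id, e.item_id
FROM (
    -- One row per event, from either table.
    SELECT credit_memo_id, NULL AS item_id, updated_at FROM credit_memo
    UNION ALL
    SELECT credit_memo_id, item_id, updated_at FROM credit_memo_item
) AS e
-- The join supplies the invoice key for every event, so no fill-down is needed.
JOIN credit_memo AS cm ON cm.credit_memo_id = e.credit_memo_id
ORDER BY e.credit_memo_id, event_at
"""
events = conn.execute(query).fetchall()
for row in events:
    print(row)
```

Every event row carries the invoice key directly from the join, so no window function is involved at all.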

Okay, that helps a lot. I'll make sure to have better information in my questions next time.
– Eugene Choi, Commented Mar 15, 2022 at 13:51
