Spark aggregation with window functions

I have a Spark DataFrame which I need to use to identify the last active record for each primary key, based on a snapshot date. An example of what I have is:

A  B  C  Snap
1  2  3  2019-12-29
1  2  4  2019-12-31

where the primary key is formed by fields A and B. I need to create a new field indicating which record is active (the one with the last snap among the rows sharing the same PK). So I need something like this:

A  B  C  Snap        activity
1  2  3  2019-12-29  false
1  2  4  2019-12-31  true

I have done this by creating an auxiliary DataFrame and then joining it back with the first one to bring in the active indicator, but my original DataFrame is very big and I need something better in terms of performance. I have been thinking about window functions, but I don't know how to implement them.
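
Something like this is what I have in mind (just a rough pyspark sketch over the sample data; partitioning the window by A and B is my assumption of how it should be set up, and I am not sure it is correct):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample data mirroring the example above
    df = spark.createDataFrame(
        [(1, 2, 3, "2019-12-29"), (1, 2, 4, "2019-12-31")],
        ["A", "B", "C", "Snap"],
    )
    df = df.withColumn("Snap", F.to_date("Snap"))

    # One window per primary key (A, B); no ordering needed for max()
    w = Window.partitionBy("A", "B")

    # A row is active when its snap equals the latest snap for its PK
    df = df.withColumn("activity", F.col("Snap") == F.max("Snap").over(w))

If I understand window functions correctly, this would avoid the auxiliary DataFrame and the join entirely.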

Once I have this, I need to create a new field indicating the end date of each record. It should be filled only when the activity field is false, by subtracting one day from the latest snap date for each set of rows with the same PK (my attempt is sketched after the table). I would need something like this:

A  B  C  Snap        activity  end
1  2  3  2019-12-29  false     2019-12-30
1  2  4  2019-12-31  true
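
For the end date, continuing from df and w in the sketch above, I imagine something like this (again just a sketch; using date_sub on the group's latest snap is my guess at how to express "latest date minus one day"):

    # Fill end only for inactive rows: the latest snap of the PK group
    # minus one day; active rows are left as null
    df = df.withColumn(
        "end",
        F.when(~F.col("activity"), F.date_sub(F.max("Snap").over(w), 1)),
    )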