Spark aggregation with window functions

I have a Spark DataFrame which I need to use to identify the last active record for each primary key, based on a snapshot date. An example of what I have is:

A  B  C  Snap
1  2  3  2019-12-29
1  2  4  2019-12-31

where the primary key is formed by fields A and B. I need to create a new field indicating which record is active (the one with the last snap among the rows sharing the same PK). So I need something like this:

A  B  C  Snap        activity
1  2  3  2019-12-29  false
1  2  4  2019-12-31  true

I have done this by creating an auxiliary DataFrame and then joining it back with the first one to bring in the active indicator, but my original DataFrame is very big and I need something better in terms of performance. I have been thinking about window functions, but I don't know how to implement them.
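
Something like this is what I have in mind (just a rough pyspark sketch over the sample data; partitioning the window by A and B is my assumption of how it should be set up, and I am not sure it is correct):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample data mirroring the example above
    df = spark.createDataFrame(
        [(1, 2, 3, "2019-12-29"), (1, 2, 4, "2019-12-31")],
        ["A", "B", "C", "Snap"],
    )
    df = df.withColumn("Snap", F.to_date("Snap"))

    # One window per primary key (A, B); no ordering needed for max()
    w = Window.partitionBy("A", "B")

    # A row is active when its snap equals the latest snap for its PK
    df = df.withColumn("activity", F.col("Snap") == F.max("Snap").over(w))

If I understand window functions correctly, this would avoid the auxiliary DataFrame and the join entirely.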

Once I have this, I need to create a new field indicating the end date of each record. It should be filled only when the activity field is false, by subtracting one day from the latest snap date for each set of rows with the same PK (my attempt is sketched after the table). I would need something like this:

A  B  C  Snap        activity  end
1  2  3  2019-12-29  false     2019-12-30
1  2  4  2019-12-31  true
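
For the end date, continuing from df and w in the sketch above, I imagine something like this (again just a sketch; using date_sub on the group's latest snap is my guess at how to express "latest date minus one day"):

    # Fill end only for inactive rows: the latest snap of the PK group
    # minus one day; active rows are left as null
    df = df.withColumn(
        "end",
        F.when(~F.col("activity"), F.date_sub(F.max("Snap").over(w), 1)),
    )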