I have a complicated windowing operation in PySpark which I need help with.
I have some data grouped by src and dest, and for each group I need to do the following:

- keep only rows whose socket2 value does not appear in socket1 anywhere in that group (across all rows of the group)
- after applying that filter, sum the amounts field
amounts src dest socket1 socket2
10      1   2    A       B
11      1   2    B       C
12      1   2    C       D
510     1   2    C       D
550     1   2    B       C
500     1   2    A       B
80      1   3    A       B

And I want to aggregate it in the following way:
For src=1, dest=2 the only socket2 value never seen in socket1 is D, so the qualifying rows sum to 12 + 510 = 522; for src=1, dest=3, the row with amount 80 is the only record and it qualifies.
amounts src dest
522     1   2
80      1   3

I borrowed the sample data from here: How to write Pyspark UDAF on multiple columns?