I have a complicated windowing operation in PySpark which I need help with.
I have some data grouped by src and dest, and for each group I need to do the following:

- keep only rows whose socket2 value does not appear in socket1 anywhere in that group (across all rows of the group)
- after applying that filter, sum the amounts field
amounts src dest socket1 socket2
10      1   2    A       B
11      1   2    B       C
12      1   2    C       D
510     1   2    C       D
550     1   2    B       C
500     1   2    A       B
80      1   3    A       B

And I want to aggregate it in the following way:
For src=1, dest=2 the only socket2 value never seen in socket1 is D, so the qualifying rows sum to 12 + 510 = 522; for src=1, dest=3, the row with amount 80 is the only record and it qualifies.
amounts src dest
522     1   2
80      1   3

I borrowed the sample data from here: How to write Pyspark UDAF on multiple columns?