I have a PySpark dataframe. I need to do a group by and then aggregate certain columns into a list so that I can apply a UDF to the dataframe.
As an example, I have created a dataframe and then grouped by person.
```python
df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
df = df.groupby("Person").agg(
    F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data"))
df.show(truncate=False)
```

```
+------+----------------------------------------------------------------------------+
|Person|data                                                                        |
+------+----------------------------------------------------------------------------+
|Bob   |[[85.8,Food,2017-09-13], [7.8,Household,2017-09-13], [6.52,Food,2017-06-13]]|
+------+----------------------------------------------------------------------------+
```

I have left out the UDF, but the resulting dataframe from the UDF is shown below.
```
+------+--------------------------------------------------------------+
|Person|res                                                           |
+------+--------------------------------------------------------------+
|Bob   |[[562,Food,June,1], [380,Household,Sept,4], [880,Food,Sept,2]]|
+------+--------------------------------------------------------------+
```

I need to convert this dataframe into rows, where each element of the list becomes a new row with its fields split out into new columns, as shown below.
```
+------+------+---------+-----+-------+
|Person|Amount|Budget   |Month|Cluster|
+------+------+---------+-----+-------+
|Bob   |562   |Food     |June |1      |
|Bob   |380   |Household|Sept |4      |
|Bob   |880   |Food     |Sept |2      |
+------+------+---------+-----+-------+
```