I need to turn a two-column DataFrame into a list grouped by one of the columns. I have done it successfully in pandas:
    expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())

But now I am trying to do the same thing in PySpark, as follows:
    expertsDF = df.groupBy('session').agg(lambda x: x.collect())

and I am getting the error:
    all exprs should be Column

I have tried several commands, but I simply cannot get it right, and the Spark documentation does not seem to cover anything similar.
An example input would be the DataFrame:

    session  name
    1        a
    1        b
    2        v
    2        c

and the desired output:

    session  name
    1        [a, b, ...]
    2        [v, c, ...]
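For completeness, the pandas version with the sample data above looks like this (a minimal, reproducible sketch; the printed result is approximate):

    import pandas as pd

    expertsDF = pd.DataFrame({'session': [1, 1, 2, 2], 'name': ['a', 'b', 'v', 'c']})
    expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
    # expertsDF now holds roughly:
    #    session    name
    # 0        1  [a, b]
    # 1        2  [v, c]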
The expressions passed to agg must be Columns, not Python lambdas; collect_list builds the per-group list you want:

    from pyspark.sql.functions import collect_list

    # Collect all 'name' values within each 'session' group into a list
    df.groupBy('session').agg(collect_list('name'))
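A runnable sketch, assuming an active SparkSession named spark (the sample rows mirror the question's example):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import collect_list

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the question's example
    df = spark.createDataFrame(
        [(1, 'a'), (1, 'b'), (2, 'v'), (2, 'c')],
        ['session', 'name'],
    )

    df.groupBy('session').agg(collect_list('name').alias('name')).show()
    # Prints roughly:
    # +-------+------+
    # |session|  name|
    # +-------+------+
    # |      1|[a, b]|
    # |      2|[v, c]|
    # +-------+------+

Note that collect_list does not guarantee any particular order of the elements within each list; if order matters, sort within groups explicitly before collecting.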