
I need to turn a two-column DataFrame into a list grouped by one of the columns. I have done it successfully in pandas:

expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist()) 
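
For reference, here is roughly what that does on the example data shown further down (a sketch; I'm assuming the columns are called session and name):

import pandas as pd

# sample data matching the example below (assumed column names)
expertsDF = pd.DataFrame({"session": [1, 1, 2, 2],
                          "name": ["a", "b", "v", "c"]})

# group by session and collapse the remaining column into a list per group
expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
# row 0: session=1, name=['a', 'b']
# row 1: session=2, name=['v', 'c']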

But now I am trying to do the same thing in pySpark as follows:

expertsDF = df.groupBy('session').agg(lambda x: x.collect()) 

and I am getting the error:

all exprs should be Column 

I have tried several commands but I simply cannot get it right, and the Spark documentation does not contain anything similar.

An example input would be a DataFrame:

session  name
1        a
1        b
2        v
2        c

output:

session  name
1        [a, b, ...]
2        [v, c, ...]
  • can you share example data and expected output please? Commented Oct 22, 2016 at 16:32
  • @mtoto yes sure, done! Commented Oct 22, 2016 at 16:44
  • Try this: from pyspark.sql.functions import *; df.groupBy('session').agg(collect_list('name')) Commented Oct 22, 2016 at 16:47

2 Answers


You can also use the pyspark.sql.functions.collect_list(col) function:

from pyspark.sql.functions import *
df.groupBy('session').agg(collect_list('name'))
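
A minimal, self-contained sketch of this approach (assuming a local SparkSession and the example data from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# example data from the question
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "v"), (2, "c")],
                           ["session", "name"])

df.groupBy("session").agg(collect_list("name").alias("name")).show()
# +-------+------+
# |session|  name|
# +-------+------+
# |      1|[a, b]|
# |      2|[v, c]|
# +-------+------+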



You could use reduceByKey() to do this efficiently:

(df.rdd
   .map(lambda x: (x[0], [x[1]]))
   .reduceByKey(lambda x, y: x + y)
   .toDF(["session", "name"])
   .show())

+-------+------+
|session|  name|
+-------+------+
|      1|[a, b]|
|      2|[v, c]|
+-------+------+

Data:

df = sc.parallelize([(1, "a"), (1, "b"), (2, "v"), (2, "c")]).toDF(["session", "name"]) 
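
As a design note, reduceByKey merges the per-key lists within each partition before the shuffle, which is why it is generally preferred over groupByKey for this kind of RDD aggregation; if you are already working with DataFrames, collect_list from the other answer gives the same result without dropping down to the RDD API.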

