
I need to turn a two-column DataFrame into a list grouped by one of the columns. I have done it successfully in pandas:

expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist()) 
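
For reference, here is roughly what that does on the example data shown further down (a sketch; I'm assuming the columns are called session and name):

import pandas as pd

# sample data matching the example below (assumed column names)
expertsDF = pd.DataFrame({"session": [1, 1, 2, 2],
                          "name": ["a", "b", "v", "c"]})

# group by session and collapse the remaining column into a list per group
expertsDF = expertsDF.groupby('session', as_index=False).agg(lambda x: x.tolist())
# row 0: session=1, name=['a', 'b']
# row 1: session=2, name=['v', 'c']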

But now I am trying to do the same thing in pySpark as follows:

expertsDF = df.groupBy('session').agg(lambda x: x.collect()) 

and I am getting the error:

all exprs should be Column 

I have tried several commands but I simply cannot get it right, and the Spark documentation does not contain anything similar.

An example input would be a DataFrame:

session  name
1        a
1        b
2        v
2        c

output:

session  name
1        [a, b, ...]
2        [v, c, ...]
  • can you share example data and expected output please? Commented Oct 22, 2016 at 16:32
  • @mtoto yes sure, done! Commented Oct 22, 2016 at 16:44
  • Try this: from pyspark.sql.functions import *; df.groupBy('session').agg(collect_list('name')) Commented Oct 22, 2016 at 16:47

2 Answers


You can also use the pyspark.sql.functions.collect_list(col) function:

from pyspark.sql.functions import *
df.groupBy('session').agg(collect_list('name'))
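
A minimal, self-contained sketch of this approach (assuming a local SparkSession and the example data from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# example data from the question
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "v"), (2, "c")],
                           ["session", "name"])

df.groupBy("session").agg(collect_list("name").alias("name")).show()
# +-------+------+
# |session|  name|
# +-------+------+
# |      1|[a, b]|
# |      2|[v, c]|
# +-------+------+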



You could use reduceByKey() to do this efficiently:

(df.rdd
   .map(lambda x: (x[0], [x[1]]))
   .reduceByKey(lambda x, y: x + y)
   .toDF(["session", "name"])
   .show())

+-------+------+
|session|  name|
+-------+------+
|      1|[a, b]|
|      2|[v, c]|
+-------+------+

Data:

df = sc.parallelize([(1, "a"), (1, "b"), (2, "v"), (2, "c")]).toDF(["session", "name"]) 
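
As a design note, reduceByKey merges the per-key lists within each partition before the shuffle, which is why it is generally preferred over groupByKey for this kind of RDD aggregation; if you are already working with DataFrames, collect_list from the other answer gives the same result without dropping down to the RDD API.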

