I'm using pyspark. So I have a spark dataframe that looks like:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

Need output:

a | b_list
5 | 2,1,4,3
2 | 4,2,3,7

It's important to keep the sequence as given in the output.
Instead of a udf, for joining the list we can also use the concat_ws function, as suggested in the comments above, like this:

import pyspark.sql.functions as F

df = (df
    .withColumn('lst', F.concat(df['b'], F.lit(','), df['c']))
    .groupBy('a')
    .agg(F.concat_ws(',', F.collect_list('lst')).alias('lst')))
df.show()

+---+-------+
|  a|    lst|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+
The following results in the last two columns aggregated into an array column:

import pyspark.sql.functions as f

df1 = (df
    .withColumn('lst', f.concat(df['b'], f.lit(','), df['c']))
    .groupBy('a')
    .agg(f.collect_list('lst').alias('b_list')))

Now join the array elements:

# Simplistic udf to join the array:
def join_array(col):
    return ','.join(col)

join = f.udf(join_array)

df1.select('a', join(df1['b_list']).alias('b_list')).show()

Printing:
+---+-------+
|  a| b_list|
+---+-------+
|  5|2,1,4,3|
|  2|4,2,3,7|
+---+-------+

It is better to use pyspark.sql.functions.concat_ws to do the join, which will be faster than using a udf: pass the output of collect_list directly to concat_ws. For example, take a look at this answer.