I have a data frame:

+---+-------------+
| id|         Name|
+---+-------------+
|  1|       'Gary'|
|  1|      'Danny'|
|  2|'Christopher'|
|  2|      'Kevin'|
+---+-------------+

I need to collect all the Name values for each id into a list. How can I get from the above to this:

+---+------------------------+
| id|                    Name|
+---+------------------------+
|  1|       ['Gary', 'Danny']|
|  2|['Kevin', 'Christopher']|
+---+------------------------+

3 Answers

You can use groupBy together with a collect function. Depending on what you need, use collect_list or collect_set.

df.groupBy(col("id")).agg(collect_list(col("Name")))

in case you want duplicate values kept, or

df.groupBy(col("id")).agg(collect_set(col("Name")))

if you want only unique values.
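The difference between the two aggregations can be sketched in plain Python, without a Spark session: collect_list keeps every value per key (duplicates included), while collect_set keeps only distinct values. The sample rows below mirror the question's DataFrame, with one duplicate added to show the difference:

```python
# Sample rows mirroring the question's data, plus one duplicate for illustration.
rows = [(1, "Gary"), (1, "Danny"), (1, "Danny"), (2, "Christopher"), (2, "Kevin")]

collected_list = {}  # like groupBy("id").agg(collect_list("Name")): duplicates kept
collected_set = {}   # like groupBy("id").agg(collect_set("Name")): duplicates dropped
for id_, name in rows:
    collected_list.setdefault(id_, []).append(name)
    collected_set.setdefault(id_, set()).add(name)

print(collected_list[1])         # ['Gary', 'Danny', 'Danny'] — duplicates kept
print(sorted(collected_set[1]))  # ['Danny', 'Gary'] — duplicates dropped
```

In Spark itself the same distinction applies per group; note also that collect_set, like a Python set, does not guarantee element order.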


Use the groupBy and collect_list functions for this case.

from pyspark.sql.functions import *

df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10, False)
#+---+------------------------+
#|id |Name                    |
#+---+------------------------+
#|1  |['Gary', 'Danny']       |
#|2  |['Kevin', 'Christopher']|
#+---+------------------------+

If you are working with a pandas DataFrame rather than PySpark:

df.groupby('id')['Name'].apply(list)
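A minimal runnable sketch of this pandas approach, assuming a DataFrame built from the question's data:

```python
import pandas as pd

# Sample DataFrame mirroring the question's data.
df = pd.DataFrame({"id": [1, 1, 2, 2],
                   "Name": ["Gary", "Danny", "Christopher", "Kevin"]})

# Collect the Name values into one Python list per id.
grouped = df.groupby("id")["Name"].apply(list)
print(grouped.to_dict())  # {1: ['Gary', 'Danny'], 2: ['Christopher', 'Kevin']}
```

Unlike collect_set, this keeps duplicates; wrap the list in set() inside the apply if you need unique values.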
