I have a data frame:

+---+-------------+
| id|         Name|
+---+-------------+
|  1|       'Gary'|
|  1|      'Danny'|
|  2|'Christopher'|
|  2|      'Kevin'|
+---+-------------+

I need to collect all the Name values for each id into a list. How can I get from the above to this:

+---+------------------------+
| id|                    Name|
+---+------------------------+
|  1|       ['Gary', 'Danny']|
|  2|['Kevin', 'Christopher']|
+---+------------------------+

3 Answers

You can use groupBy together with a collect function. Depending on what you need, use collect_list or collect_set.

df.groupBy(col("id")).agg(collect_list(col("Name")))

in case you want duplicate values kept, or

df.groupBy(col("id")).agg(collect_set(col("Name")))

if you want only unique values.
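The difference between the two aggregations can be sketched in plain Python, without a Spark session: collect_list keeps every value per key (duplicates included), while collect_set keeps only distinct values. The sample rows below mirror the question's DataFrame, with one duplicate added to show the difference:

```python
# Sample rows mirroring the question's data, plus one duplicate for illustration.
rows = [(1, "Gary"), (1, "Danny"), (1, "Danny"), (2, "Christopher"), (2, "Kevin")]

collected_list = {}  # like groupBy("id").agg(collect_list("Name")): duplicates kept
collected_set = {}   # like groupBy("id").agg(collect_set("Name")): duplicates dropped
for id_, name in rows:
    collected_list.setdefault(id_, []).append(name)
    collected_set.setdefault(id_, set()).add(name)

print(collected_list[1])         # ['Gary', 'Danny', 'Danny'] — duplicates kept
print(sorted(collected_set[1]))  # ['Danny', 'Gary'] — duplicates dropped
```

In Spark itself the same distinction applies per group; note also that collect_set, like a Python set, does not guarantee element order.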


Use the groupBy and collect_list functions for this case.

from pyspark.sql.functions import *

df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10, False)
#+---+------------------------+
#|id |Name                    |
#+---+------------------------+
#|1  |['Gary', 'Danny']       |
#|2  |['Kevin', 'Christopher']|
#+---+------------------------+

If you are working with a pandas DataFrame rather than PySpark:

df.groupby('id')['Name'].apply(list)
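A minimal runnable sketch of this pandas approach, assuming a DataFrame built from the question's data:

```python
import pandas as pd

# Sample DataFrame mirroring the question's data.
df = pd.DataFrame({"id": [1, 1, 2, 2],
                   "Name": ["Gary", "Danny", "Christopher", "Kevin"]})

# Collect the Name values into one Python list per id.
grouped = df.groupby("id")["Name"].apply(list)
print(grouped.to_dict())  # {1: ['Gary', 'Danny'], 2: ['Christopher', 'Kevin']}
```

Unlike collect_set, this keeps duplicates; wrap the list in set() inside the apply if you need unique values.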
