I wonder if there is an easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using Spark SQL most of the time.
Here is a data example:
```
id  count1  count2  count3
1   null    1       null
1   3       null    null
1   null    null    5
2   null    1       null
2   1       null    null
2   null    null    2
```

The expected output is:
```
id  count1  count2  count3
1   3       1       5
2   1       1       2
```

I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!
Use `groupBy` + `first` with `ignorenulls=True`. Something like:

```python
df.groupBy('id').agg(*[first(c, True).alias(c) for c in df.columns[1:]])
```

Or `groupBy` with `max`:

```python
df.groupBy('id').agg(*[max(c).alias(c) for c in df.columns[1:]]).show()
```

Note that `first` and `max` here are the aggregate functions from `pyspark.sql.functions`, not the Python built-ins, so you need to import them.
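For reference, a minimal self-contained sketch of both approaches against the sample data from the question (the `SparkSession` setup and the `df` name are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; Python None becomes SQL null.
df = spark.createDataFrame(
    [
        (1, None, 1, None),
        (1, 3, None, None),
        (1, None, None, 5),
        (2, None, 1, None),
        (2, 1, None, None),
        (2, None, None, 2),
    ],
    ["id", "count1", "count2", "count3"],
)

# Option 1: first() with ignorenulls=True keeps the first non-null
# value seen for each column within each id group.
result_first = df.groupBy("id").agg(
    *[F.first(c, ignorenulls=True).alias(c) for c in df.columns[1:]]
)

# Option 2: max() also works here, since aggregate functions skip
# nulls and each group has at most one non-null value per column.
result_max = df.groupBy("id").agg(
    *[F.max(c).alias(c) for c in df.columns[1:]]
)

result_first.show()
# Expected output (row order may vary):
# +---+------+------+------+
# | id|count1|count2|count3|
# +---+------+------+------+
# |  1|     3|     1|     5|
# |  2|     1|     1|     2|
# +---+------+------+------+
```

One caveat: both variants rely on each `(id, column)` pair having at most one non-null value, as in the sample data. If a group could contain several non-null values, `first(..., ignorenulls=True)` returns an arbitrary one unless the data is explicitly ordered, and `max` returns the largest, so pick whichever semantics you actually want.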