
I'm working with a Spark DataFrame and I'm trying to create a new table with an aggregation using groupBy.

My example data: [image of the sample data]

And this is the desired result: [image of the desired result]

I tried this code:

data.groupBy("id1").agg(countDistinct("id2").alias("id2"), sum("value").alias("value"))

Can anyone help, please? Thank you.

2 Answers


Try using the code below:

from pyspark.sql.functions import *

df = spark.createDataFrame(
    [('id11', 'id21', 1), ('id11', 'id22', 2), ('id11', 'id23', 3),
     ('id12', 'id21', 2), ('id12', 'id23', 1), ('id13', 'id23', 2),
     ('id13', 'id21', 8)],
    ["id1", "id2", "value"]
)

Aggregated Data -

df.groupBy("id1").agg(count("id2"),sum("value")).show() 

Output -

+----+----------+----------+
| id1|count(id2)|sum(value)|
+----+----------+----------+
|id11|         3|         6|
|id12|         2|         3|
|id13|         2|        10|
+----+----------+----------+
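Note that the original attempt used countDistinct rather than count. If the goal is the number of distinct id2 values per group, a minimal variation of the code above (assuming the same sample df) would be:

from pyspark.sql.functions import countDistinct, sum as sum_

# count only unique id2 values per group, and total the value column
df.groupBy("id1").agg(
    countDistinct("id2").alias("id2"),
    sum_("value").alias("value")
).show()

With this sample data the id2 values within each group happen to be unique already, so the numbers come out the same as the count version above.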

Here's a solution showing how to groupBy and apply multiple aggregations using PySpark:

import pyspark.sql.functions as F
from pyspark.sql.functions import col

df.groupBy("id1").agg(
    F.count(col("id2")).alias("id2_count"),
    F.sum(col("value")).alias("value_sum")
).show()
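Assuming the same sample DataFrame as in the first answer, this would print the same numbers, just under the aliased column names:

+----+---------+---------+
| id1|id2_count|value_sum|
+----+---------+---------+
|id11|        3|        6|
|id12|        2|        3|
|id13|        2|       10|
+----+---------+---------+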
