PySpark dataframe: Summing over a column while grouping over another

Question

I have a dataframe such as the following:

    In [94]: prova_df.show()
    order_item_order_id order_item_subtotal
    1                   299.98
    2                   199.99
    2                   250.0
    2                   129.99
    4                   49.98
    4                   299.95
    4                   150.0
    4                   199.92
    5                   299.98
    5                   299.95
    5                   99.96
    5                   299.98

What I would like to do is to compute, for each different value of the first column, the sum over the corresponding values of the second column. I've tried doing this with the following code:

    from pyspark.sql import functions as func

    prova_df.groupBy("order_item_order_id").agg(func.sum("order_item_subtotal")).show()

which gives the output:

    SUM('order_item_subtotal)
    129.99000549316406
    579.9500122070312
    199.9499969482422
    634.819995880127
    434.91000747680664

I'm not sure whether this is doing the right thing. Why doesn't it also show the information from the first column? Thanks in advance for your answers.

zero323 · Accepted Answer · 2015-11-28 05:35:14Z

Why isn't it showing also the information from the first column?

Most likely because you're using the outdated Spark 1.3.x. If that's the case, you have to repeat the grouping columns inside agg, as follows:

    (df
        .groupBy("order_item_order_id")
        .agg(func.col("order_item_order_id"), func.sum("order_item_subtotal"))
        .show())

Zac Roberts · Accepted Answer · 2019-09-26 17:14:10Z

A similar solution for your problem using PySpark 2.x would look like this:

    df = spark.createDataFrame(
        [(1, 299.98),
         (2, 199.99),
         (2, 250.0),
         (2, 129.99),
         (4, 49.98),
         (4, 299.95),
         (4, 150.0),
         (4, 199.92),
         (5, 299.98),
         (5, 299.95),
         (5, 99.96),
         (5, 299.98)],
        ['order_item_order_id', 'order_item_subtotal'])

    df.groupBy('order_item_order_id').sum('order_item_subtotal').show()

Which results in the following output:

    +-------------------+------------------------+
    |order_item_order_id|sum(order_item_subtotal)|
    +-------------------+------------------------+
    |                  5|       999.8700000000001|
    |                  1|                  299.98|
    |                  2|                  579.98|
    |                  4|                  699.85|
    +-------------------+------------------------+
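As a sanity check on these totals, the same grouping can be reproduced in plain Python without a Spark session. This is only a sketch of what the grouped sum computes, not Spark code; the names `rows` and `totals` are made up for illustration, and the data is copied from the createDataFrame call above.

```python
from collections import defaultdict

# Same (order_item_order_id, order_item_subtotal) pairs as in the example data.
rows = [(1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99),
        (4, 49.98), (4, 299.95), (4, 150.0), (4, 199.92),
        (5, 299.98), (5, 299.95), (5, 99.96), (5, 299.98)]

# Accumulate one running total per distinct order id (the "group by" key).
totals = defaultdict(float)
for order_id, subtotal in rows:
    totals[order_id] += subtotal

# Round only for display: binary floating point can leave a tiny residue,
# which is why Spark printed 999.8700000000001 for order 5.
for order_id in sorted(totals):
    print(order_id, round(totals[order_id], 2))
```

Each order id appears exactly once in the result, with its total next to it, which is the shape the question was expecting from groupBy.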

luminousmen · Accepted Answer · 2020-12-18 02:36:57Z

You can use partitionBy in a window function for that:

    from pyspark.sql import Window
    from pyspark.sql import functions as f  # needed for f.sum below

    df.withColumn("value_field", f.sum("order_item_subtotal") \
        .over(Window.partitionBy("order_item_order_id"))) \
        .show()
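Note that the window version behaves differently from groupBy: it keeps every input row and adds the per-order total as an extra column, rather than collapsing each order to a single row. A plain-Python sketch of that behaviour (not Spark code; `rows`, `partition_totals` and `with_totals` are hypothetical names, and only a few of the sample rows are used):

```python
from collections import defaultdict

# A few (order_item_order_id, order_item_subtotal) rows from the question.
rows = [(2, 199.99), (2, 250.0), (2, 129.99), (4, 49.98), (4, 299.95)]

# First pass: total per partition key, like sum(...) over partitionBy(...).
partition_totals = defaultdict(float)
for order_id, subtotal in rows:
    partition_totals[order_id] += subtotal

# Second pass: keep every original row and append its partition's total,
# which is the role "value_field" plays in the snippet above.
with_totals = [
    (order_id, subtotal, round(partition_totals[order_id], 2))
    for order_id, subtotal in rows
]

for row in with_totals:
    print(row)
```

If one row per order is what you want, the groupBy approach from the other answers is the closer fit; the window form is useful when the total is needed alongside each line item.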

answered Jul 19, 2018 at 10:27


2 Comments

karthik r Over a year ago

what is value_field here?

information_interchange Over a year ago

Just an arbitrary string that you want the column to be named
