
I want to sum different columns in a Spark DataFrame.

Code

from pyspark.sql import functions as F

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark 2.2.

3 Answers


Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

Here you are using the Python built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
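A minimal sketch (reusing the df from the question) of what that built-in sum actually builds: it walks the list and keeps applying +, so the result is a single Column expression; no data is touched at this point.

# Plain Python sum over a list of Column objects just chains `+`
cols = [df[c] for c in ["`A.p1`", "`B.p1`"]]
total = sum(cols)                        # a single Column expression, not a value
df.withColumn('sum1', total).show()     # sum1: 3, 93, 72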

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

Here you are using the PySpark sum function, which takes a column as input and aggregates it over rows, whereas you are trying to combine columns within each row. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
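To make the difference concrete, here is a short sketch (again reusing the df from the question) of what F.sum is actually meant for, namely aggregating a single column over all rows:

from pyspark.sql import functions as F

# F.sum is an aggregate: one output row per group (here, the whole frame)
df.select(F.sum(F.col("`A.p1`"))).show()                       # 1 + 4 + 12 = 17
df.agg(F.sum(F.col("`A.p1`")), F.sum(F.col("`B.p1`"))).show()  # 17 and 151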

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply sum over it.
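Just to illustrate that last point (a driver-side sketch for illustration only, since collecting gives up Spark's parallelism), iterating row-wise would look something like this:

# Row objects are iterable, so builtins.sum works per row after collect()
rows = df.select("`A.p1`", "`B.p1`").collect()
row_sums = [sum(row) for row in rows]    # [3, 93, 72]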


6 Comments

Thanks. Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it?
If you are using only two columns as mentioned, you can sum them straightaway: df.withColumn('sum1', df['A.p1'] + df['B.p1']). But if there are many columns, you can use a UDF.
I want to pass n columns in a list and sum them; my list of columns will change based on user input.
By the way, why is native Python sum() not benefiting?
It's not a PySpark function, so it won't really be completely benefiting from Spark, right?

TL;DR builtins.sum is just fine.


Following your comments:

Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it

and

it's not a PySpark function, so it won't really be completely benefiting from Spark, right.

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]] 

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>] 

Let's call it iterable.

sum reduces this list by taking its elements and calling the __add__ method (+). The imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
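The same reduction can also be written with functools.reduce (a sketch of the equivalence, not code from the original answer):

from functools import reduce
from operator import add

# reduce applies + pairwise over the Column objects, exactly like the loop above
accum = reduce(add, iterable)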

This gives Column:

Column<b'(A.p1 + B.p1)'> 

which is the same as calling

df["`A.p1`"] + df["`B.p1`"] 

No data has been touched, and when evaluated it benefits from all Spark optimizations.
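If you want to convince yourself, a quick check (assuming the df from the question) is to look at the plan Spark produces; the Python-side sum has already been folded into one column expression by the time Spark sees it:

# The plan shows a plain projection over the source data, with no Python UDF
expr_col = sum([df[c] for c in ["`A.p1`", "`B.p1`"]])
df.withColumn('sum1', expr_col).explain()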



Addition of multiple columns from a list into one column

I tried a lot of methods and the following are my observations:

  1. PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
  2. Python's built-in sum function works for some folks but gives an error for others (possibly because of a conflict in names).

In your 3rd approach, the expression (inside Python's sum function) is returning a PySpark DataFrame.

So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of columns. We can also use any other complex expression to get other results.
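One caveat worth adding (my note, not part of the original answer): for column names that contain dots, like A.p1 and B.p1 in the question, each name needs backticks in the joined expression, otherwise expr parses the dot as a struct-field reference. A sketch:

# Hypothetical adaptation for the question's dotted column names
cols_list = ["A.p1", "B.p1"]
expression = '+'.join('`{}`'.format(c) for c in cols_list)   # "`A.p1`+`B.p1`"
df = df.withColumn('sum_cols', expr(expression))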

