
I want to sum different columns in a Spark DataFrame.

Code

from pyspark.sql import functions as F

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Why aren't approaches #2 and #3 working? I am on Spark 2.2.

3 Answers


Because,

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

Here you are using the Python built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
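A minimal sketch (reusing the df from the question) of what that built-in sum actually builds: it walks the list and keeps applying +, so the result is a single Column expression; no data is touched at this point.

# Plain Python sum over a list of Column objects just chains `+`
cols = [df[c] for c in ["`A.p1`", "`B.p1`"]]
total = sum(cols)                        # a single Column expression, not a value
df.withColumn('sum1', total).show()     # sum1: 3, 93, 72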

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

Here you are using the PySpark sum function, which takes a column as input and aggregates it over rows, whereas you are trying to combine columns within each row. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
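To make the difference concrete, here is a short sketch (again reusing the df from the question) of what F.sum is actually meant for, namely aggregating a single column over all rows:

from pyspark.sql import functions as F

# F.sum is an aggregate: one output row per group (here, the whole frame)
df.select(F.sum(F.col("`A.p1`"))).show()                       # 1 + 4 + 12 = 17
df.agg(F.sum(F.col("`A.p1`")), F.sum(F.col("`B.p1`"))).show()  # 17 and 151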

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))

Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply sum over it.
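Just to illustrate that last point (a driver-side sketch for illustration only, since collecting gives up Spark's parallelism), iterating row-wise would look something like this:

# Row objects are iterable, so builtins.sum works per row after collect()
rows = df.select("`A.p1`", "`B.p1`").collect()
row_sums = [sum(row) for row in rows]    # [3, 93, 72]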


6 Comments

Thanks. Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it?
If you are using only two columns as mentioned, you can sum them straightaway: df.withColumn('sum1', df['A.p1'] + df['B.p1']). But if there are many columns, you can use a UDF.
I want to pass n columns in a list and sum them; my list of columns will change based on user input.
By the way, why is native Python sum() not benefiting?
It's not a PySpark function, so it won't really be completely benefiting from Spark, right?

TL;DR builtins.sum is just fine.


Following your comments:

Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it

and

it's not a PySpark function, so it won't really be completely benefiting from Spark, right.

I can see you are making incorrect assumptions.

Let's decompose the problem:

[df[col] for col in ["`A.p1`","`B.p1`"]] 

creates a list of Columns:

[Column<b'A.p1'>, Column<b'B.p1'>] 

Let's call it iterable.

sum reduces this list by taking its elements and calling the __add__ method (+). The imperative equivalent is:

accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
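The same reduction can also be written with functools.reduce (a sketch of the equivalence, not code from the original answer):

from functools import reduce
from operator import add

# reduce applies + pairwise over the Column objects, exactly like the loop above
accum = reduce(add, iterable)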

This gives Column:

Column<b'(A.p1 + B.p1)'> 

which is the same as calling

df["`A.p1`"] + df["`B.p1`"] 

No data has been touched, and when evaluated it benefits from all Spark optimizations.
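If you want to convince yourself, a quick check (assuming the df from the question) is to look at the plan Spark produces; the Python-side sum has already been folded into one column expression by the time Spark sees it:

# The plan shows a plain projection over the source data, with no Python UDF
expr_col = sum([df[c] for c in ["`A.p1`", "`B.p1`"]])
df.withColumn('sum1', expr_col).explain()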



Addition of multiple columns from a list into one column

I tried a lot of methods and the following are my observations:

  1. PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
  2. Python's built-in sum function works for some folks but gives an error for others (possibly because of a conflict in names).

In your 3rd approach, the expression (inside Python's sum function) is returning a PySpark DataFrame.

So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of columns. We can also use any other complex expression to get other results.
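One caveat worth adding (my note, not part of the original answer): for column names that contain dots, like A.p1 and B.p1 in the question, each name needs backticks in the joined expression, otherwise expr parses the dot as a struct-field reference. A sketch:

# Hypothetical adaptation for the question's dotted column names
cols_list = ["A.p1", "B.p1"]
expression = '+'.join('`{}`'.format(c) for c in cols_list)   # "`A.p1`+`B.p1`"
df = df.withColumn('sum_cols', expr(expression))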

