
Suppose I have a pyspark data frame as:

col1 col2 col3
1    2    -3
2    null 5
4    4    8
1    0    9

I want to add a column called check that counts, for each row, the number of values greater than 0.

The final output will be:

col1 col2 col3 check
1    2    -3   2
2    null 5    2
4    4    8    3
1    0    9    2

I was trying this, but it errors out as below:

df = df.withColumn("check", sum((df[col] > 0) for col in df.columns))

Invalid argument, not a string or column: <generator object at 0x7f0a866ae580> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

3 Answers


I don't know if there is a simpler SQL-based solution, but it's pretty straightforward with a UDF.

from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()

Not sure if it handles nulls. Add a null check (if a and a > 0) in the UDF if needed.
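For example, a null-safe variant of the same UDF might look like this (just a sketch, reusing the imports shown above):

# skip None values before comparing, so null cells don't raise a TypeError
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()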

Idea: https://stackoverflow.com/a/42540401/496289


Your code shows you doing a sum of the non-zero columns, not a count. If you need the sum instead:

count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType()) 

2 Comments

This didn't work. It throws an error when trying to display: PythonException: 'TypeError: '>' not supported between instances of 'NoneType' and 'int'', from <command-1708360697702947>, line 3. Full traceback below:
@Messy, "Not sure if it'll handle nulls. Add null check (if a and a > 0) in udf if needed."

Create a new array column, filter it to keep only the elements greater than 0, then count the remaining elements with size.

Example:

from pyspark.sql.functions import expr

df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1   |2   |-3  |
#|2   |null|5   |
#+----+----+----+

df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1   |2   |-3  |2    |
#|2   |null|5   |1    |
#+----+----+----+-----+
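Applied to all three columns from the question, the same expression would look like the sketch below (this relies on Spark 2.4+, where higher-order functions like filter are available; filter drops the null, so it contributes nothing to the count):

from pyspark.sql.functions import expr

df.withColumn("check", expr("size(filter(array(col1,col2,col3), x -> x > 0))")).show()

This should give the check values 2, 2, 3, 2 expected in the question.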



You can use functools.reduce to add up, over the columns in df.columns, a 1 for every value that is > 0, like this:

from pyspark.sql import functions as F
from operator import add
from functools import reduce

df = spark.createDataFrame([
    (1, 2, -3),
    (2, None, 5),
    (4, 4, 8),
    (1, 0, 9)
], ["col1", "col2", "col3"])

df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)

df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|   1|   2|  -3|    2|
#|   2|null|   5|    2|
#|   4|   4|   8|    3|
#|   1|   0|   9|    2|
#+----+----+----+-----+
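Note that F.when(F.col(c) > 0, 1).otherwise(0) is already null-safe: null > 0 evaluates to null in Spark SQL, which when treats as not matched, so the otherwise(0) branch applies and a null cell simply contributes 0 to the count.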

