I have a dataframe and I need to count the number of non-zero columns per row in PySpark.
```
ID  COL1  COL2  COL3
1   0     1     -1
2   0     0     0
3   -17   20    15
4   23    1     0
```

Expected output:
```
ID  COL1  COL2  COL3  Count
1   0     1     -1    2
2   0     0     0     0
3   -17   20    15    3
4   23    1     0     2
```
There are various approaches to achieve this; below is one of the simpler ones.
```
df = sqlContext.createDataFrame(
    [[1, 0, 1, -1], [2, 0, 0, 0], [3, -17, 20, 15], [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)

# Column list, excluding the ID column
df.columns[1:]
# ['COL1', 'COL2', 'COL3']

# Import functions
from pyspark.sql import functions as F

# Add a "count" column: the row-wise sum of (1 if column != 0 else 0)
df.withColumn(
    "count",
    sum([F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]])
).show()

+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
|  1|   0|   1|  -1|    2|
|  2|   0|   0|   0|    0|
|  3| -17|  20|  15|    3|
|  4|  23|   1|   0|    2|
+---+----+----+----+-----+
```

A note on `sum` vs. `F.sum`: `F.sum` would not work in this situation, because it is an aggregate that sums a column across rows, while here we need a sum at the row level across columns. The `sum` used above is Python's built-in, which simply adds up the list of column expressions. This works as long as you import via `from pyspark.sql import functions as F`. If you instead import like `from pyspark.sql.functions import *` or `from pyspark.sql.functions import sum, col, when`, the built-in `sum` is shadowed by Spark's aggregate `sum`, and you will get the exception mentioned above.

Follow-up comment: I tried variations with `expr`, none of which work:

```
tmp2 = tmp2.select('*', expr(f"sum([when(col(cl) > 0, 1) for cl in {triglist}])"))

.withColumn('num_unq_triggers', expr(count([when(col(cl) > 0, 1) for cl in triglist])))

.withColumn('num_unq_triggers', expr(sum([when(col(cl) > 0, 1).otherwise(0) for cl in triglist])))
```

I've tried putting quotation marks inside the brackets for the `expr` function; that also doesn't work. Also tried all sorts of combinations of `F.when` and `F.sum`.