
I have a DataFrame and need to count the number of non-zero columns in each row, using PySpark.

    ID  COL1  COL2  COL3
    1   0     1     -1
    2   0     0     0
    3   -17   20    15
    4   23    1     0

Expected Output:

    ID  COL1  COL2  COL3  Count
    1   0     1     -1    2
    2   0     0     0     0
    3   -17   20    15    3
    4   23    1     0     1
  • What have you tried? Commented May 2, 2019 at 4:45
  • The input and the expected output do not match. Commented May 2, 2019 at 4:58

1 Answer


There are several ways to do this; below is one of the simpler ones:

    df = sqlContext.createDataFrame(
        [[1, 0, 1, -1], [2, 0, 0, 0], [3, -17, 20, 15], [4, 23, 1, 0]],
        ["ID", "COL1", "COL2", "COL3"]
    )

    # Check the column list, excluding the ID column
    df.columns[1:]
    # ['COL1', 'COL2', 'COL3']

    # Import functions
    from pyspark.sql import functions as F

    # Add a new column "count": the sum of (1 if column != 0 else 0) over the value columns
    df.withColumn(
        "count",
        sum([F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]])
    ).show()

    +---+----+----+----+-----+
    | ID|COL1|COL2|COL3|count|
    +---+----+----+----+-----+
    |  1|   0|   1|  -1|    2|
    |  2|   0|   0|   0|    0|
    |  3| -17|  20|  15|    3|
    |  4|  23|   1|   0|    2|
    +---+----+----+----+-----+

5 Comments

I keep getting "Column is not iterable" when I use the above code, so I changed the code to:

    row_Sum_not_0 = (reduce(add, (when(column(x) != 0, 1).otherwise(0) for x in df.columns[1:]))).alias("count")
    df.select(row_Sum_not_0).show()

@VivekReddy That is the same thing. The example above was tested on Spark v1.6 and runs without any exception. You are probably missing something.
@jxc There is a difference between sum and F.sum. F.sum won't work in this situation, because we need a sum at row level rather than at column level. F.sum is for summing the values of a column across rows.
@VivekReddy How are you importing functions? If you import with from pyspark.sql import functions as F, you are fine. If you import with from pyspark.sql.functions import * or from pyspark.sql.functions import sum, col, when, you will get the exception you mentioned above, because the imported sum replaces Python's built-in sum.
I am having the same issue, but none of these suggestions work; I keep getting 'Column not iterable'. I've tried many combinations, in particular:

    tmp2 = tmp2.select('*', expr(f"sum([when(col(cl) > 0, 1) for cl in {triglist}])"))
    .withColumn('num_unq_triggers', expr(count([when(col(cl) > 0, 1) for cl in triglist])))
    .withColumn('num_unq_triggers', expr(sum([when(col(cl) > 0, 1).otherwise(0) for cl in triglist])))

I've tried putting quotation marks inside the brackets for the expr function, and that doesn't work either. I also tried all sorts of combinations of f.when or f.sum.
