I have a dataframe and I need to count the number of non-zero columns per row in PySpark.
```
ID  COL1  COL2  COL3
1   0     1     -1
2   0     0     0
3   -17   20    15
4   23    1     0
```

Expected output:
```
ID  COL1  COL2  COL3  Count
1   0     1     -1    2
2   0     0     0     0
3   -17   20    15    3
4   23    1     0     2
```
There are various approaches to achieve this; below is one of the simpler ones.
```
df = sqlContext.createDataFrame(
    [[1, 0, 1, -1], [2, 0, 0, 0], [3, -17, 20, 15], [4, 23, 1, 0]],
    ["ID", "COL1", "COL2", "COL3"]
)

# Column list, excluding the ID column
df.columns[1:]
# ['COL1', 'COL2', 'COL3']

# Import functions
from pyspark.sql import functions as F

# Add a "count" column: the row-wise sum of (1 if column != 0 else 0)
df.withColumn(
    "count",
    sum([F.when(F.col(cl) != 0, 1).otherwise(0) for cl in df.columns[1:]])
).show()

+---+----+----+----+-----+
| ID|COL1|COL2|COL3|count|
+---+----+----+----+-----+
|  1|   0|   1|  -1|    2|
|  2|   0|   0|   0|    0|
|  3| -17|  20|  15|    3|
|  4|  23|   1|   0|    2|
+---+----+----+----+-----+
```

A note on `sum` vs. `F.sum`: `F.sum` would not work in this situation, because it is an aggregate that sums a column across rows, while here we need a sum at the row level across columns. The `sum` used above is Python's built-in, which simply adds up the list of column expressions. This works as long as you import via `from pyspark.sql import functions as F`. If you instead import like `from pyspark.sql.functions import *` or `from pyspark.sql.functions import sum, col, when`, the built-in `sum` is shadowed by Spark's aggregate `sum`, and you will get the exception mentioned above.

Follow-up comment: I tried variations with `expr`, none of which work:

```
tmp2 = tmp2.select('*', expr(f"sum([when(col(cl) > 0, 1) for cl in {triglist}])"))

.withColumn('num_unq_triggers', expr(count([when(col(cl) > 0, 1) for cl in triglist])))

.withColumn('num_unq_triggers', expr(sum([when(col(cl) > 0, 1).otherwise(0) for cl in triglist])))
```

I've tried putting quotation marks inside the brackets for the `expr` function; that also doesn't work. Also tried all sorts of combinations of `F.when` and `F.sum`.