I want to count the number of nulls in each row of a DataFrame.
Note that there are 50+ columns. I know I could do this with a case/when statement, but I would prefer a neater solution.
For example, a subset:
    columns = ['id', 'item1', 'item2', 'item3']
    vals = [(1, 2, 'A', None), (2, None, '1', None), (3, None, '9', 'C')]
    df = spark.createDataFrame(vals, columns)
    df.show()

    +---+-----+-----+-----+
    | id|item1|item2|item3|
    +---+-----+-----+-----+
    |  1|    2|    A| null|
    |  2| null|    1| null|
    |  3| null|    9|    C|
    +---+-----+-----+-----+

After running the code, the desired output is:
    +---+-----+-----+-----+--------+
    | id|item1|item2|item3|numNulls|
    +---+-----+-----+-----+--------+
    |  1|    2|    A| null|       1|
    |  2| null|    1| null|       2|
    |  3| null|    9|    C|       1|
    +---+-----+-----+-----+--------+

EDIT: Not all of the non-null values are ints.
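
For reference, here is a sketch of one approach I am considering (not verified, and I'm not sure it's the neatest): build a 0/1 null indicator per column with `isNull().cast('int')` and sum the indicators across every column except `id`. Because `isNull()` works on any column type, this should be unaffected by the non-int values. The names `check_cols`, `null_counter`, and `df_with_nulls` are just illustrative.

    from functools import reduce
    from operator import add
    from pyspark.sql import functions as F

    # Columns to check for nulls (everything except the id column)
    check_cols = [c for c in df.columns if c != 'id']

    # 1 where the value is null, 0 otherwise, summed across the row
    null_counter = reduce(add, [F.col(c).isNull().cast('int') for c in check_cols])

    df_with_nulls = df.withColumn('numNulls', null_counter)
    df_with_nulls.show()

Is this reasonable for 50+ columns, or is there a neater/faster way?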