I want to count the number of nulls per row in a DataFrame.

Please note, there are 50+ columns, I know I could do a case/when statement to do this, but I would prefer a neater solution.

For example, a subset:

columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, '2', 'A', None), (2, None, '1', None), (3, None, '9', 'C')]
df = spark.createDataFrame(vals, columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    A| null|
|  2| null|    1| null|
|  3| null|    9|    C|
+---+-----+-----+-----+

After running the code, the desired output is:

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

EDIT: Not all non-null values are ints.

2 Answers

Convert each null to 1 and everything else to 0, then sum across all the columns:

df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

6 Comments

The values are not actually always ints, I have updated the question to reflect that.
The answer doesn't assume ints. It checks for null generally: each value is replaced with 1 if it is null and 0 otherwise, and then the columns are summed.
Works perfectly. Thanks.
I get the error TypeError: 'Column' object is not callable for the same example
It turned out I was using pyspark.sql.functions.sum instead of Python's built-in sum, which caused the problem for me. More about the difference here
# Create a new DataFrame with only the 'id' column and a 'numNulls' column
# (the count of null values in each row).
# To build it, convert the original DataFrame to an RDD, count the Nones in
# each row, and convert the result back to a DataFrame.
df2 = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id', 'numNulls'])
df2.show()
+---+--------+
| id|numNulls|
+---+--------+
|  1|       1|
|  2|       2|
|  3|       1|
+---+--------+

# Now join the original and new DataFrames on the 'id' column
df3 = df.join(df2, df.id == df2.id, 'inner').drop(df2.id)
df3.show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

1 Comment

Please add your answer in textual format. Happy coding!
