I want to count the number of nulls per row in a DataFrame.

Please note, there are 50+ columns, I know I could do a case/when statement to do this, but I would prefer a neater solution.

For example, a subset:

columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, '2', 'A', None), (2, None, '1', None), (3, None, '9', 'C')]
df = spark.createDataFrame(vals, columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
|  1|    2|    A| null|
|  2| null|    1| null|
|  3| null|    9|    C|
+---+-----+-----+-----+

After running the code, the desired output is:

+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

EDIT: Not all non-null values are ints.

2 Answers

Convert each null to 1 and everything else to 0, then sum across all the columns:

df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

6 Comments

The values are not actually always ints, I have updated the question to reflect that.
The answer doesn't assume ints. It checks for null generally: each value is replaced with 1 if it is null and 0 otherwise, and then the columns are summed.
Works perfectly. Thanks.
I get the error TypeError: 'Column' object is not callable for the same example
It turned out I was using pyspark.sql.functions.sum instead of Python's built-in sum, which caused the problem for me. More about the difference here
# Create a new DataFrame with only the 'id' column and a 'numNulls' column
# (the count of null values in each row).
# To build it, convert the original DataFrame to an RDD, count the Nones in
# each row, and convert the result back to a DataFrame.
df2 = df.rdd.map(lambda x: (x[0], x.count(None))).toDF(['id', 'numNulls'])
df2.show()
+---+--------+
| id|numNulls|
+---+--------+
|  1|       1|
|  2|       2|
|  3|       1|
+---+--------+

# Now join the original and new DataFrames on the 'id' column
df3 = df.join(df2, df.id == df2.id, 'inner').drop(df2.id)
df3.show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
|  1|    2|    A| null|       1|
|  2| null|    1| null|       2|
|  3| null|    9|    C|       1|
+---+-----+-----+-----+--------+

1 Comment

Please add your answer in textual format. Happy coding!
