import numpy as np

data = [
    (1, 1, None),
    (1, 2, float(5)),
    (1, 3, np.nan),
    (1, 4, None),
    (1, 5, float(10)),
    (1, 6, float("nan")),
    (1, 6, float("nan")),
]

df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))

Expected output

A dataframe with the count of NaN/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null and not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?

1 Comment
  • Is there any solution for Scala? Commented Feb 16, 2022 at 4:45

12 Answers


You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+

6 Comments

isNull vs isnan. These two links will help you. isnan() is a function in the pyspark.sql.functions package, so you pass the column you want to check as an argument of the function. isNull() is a method on pyspark.sql.Column, so you call it as yourColumn.isNull() (see the sketch after these comments).
I am getting an error with this: df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show() - Is there any library I need to import? The error I am getting is "illegal start of simple expression".
This solution does not work for timestamp columns
@EricBellet for timestamp columns you can utilize df.dtypes: df.select([f.count(f.when(f.isnan(c), c)).alias(c) for c, t in df.dtypes if t != "timestamp"]).show()
Scala equivalent: df.select(df.columns.map(c => count(when(isnan(col(c)), c)).alias(c)):_*)
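Putting the comments above together, here is a minimal sketch (assuming the question's df) that counts NaN/null per column while skipping timestamp columns via df.dtypes, where isnan is not defined:

import pyspark.sql.functions as F

# isnan() is a function from pyspark.sql.functions and takes a column as its argument;
# isNull() is a method on the Column object itself, hence col(c).isNull().
df.select([
    F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
    for c, t in df.dtypes
    if t != "timestamp"  # isnan() fails on timestamp columns
]).show()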

For null values in a PySpark dataframe:

Dict_Null = {col: df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null

# The output is a dict where the key is the column name and the value is the number of null values in that column
{'#': 0, 'Name': 0, 'Type 1': 0, 'Type 2': 386, 'Total': 0, 'HP': 0, 'Attack': 0,
 'Defense': 0, 'Sp_Atk': 0, 'Sp_Def': 0, 'Speed': 0, 'Generation': 0, 'Legendary': 0}



To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """
    Counts the number of nulls and nans in each column
    """
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for (c, c_type) in spark_df.dtypes
                          if c_type not in ('timestamp', 'string', 'date')]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df

If you want to see the columns sorted by the number of nans and nulls in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C |  2 |
# | Col_B |  1 |

If you don't want ordering and want to see them as a single row:

count_missings(spark_df, False)

# | Col_A | Col_B | Col_C |
# |    10 |     1 |     2 |

5 Comments

This function is computationally expensive for large datasets.
Why do you think so?
Add 'boolean' and 'binary' to your exclusion list.
Dangerous, because it silently ignores nulls in any of the excluded types (see the sketch after these comments).
Fails on PySpark 3 with 'float' object has no attribute 'tzinfo'.
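To address the concern about silently ignoring nulls in excluded types, one possible adjustment is to still count plain nulls for the columns that are skipped for isnan. This is a sketch under the assumption that only float/double columns can hold NaN (the NUMERIC_DTYPES answer further down takes a similar approach):

import pyspark.sql.functions as F

def count_missings_all(spark_df, sort=True):
    """NaN+null counts for float/double columns, plain null counts for everything else."""
    exprs = []
    for c, c_type in spark_df.dtypes:
        if c_type in ('double', 'float'):
            exprs.append(F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c))
        else:
            exprs.append(F.count(F.when(F.isnull(c), c)).alias(c))
    df = spark_df.select(exprs).toPandas()
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)
    return df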

An alternative to the already provided ways is simply to filter on the column, like so:

import pyspark.sql.functions as F

df = df.where(F.col('columnNameHere').isNull())

This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.

2 Comments

Overwrites df, maybe that's not intended. OP asks for a count; it should probably be x.where(col(colname).isNull()).count() for x a dataframe and colname a string.
I might be missing something @pauljohn32, but it seems to me that your suggestion is exactly the same as my response; you just added the call to count() at the end. I think I am clear in my response that my code snippet shows how to do the filtering. Adding df.count() at the end should be considered a trivial addition. No?

Here is my one-liner, where 'c' is the name of the column:

from pyspark.sql.functions import col

df.select('c').withColumn('isNull_c', col('c').isNull()).where('isNull_c = True').count()



I prefer this solution:

from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)
counter = df.count()
# count(c) only counts non-null values, so counter - count(c) is the number of nulls per column
df = df.select([(counter - count(c)).alias(c) for c in df.columns])



Here's a method that avoids any pitfalls with isnan or isNull and works with any datatype:

from pyspark.sql import DataFrame

# spark is a pyspark.sql.SparkSession object
def count_nulls(df: DataFrame) -> DataFrame:
    cache = df.cache()
    row_count = cache.count()
    return spark.createDataFrame(
        [[row_count - cache.select(col_name).na.drop().count()
          for col_name in cache.columns]],
        # schema=[(col_name, 'integer') for col_name in cache.columns]
        schema=cache.columns
    )


from pyspark.sql import DataFrame
import pyspark.sql.functions as fn

# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)

def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes
                         if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}

    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols] +
        [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )

Builds on gench's and user8183279's answers, but checks only via isnull for columns where isnan is not possible, rather than just ignoring them.

The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.



If you are writing Spark SQL, then the following will also work to find the null values and count them subsequently:

spark.sql('select * from table where isNULL(column_value)')
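To get per-column counts rather than the filtered rows, here is a sketch in the same SQL style; the table and column names are placeholders, and isnan only applies to numeric columns:

# 'my_table', 'col_a' and 'col_b' are placeholder names
spark.sql("""
    SELECT
        SUM(CAST(col_a IS NULL OR isnan(col_a) AS INT)) AS col_a_null_or_nan,
        SUM(CAST(col_b IS NULL AS INT)) AS col_b_nulls
    FROM my_table
""").show()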



Yet another alternative (building on Vamsi Krishna's solutions above):

from pyspark.sql.functions import isnan, isnull

def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

check_for_null_or_nan(df)


id2 has 5 nans/nulls



Use the following code to identify the null values in every column using PySpark.

import pandas as pd
from pyspark.sql.functions import count, when, isnull

def check_nulls(dataframe):
    '''
    Check null values and return the null values in a pandas DataFrame
    INPUT: Spark Dataframe
    OUTPUT: Null values
    '''
    # Create pandas dataframe
    nulls_check = pd.DataFrame(
        dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
        columns=dataframe.columns
    ).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check

# Check null values
null_df = check_nulls(raw_df)
null_df

1 Comment

What happens if the data is 1 TB in size? Don't convert to pandas; that defeats the purpose of using Spark in the first place.
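If you want to follow that advice and keep the result in Spark instead of pandas, a minimal variant of the same check (a sketch, not the original answerer's code) simply drops the conversion:

from pyspark.sql.functions import count, when, isnull

def check_nulls_spark(dataframe):
    # Same per-column null counts, but returned as a single-row Spark DataFrame
    return dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns])

check_nulls_spark(raw_df).show()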

Here is a readable solution because code is for people as much as computers ;-)

df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')
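To apply that expression to every column at once, here is a sketch that builds one sum(...) per column, assuming NaN is only possible in float/double columns:

# isnan() only makes sense for float/double columns; others get a null-only check
numeric = {c for c, t in df.dtypes if t in ('double', 'float')}
exprs = []
for c in df.columns:
    if c in numeric:
        exprs.append(f'sum(int(isnull({c}) or isnan({c}))) as {c}')
    else:
        exprs.append(f'sum(int(isnull({c}))) as {c}')
df.selectExpr(*exprs).show()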

