
PySpark Dataframe Groupby and Count Null Values

Referring to the solution link above, I am trying to apply the same logic, but grouped by "country" while counting the nulls of another column, and I am getting a "column is not iterable" failure. Can someone help with this?

df7.groupby("country").agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns)) 
  • Did you import sum from pyspark.sql.functions? The Python builtin sum tries to iterate the Column, which is exactly what raises "column is not iterable".
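If the builtin sum is indeed the culprit, importing Spark's sum fixes it. A minimal sketch of the corrected aggregation, assuming df7 and columns are defined as in the question:

from pyspark.sql import functions as F

# Spark's sum (via the F alias) aggregates the 0/1 null flags per group;
# the * unpacks the generator so agg() receives one Column per input column
df7.groupby("country").agg(
    *(F.sum(F.col(c).isNull().cast("int")).alias(c) for c in columns)
).show()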

1 Answer

import pyspark.sql.functions as funcs

# Per column: count rows where the value is NaN or null
covid_india_df.select(
    [funcs.count(funcs.when(funcs.isnan(clm) | funcs.col(clm).isNull(), clm)).alias(clm)
     for clm in covid_india_df.columns]
).show()

The above approach should get you correct results. Check here for a complete example.


1 Comment

In this case it worked when summing the entire column, but I needed it grouped by country, so I tried the following: df7.groupby("country").agg(F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)).alias(c) for c in columns).show() and in this case I received an "all exprs should be Column" error.
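The "all exprs should be Column" error comes from passing the generator expression to agg() as a single argument; agg() expects Column arguments, so the generator must be unpacked with *. A minimal sketch, assuming df7, columns, and the F alias from the comment above:

from pyspark.sql import functions as F

# Unpacking with * turns the generator into one Column argument per column;
# the count/when/isnan/isNull combination counts NaN and null entries per group
df7.groupby("country").agg(
    *(F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in columns)
).show()

Note that F.isnan is only defined for float/double columns, so drop that check (keep only the isNull test) for string or date columns.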

