
PySpark Dataframe Groupby and Count Null Values

Referring to the solution link above, I am trying to apply the same logic, but grouped by "country" while counting the nulls of another column, and I am getting a "column is not iterable" failure. Can someone help with this?

df7.groupby("country").agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns)) 
  • Did you import sum from pyspark.sql.functions? The Python builtin sum tries to iterate the Column, which is exactly what raises "column is not iterable".
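If the builtin sum is indeed the culprit, importing Spark's sum fixes it. A minimal sketch of the corrected aggregation, assuming df7 and columns are defined as in the question:

from pyspark.sql import functions as F

# Spark's sum (via the F alias) aggregates the 0/1 null flags per group;
# the * unpacks the generator so agg() receives one Column per input column
df7.groupby("country").agg(
    *(F.sum(F.col(c).isNull().cast("int")).alias(c) for c in columns)
).show()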

1 Answer

import pyspark.sql.functions as funcs

# Per column: count rows where the value is NaN or null
covid_india_df.select(
    [funcs.count(funcs.when(funcs.isnan(clm) | funcs.col(clm).isNull(), clm)).alias(clm)
     for clm in covid_india_df.columns]
).show()

The above approach should get you correct results. Check here for a complete example.


1 Comment

In this case it worked when summing the entire column, but I needed it grouped by country, so I tried the following: df7.groupby("country").agg(F.count(F.when((F.isnan(c) | F.col(c).isNull()), c)).alias(c) for c in columns).show() and in this case I received an "all exprs should be Column" error.
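The "all exprs should be Column" error comes from passing the generator expression to agg() as a single argument; agg() expects Column arguments, so the generator must be unpacked with *. A minimal sketch, assuming df7, columns, and the F alias from the comment above:

from pyspark.sql import functions as F

# Unpacking with * turns the generator into one Column argument per column;
# the count/when/isnan/isNull combination counts NaN and null entries per group
df7.groupby("country").agg(
    *(F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in columns)
).show()

Note that F.isnan is only defined for float/double columns, so drop that check (keep only the isNull test) for string or date columns.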

