
I have a Spark Dataframe of the following form:

+------+-------+-----+--------+
| Year | Month | Day | Ticker |
+------+-------+-----+--------+

I am trying to group all of the values by "year" and count the number of missing values in each column per year.

I found the following snippet (forgot where from):

from pyspark.sql.functions import col, sum

df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

This works perfectly when calculating the number of missing values per column. However, I'm not sure how I would modify this to calculate the missing values per year.

Any pointers in the right direction would be much appreciated.

1 Answer


You can just use the same logic and add a groupby. Note that I also removed "year" from the aggregated columns, but that's optional (you would get two 'year' columns).

columns = filter(lambda x: x != "year", df.columns)
df.groupBy("year")\
  .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))\
  .show()

4 Comments

Cheers; I tried this but received an error before, not sure why.
Ah, my bad, I had not replaced "select" with "agg" which was causing the error. Thank you for the help :-)
You don't even need the explicit .cast('int'), though it may help readability. If you omit it, an implicit cast will happen and will give the same result.
It turns out we need .cast('int')... apologies