
I have a Spark Dataframe of the following form:

+------+-------+-----+--------+
| Year | Month | Day | Ticker |
+------+-------+-----+--------+

I am trying to group all of the values by "year" and count the number of missing values in each column per year.

I found the following snippet (forgot where from):

from pyspark.sql.functions import col, sum

df.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in df.columns)).show()

This works perfectly when calculating the number of missing values per column. However, I'm not sure how I would modify this to calculate the missing values per year.

Any pointers in the right direction would be much appreciated.

1 Answer


You can just use the same logic and add a groupby. Note that I also removed "year" from the aggregated columns, but that's optional (you would get two 'year' columns).

columns = filter(lambda x: x != "year", df.columns)
df.groupBy("year")\
  .agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))\
  .show()

4 Comments

Cheers; I tried this but received an error before, not sure why.
Ah, my bad, I had not replaced "select" with "agg" which was causing the error. Thank you for the help :-)
You don't even need the explicit .cast('int'), though it may help readability. If you omit it, an implicit cast will happen and will give the same result.
It turns out we need .cast('int')... apologies