
Running a simple example -

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)] deptColumns = ["dept_name","dept_id"] rdd = sc.parallelize(dept) df = rdd.toDF(deptColumns) df.show(truncate=False) print('count the dept_id, should be 3') print('count: ' + str(df.select(F.col("dept_id")).count())) 

We get the following output -

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|null   |
|Sales    |30     |
|IT       |40     |
+---------+-------+

count the dept_id, should be 3
count: 4

I'm running on Databricks and this is my stack: Spark 3.0.1, Scala 2.12, DBR 7.3 LTS.

Thanks for any help!!

3 Answers


There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first one simply counts the rows, while the second one can ignore null values.

You are using DataFrame.count(). According to the documentation, this function

returns the number of rows in this DataFrame

So the result 4 is correct, as there are 4 rows in the DataFrame.
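
As a quick check against the DataFrame from the question (a minimal sketch, assuming df and the functions import F from above):

# DataFrame.count() counts rows, so the Marketing row with dept_id = None is included
print(df.count())                           # 4
print(df.select(F.col("dept_id")).count())  # 4 -- selecting the column does not drop nulls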

If null values should be ignored, you can use the Spark SQL function count which can ignore null values:

count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.

For example

df.selectExpr("count(dept_id)").show() 

returns 3.
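
The same null-ignoring count can also be run as a SQL query against a temporary view (a sketch, assuming the df from the question; the view name dept_view is just an illustration):

# count(dept_id) in SQL skips rows where dept_id is null
df.createOrReplaceTempView("dept_view")
spark.sql("SELECT count(dept_id) AS cnt FROM dept_view").show()
# +---+
# |cnt|
# +---+
# |  3|
# +---+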


1 Comment

I've used this, which apparently includes non-null as well: spark.apache.org/docs/3.1.1/api/python/reference/api/…

An alternative to @werner's solution is to use pyspark.sql.functions:

from pyspark.sql import functions as F

print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()))
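
collect() here returns a list of Row objects; if only the number is needed, it can be pulled out by indexing (a sketch over the same df):

# F.count ignores nulls, so this prints 3 rather than 4
cnt = df.select(F.count(F.col("dept_id"))).collect()[0][0]
print('count: ' + str(cnt))  # count: 3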



Your code is not complete. Maybe you can just add isNotNull() before count().

My code looks like this:

from pyspark.sql.functions import col, count

print('count the dept_id, should be 3')
print('count: ' + str(df.filter(col("dept_id").isNotNull()).count()))
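
An equivalent way to drop the null row before counting is DataFrame.na.drop with a subset (not part of this answer, just a sketch over the same sample data):

# na.drop(subset=...) removes rows where dept_id is null, then count() sees 3 rows
print(df.na.drop(subset=["dept_id"]).count())  # 3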

2 Comments

The code should count 3 entries (discarding the null one), but it counts to 4. I also tried this and it still counts to 4:
print('count in spark')
print(df.select(F.when(F.col('dept_id').isNull(), True).otherwise(None)).count())
I don't want to use filter because it works on the df itself; I'm going to use this in an aggregate function on a single column, and there is no filter on a column.
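
For that use case, F.count can go straight into an aggregation without filtering the DataFrame, since it already skips nulls per column (a sketch over the sample data; the alias names are only illustrative):

# F.count(col) skips nulls, F.count(F.lit(1)) counts every row
df.agg(
    F.count(F.col("dept_id")).alias("non_null_dept_ids"),  # 3
    F.count(F.lit(1)).alias("total_rows")                  # 4
).show()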
