
Running a simple example -

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)] deptColumns = ["dept_name","dept_id"] rdd = sc.parallelize(dept) df = rdd.toDF(deptColumns) df.show(truncate=False) print('count the dept_id, should be 3') print('count: ' + str(df.select(F.col("dept_id")).count())) 

We get the following output -

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|null   |
|Sales    |30     |
|IT       |40     |
+---------+-------+

count the dept_id, should be 3
count: 4

I'm running on Databricks and this is my stack: Spark 3.0.1, Scala 2.12, DBR 7.3 LTS.

Thanks for any help!!

3 Answers


There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first one simply counts the rows, while the second one can ignore null values.

You are using DataFrame.count(). According to the documentation, this function

returns the number of rows in this DataFrame

So the result 4 is correct, as there are 4 rows in the DataFrame.
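
As a quick check against the DataFrame from the question (a minimal sketch, assuming df and the functions import F from above):

# DataFrame.count() counts rows, so the Marketing row with dept_id = None is included
print(df.count())                           # 4
print(df.select(F.col("dept_id")).count())  # 4 -- selecting the column does not drop nulls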

If null values should be ignored, you can use the Spark SQL function count which can ignore null values:

count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.

For example

df.selectExpr("count(dept_id)").show() 

returns 3.
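
The same null-ignoring count can also be run as a SQL query against a temporary view (a sketch, assuming the df from the question; the view name dept_view is just an illustration):

# count(dept_id) in SQL skips rows where dept_id is null
df.createOrReplaceTempView("dept_view")
spark.sql("SELECT count(dept_id) AS cnt FROM dept_view").show()
# +---+
# |cnt|
# +---+
# |  3|
# +---+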


1 Comment

I've used this, which apparently includes non-null as well: spark.apache.org/docs/3.1.1/api/python/reference/api/…

An alternative to @werner's solution is to use pyspark.sql.functions:

from pyspark.sql import functions as F

print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()))
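
collect() here returns a list of Row objects; if only the number is needed, it can be pulled out by indexing (a sketch over the same df):

# F.count ignores nulls, so this prints 3 rather than 4
cnt = df.select(F.count(F.col("dept_id"))).collect()[0][0]
print('count: ' + str(cnt))  # count: 3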



Your code is not complete. Maybe you can just add isNotNull() before count().

My code looks like this:

from pyspark.sql.functions import col, count

print('count the dept_id, should be 3')
print('count: ' + str(df.filter(col("dept_id").isNotNull()).count()))
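
An equivalent way to drop the null row before counting is DataFrame.na.drop with a subset (not part of this answer, just a sketch over the same sample data):

# na.drop(subset=...) removes rows where dept_id is null, then count() sees 3 rows
print(df.na.drop(subset=["dept_id"]).count())  # 3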

2 Comments

The code should count 3 entries (discarding the null one), but it counts to 4. I also tried this and it still counts to 4:
print('count in spark')
print(df.select(F.when(F.col('dept_id').isNull(), True).otherwise(None)).count())
I don't want to use filter because it works on the df itself; I'm going to use this in an aggregate function on a single column, and there is no filter on a column.
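
For that use case, F.count can go straight into an aggregation without filtering the DataFrame, since it already skips nulls per column (a sketch over the sample data; the alias names are only illustrative):

# F.count(col) skips nulls, F.count(F.lit(1)) counts every row
df.agg(
    F.count(F.col("dept_id")).alias("non_null_dept_ids"),  # 3
    F.count(F.lit(1)).alias("total_rows")                  # 4
).show()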
