Count values by condition in PySpark Dataframe

In PySpark, you can use the filter and count methods to count values that satisfy a condition in a DataFrame. filter keeps only the rows matching the condition, and count returns the number of rows that remain.

Here's a step-by-step guide on how to count values by condition in a PySpark DataFrame:

  • First, set up your PySpark environment:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("countValuesByCondition").getOrCreate()
  • Create a sample DataFrame:
from pyspark.sql import Row

data = [Row(name="Alice", age=25),
        Row(name="Bob", age=30),
        Row(name="Charlie", age=25),
        Row(name="David", age=28),
        Row(name="Eva", age=30)]

df = spark.createDataFrame(data)
df.show()
  • Count the values by condition:

Let's count the number of people with age 30:

count_age_30 = df.filter(df.age == 30).count()
print(count_age_30)
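
Equivalently, you can spell the same condition with the col function from pyspark.sql.functions; this is just an alternative form of the filter above, not a different technique:

from pyspark.sql.functions import col

# Same count as above, written with col() instead of attribute access
count_age_30 = df.filter(col("age") == 30).count()
print(count_age_30)  # 2 for the sample data (Bob and Eva)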

If you have multiple conditions, you can use the & (and), | (or), and ~ (not) operators:

# Count people with age 30 and name Bob
count_bob_age_30 = df.filter((df.age == 30) & (df.name == "Bob")).count()
print(count_bob_age_30)

# Count people with age less than 30 or name Eva
count_condition = df.filter((df.age < 30) | (df.name == "Eva")).count()
print(count_condition)

Remember to always wrap individual conditions in parentheses when combining them, because Python's & and | operators bind more tightly than comparisons such as ==.
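
If you prefer, filter also accepts a SQL expression string, which sidesteps the precedence issue because the whole condition is parsed with SQL rules. A rough equivalent of the two counts above:

# Same conditions expressed as SQL strings
count_bob_age_30 = df.filter("age = 30 AND name = 'Bob'").count()
count_condition = df.filter("age < 30 OR name = 'Eva'").count()
print(count_bob_age_30, count_condition)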

  • (Optional) If you want a count for each unique value of a column, you can use groupBy:
# Count the number of people for each age
df.groupBy("age").count().show()
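
If you need several conditional counts in a single pass over the data, one common pattern (a sketch, assuming pyspark.sql.functions is imported as F) is to combine when with count inside agg; count ignores the nulls that when produces for non-matching rows:

from pyspark.sql import functions as F

# Several conditional counts computed in one aggregation
df.agg(
    F.count(F.when(F.col("age") == 30, True)).alias("age_30"),
    F.count(F.when(F.col("age") < 30, True)).alias("under_30"),
).show()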

This way, you can efficiently count values based on conditions in a PySpark DataFrame.

