How to count unique IDs after groupBy in a PySpark DataFrame?

In PySpark, you can count unique IDs in a DataFrame after a groupBy operation by using the agg() function along with the countDistinct() function from the pyspark.sql.functions module.

Here's a step-by-step guide:

  • Import the necessary classes and functions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct
  • Create a SparkSession:
spark = SparkSession.builder.appName('UniqueIDCounter').getOrCreate() 
  • Create a DataFrame:

The DataFrame df needs at least two columns: one to group by and another holding the IDs whose distinct values you want to count. The sample data below deliberately repeats an ID within each group.

# Sample DataFrame creation
data = [
    ('group1', 'id1'),
    ('group1', 'id2'),
    ('group1', 'id2'),
    ('group2', 'id3'),
    ('group2', 'id4'),
    ('group2', 'id4')
]
df = spark.createDataFrame(data, ['group', 'id'])
  • Group by one column and count unique IDs in the other:
df_grouped = df.groupBy('group').agg(countDistinct('id').alias('unique_ids')) 
  • Show the result:
df_grouped.show() 

This will output the number of unique IDs for each group.
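Conceptually, the grouped aggregation in the steps above behaves like collecting each group's IDs into a set and taking the size of that set. The following plain-Python sketch illustrates that idea only; it is not how Spark executes the query, and the helper name count_unique_ids is hypothetical.

```python
from collections import defaultdict

# Same (group, id) pairs as the sample DataFrame in the guide.
rows = [
    ("group1", "id1"),
    ("group1", "id2"),
    ("group1", "id2"),
    ("group2", "id3"),
    ("group2", "id4"),
    ("group2", "id4"),
]


def count_unique_ids(pairs):
    """Hypothetical helper: number of distinct non-null IDs per group."""
    ids_by_group = defaultdict(set)
    for group, id_value in pairs:
        ids = ids_by_group[group]  # make sure the group exists even if all IDs are null
        if id_value is not None:   # countDistinct likewise ignores null values
            ids.add(id_value)
    return {group: len(ids) for group, ids in ids_by_group.items()}


unique_ids = count_unique_ids(rows)
print(unique_ids)  # {'group1': 2, 'group2': 2}
```

A set discards repeats on insertion, which is why the second ('group1', 'id2') row does not change the count.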

Here's the full PySpark example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

# Initialize Spark Session
spark = SparkSession.builder.appName('UniqueIDCounter').getOrCreate()

# Sample data
data = [
    ('group1', 'id1'),
    ('group1', 'id2'),
    ('group1', 'id2'),
    ('group2', 'id3'),
    ('group2', 'id4'),
    ('group2', 'id4')
]

# Creating DataFrame
df = spark.createDataFrame(data, ['group', 'id'])

# Group by 'group' and count distinct 'id'
df_grouped = df.groupBy('group').agg(countDistinct('id').alias('unique_ids'))

# Show the result
df_grouped.show()

# Stop the Spark session
spark.stop()

Output will look something like this:

+------+----------+
| group|unique_ids|
+------+----------+
|group1|         2|
|group2|         2|
+------+----------+

This indicates that for group1, there are 2 unique IDs, and for group2, there are also 2 unique IDs. The countDistinct function ensures that even if there are duplicate IDs within the group, each ID is only counted once.

