9

I have a pySpark dataframe, I want to group by a column and then find unique items in another column for each group.

In pandas I could do,

data.groupby(by=['A'])['B'].unique() 

I want to do the same with my spark dataframe. I could find the distictCount of items in the group and count also, like this

(spark_df.groupby('A') .agg( fn.countDistinct(col('B')) .alias('unique_count_B'), fn.count(col('B')) .alias('count_B') ) .show()) 

But I couldn't find some function to find unique items in the group.

For clarifying more consider a sample dataframe,

df = spark.createDataFrame( [(1, "a"), (1, "b"), (1, "a"), (2, "c")], ["A", "B"]) 

I am expecting to get an output like this,

+---+----------+ | A| unique_B| +---+----------+ | 1| [a, b] | | 2| [c] | +---+----------+ 

How to do get the output as in pandas in pySpark.?

0

3 Answers 3

13

I used collect_set for my purpose like this,

(df.groupby('A') .agg( fn.collect_set(col('B')).alias('unique_count_B') ) .show()) 

I get the following output as I need,

+---+--------------+ | A|unique_count_B| +---+--------------+ | 1| [b, a]| | 2| [c]| +---+--------------+ 
Sign up to request clarification or add additional context in comments.

Comments

1

you can use the following code, that uses Window functions.

from pyspark.sql import functions as F from pyspark.sql import Window df = spark.createDataFrame( [(1, "a"), (1, "b"), (1, "a"), (2, "c")], ["A", "B"]) win = Window.partitionBy("A", "B") df.withColumn("distinct AB", F.count("*").over(win)).distinct().show() 

The result is:

+---+---+-----------+ | A| B|distinct AB| +---+---+-----------+ | 2| c| 1| | 1| a| 2| | 1| b| 1| +---+---+-----------+ 

3 Comments

I have updated my question with a sample data frame and output I need.
In this case, try pyspark.sql.functions.collect_set instead of F.count
yeah, I just figured it out
-1

You could use a list comprehension for this After your grouping let's say the dataframe is in spark_df, you can use:

[row.k for row in spark_df.select('k').distinct().collect()] 

1 Comment

I am not looking for something with loops

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.