How to do groupby and find unique items of a column in PySpark [duplicate]

Question

I have a pySpark dataframe, I want to group by a column and then find unique items in another column for each group.

In pandas I could do,

data.groupby(by=['A'])['B'].unique()

I want to do the same with my spark dataframe. I could find the distictCount of items in the group and count also, like this

(spark_df.groupby('A') .agg( fn.countDistinct(col('B')) .alias('unique_count_B'), fn.count(col('B')) .alias('count_B') ) .show())

But I couldn't find some function to find unique items in the group.

For clarifying more consider a sample dataframe,

df = spark.createDataFrame( [(1, "a"), (1, "b"), (1, "a"), (2, "c")], ["A", "B"])

I am expecting to get an output like this,

+---+----------+ | A| unique_B| +---+----------+ | 1| [a, b] | | 2| [c] | +---+----------+

How to do get the output as in pandas in pySpark.?

Sreeram TP · Accepted Answer · 2019-06-19 08:53:11Z

I used collect_set for my purpose like this,

(df.groupby('A') .agg( fn.collect_set(col('B')).alias('unique_count_B') ) .show())

I get the following output as I need,

+---+--------------+ | A|unique_count_B| +---+--------------+ | 1| [b, a]| | 2| [c]| +---+--------------+

Elior Malul · Accepted Answer · 2019-06-19 08:28:06Z

you can use the following code, that uses Window functions.

from pyspark.sql import functions as F from pyspark.sql import Window df = spark.createDataFrame( [(1, "a"), (1, "b"), (1, "a"), (2, "c")], ["A", "B"]) win = Window.partitionBy("A", "B") df.withColumn("distinct AB", F.count("*").over(win)).distinct().show()

The result is:

+---+---+-----------+ | A| B|distinct AB| +---+---+-----------+ | 2| c| 1| | 1| a| 2| | 1| b| 1| +---+---+-----------+

I have updated my question with a sample data frame and output I need.
In this case, try pyspark.sql.functions.collect_set instead of F.count

Akshay Shah · Accepted Answer · 2019-06-19 08:03:51Z

You could use a list comprehension for this After your grouping let's say the dataframe is in spark_df, you can use:

[row.k for row in spark_df.select('k').distinct().collect()]

Collectives™ on Stack Overflow

How to do groupby and find unique items of a column in PySpark [duplicate]

3 Answers 3

Comments

3 Comments

1 Comment

Linked

Hot Network Questions