I have a PySpark DataFrame, and I want to group by one column and then find the unique items in another column for each group.
In pandas I could do:

```python
data.groupby(by=['A'])['B'].unique()
```

I want to do the same with my Spark DataFrame. I can get the distinct count of items per group, along with the overall count, like this:
```python
(spark_df.groupby('A')
    .agg(
        fn.countDistinct(col('B')).alias('unique_count_B'),
        fn.count(col('B')).alias('count_B'))
    .show())
```

But I couldn't find a function that returns the unique items themselves for each group.
To clarify, consider this sample DataFrame:
```python
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
    ["A", "B"])
```

I am expecting to get an output like this:
```
+---+----------+
|  A|  unique_B|
+---+----------+
|  1|    [a, b]|
|  2|       [c]|
+---+----------+
```

How can I get the same output as in pandas with PySpark?
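The closest thing I have come across is `collect_set`, but I'm not sure whether it is the right equivalent of pandas' `unique()` (for example, whether it guarantees distinct values, or how it handles ordering and nulls). A minimal sketch of what I have in mind, using the sample DataFrame above:

```python
import pyspark.sql.functions as fn

# Tentative attempt: I believe collect_set gathers the distinct values
# of B into an array per group, but I'm not certain this is the
# idiomatic equivalent of pandas' groupby(...)['B'].unique().
(df.groupby('A')
   .agg(fn.collect_set('B').alias('unique_B'))
   .show())
```

Is this the right approach, or is there a more direct way to replicate the pandas behaviour?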