I have two Spark DataFrames.
df1:

```
id  product  price
0   x        100
1   y        120
2   z        110
3   x        150
4   x        100
```

and df2:

```
id  unique_products
0   x
1   y
2   z
```

How can I get this result:
```
id  unique_products  prices
0   x                [100, 150, 100]
1   y                [120]
2   z                [110]
```

You can group by `product` and apply `collect_list` on `price`, then join the result with df2 to recover the `id` column.
```python
from pyspark.sql import functions as F

data1 = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
data2 = [(0, "x"), (1, "y"), (2, "z")]

df1 = spark.createDataFrame(data1, ("id", "product", "price"))
df2 = spark.createDataFrame(data2, ("id", "unique_products"))

# Collect all prices per product into an array, renaming the grouping
# column so it matches df2's join key.
df_prices = (
    df1.groupBy("product")
       .agg(F.collect_list("price").alias("prices"))
       .selectExpr("product as unique_products", "prices")
)

# Join back to df2 to recover the id column.
df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()
```

```
+---+---------------+---------------+
| id|unique_products|         prices|
+---+---------------+---------------+
|  0|              x|[100, 150, 100]|
|  1|              y|          [120]|
|  2|              z|          [110]|
+---+---------------+---------------+
```