
I have two Spark DataFrames.

df1

id  product  price
0   x        100
1   y        120
2   z        110
3   x        150
4   x        100

and df2

id  unique_products
0   x
1   y
2   z

How can I get this result:

id  unique_products  prices
0   x                [100, 150, 100]
1   y                [120]
2   z                [110]

Answer
You can group by product and apply collect_list on price, then join with df2 to obtain the id.

from pyspark.sql import functions as F

data1 = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
data2 = [(0, "x"), (1, "y"), (2, "z")]

df1 = spark.createDataFrame(data1, ("id", "product", "price"))
df2 = spark.createDataFrame(data2, ("id", "unique_products"))

# Collect all prices per product, then rename the join column to match df2.
df_prices = (
    df1.groupBy("product")
       .agg(F.collect_list("price").alias("prices"))
       .selectExpr("product as unique_products", "prices")
)

df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()

Output

+---+---------------+---------------+
| id|unique_products|         prices|
+---+---------------+---------------+
|  0|              x|[100, 150, 100]|
|  1|              y|          [120]|
|  2|              z|          [110]|
+---+---------------+---------------+
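Note that collect_list is non-deterministic: the order of elements in each array depends on the row order after the shuffle, so the prices may not appear in their original sequence. To see the grouping-then-joining logic itself without a Spark session, here is a minimal plain-Python sketch of what the query does, using in-memory stand-ins for df1 and df2:

```python
from collections import defaultdict

# In-memory stand-ins for df1 and df2 above.
data1 = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
data2 = [(0, "x"), (1, "y"), (2, "z")]

# groupBy("product") + collect_list("price"): gather all prices per product.
prices = defaultdict(list)
for _id, product, price in data1:
    prices[product].append(price)

# Join with df2 on the product column to recover the id.
result = [(_id, product, prices[product]) for _id, product in data2]
print(result)  # [(0, 'x', [100, 150, 100]), (1, 'y', [120]), (2, 'z', [110])]
```

Unlike Spark, a plain Python loop preserves input order, which is why the sketch reproduces the exact arrays shown in the output above.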
