I have two Spark DataFrames.
df1:

```
id  product  price
0   x        100
1   y        120
2   z        110
3   x        150
4   x        100
```

and df2:

```
id  unique_products
0   x
1   y
2   z
```

How can I get this result:
```
id  unique_products  prices
0   x                [100, 150, 100]
1   y                [120]
2   z                [110]
```

You can group by `product` and apply `collect_list` on `price`, then join the result with df2 to recover the `id` column.
```python
from pyspark.sql import functions as F

data1 = [(0, "x", 100), (1, "y", 120), (2, "z", 110), (3, "x", 150), (4, "x", 100)]
data2 = [(0, "x"), (1, "y"), (2, "z")]

df1 = spark.createDataFrame(data1, ("id", "product", "price"))
df2 = spark.createDataFrame(data2, ("id", "unique_products"))

# Collect all prices per product into an array, renaming the grouping
# column so it matches df2's join key.
df_prices = (
    df1.groupBy("product")
       .agg(F.collect_list("price").alias("prices"))
       .selectExpr("product as unique_products", "prices")
)

# Join back to df2 to recover the id column.
df2.join(df_prices, ["unique_products"]).select("id", "unique_products", "prices").show()
```

```
+---+---------------+---------------+
| id|unique_products|         prices|
+---+---------------+---------------+
|  0|              x|[100, 150, 100]|
|  1|              y|          [120]|
|  2|              z|          [110]|
+---+---------------+---------------+
```