Pyspark: Join 2 dataframes with different number of rows by duplication

Question

Good day everyone,

Suppose I want to join the following two dataframes so that the one row of df2 gets duplicated in the join for every row in df1 which has the same id:

df1 = sc.parallelize(Array((0, 1.5, 8.4, 9.1), (0, 5.9, 1.3, 3.3), (0, 3.2, 7.9, 4.2), (0, 6.9, 2.2, 1.2))) .toDF("id", "col1", "col2", "col3") df2 = sc.parallelize(Array((0, 2.1, 3.4, 4.2))) .toDF("id", "col4", "col5", "col6")

The desired result would look as follows:

df_joined.show() +---+-----+-----+-----+----+-----+-----+ | id| col1|col2 |col3 |col4|col5 |col6 | +---+-----+-----+-----+----+-----+-----+ | 0| 1.5| 8.4| 9.1| 2.1| 3.4| 4.2| | 0| 5.9| 1.3| 3.3| 2.1| 3.4| 4.2| | 0| 3.2| 7.9| 4.2| 2.1| 3.4| 4.2| | 0| 6.9| 2.2| 1.2| 2.1| 3.4| 4.2| +---+-----+-----+-----+----+-----+-----+

I have tried all outer joins, as well as a crossJoin which is the cartesian product between the two dfs, but didn't work. A brute solution would be to just duplicate the rows of df2 the number of times the corresponding id appears in df1 and then do a normal outer join, but I think there must be a way to get the desired result by using joins.

Any help appreciated.

the issue was that with crossJoin, you cannot specify on which column to do the join. its just a standard cartesian product. so the output was incorrect — Emile Beukes
– Emile Beukes, Commented Apr 19, 2021 at 8:13

mck · Accepted Answer · 2021-04-19 08:15:21Z

I think you can just join on the id:

df1.join(df2, 'id').show() +---+----+----+----+----+----+----+ | id|col1|col2|col3|col4|col5|col6| +---+----+----+----+----+----+----+ | 0| 1.5| 8.4| 9.1| 2.1| 3.4| 4.2| | 0| 5.9| 1.3| 3.3| 2.1| 3.4| 4.2| | 0| 3.2| 7.9| 4.2| 2.1| 3.4| 4.2| | 0| 6.9| 2.2| 1.2| 2.1| 3.4| 4.2| +---+----+----+----+----+----+----+

Collectives™ on Stack Overflow

Pyspark: Join 2 dataframes with different number of rows by duplication

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related