0

Good day everyone,

Suppose I want to join the following two dataframes so that the one row of df2 gets duplicated in the join for every row in df1 which has the same id:

df1 = sc.parallelize(Array((0, 1.5, 8.4, 9.1), (0, 5.9, 1.3, 3.3), (0, 3.2, 7.9, 4.2), (0, 6.9, 2.2, 1.2))) .toDF("id", "col1", "col2", "col3") df2 = sc.parallelize(Array((0, 2.1, 3.4, 4.2))) .toDF("id", "col4", "col5", "col6") 

The desired result would look as follows:

df_joined.show() +---+-----+-----+-----+----+-----+-----+ | id| col1|col2 |col3 |col4|col5 |col6 | +---+-----+-----+-----+----+-----+-----+ | 0| 1.5| 8.4| 9.1| 2.1| 3.4| 4.2| | 0| 5.9| 1.3| 3.3| 2.1| 3.4| 4.2| | 0| 3.2| 7.9| 4.2| 2.1| 3.4| 4.2| | 0| 6.9| 2.2| 1.2| 2.1| 3.4| 4.2| +---+-----+-----+-----+----+-----+-----+ 

I have tried all outer joins, as well as a crossJoin which is the cartesian product between the two dfs, but didn't work. A brute solution would be to just duplicate the rows of df2 the number of times the corresponding id appears in df1 and then do a normal outer join, but I think there must be a way to get the desired result by using joins.

Any help appreciated.

1
  • the issue was that with crossJoin, you cannot specify on which column to do the join. its just a standard cartesian product. so the output was incorrect Commented Apr 19, 2021 at 8:13

1 Answer 1

1

I think you can just join on the id:

df1.join(df2, 'id').show() +---+----+----+----+----+----+----+ | id|col1|col2|col3|col4|col5|col6| +---+----+----+----+----+----+----+ | 0| 1.5| 8.4| 9.1| 2.1| 3.4| 4.2| | 0| 5.9| 1.3| 3.3| 2.1| 3.4| 4.2| | 0| 3.2| 7.9| 4.2| 2.1| 3.4| 4.2| | 0| 6.9| 2.2| 1.2| 2.1| 3.4| 4.2| +---+----+----+----+----+----+----+ 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.