Joining two data frames in PySpark so the result contains no duplicate items?

Question

I have created two data frames with the commands below. I want to join them so that the resulting data frame contains no duplicate items.

    df1 = sc.parallelize([
        ("a", 1, 1), ("b", 2, 2), ("d", 4, 2), ("e", 4, 1), ("c", 3, 4)
    ]).toDF(['SID', 'SSection', 'SRank'])
    df1.show()

    +---+--------+-----+
    |SID|SSection|SRank|
    +---+--------+-----+
    |  a|       1|    1|
    |  b|       2|    2|
    |  d|       4|    2|
    |  e|       4|    1|
    |  c|       3|    4|
    +---+--------+-----+

df2 is

    df2 = sc.parallelize([
        ("a", 2, 1), ("b", 2, 3), ("f", 4, 2), ("e", 4, 1), ("c", 3, 4)
    ]).toDF(['SID', 'SSection', 'SRank'])

    +---+--------+-----+
    |SID|SSection|SRank|
    +---+--------+-----+
    |  a|       2|    1|
    |  b|       2|    3|
    |  f|       4|    2|
    |  e|       4|    1|
    |  c|       3|    4|
    +---+--------+-----+

I want to join the two tables above to get the result below.

    +---+--------+----------+----------+
    |SID|SSection|test1SRank|test2SRank|
    +---+--------+----------+----------+
    |  f|       4|         0|         2|
    |  e|       4|         1|         1|
    |  d|       4|         2|         0|
    |  c|       3|         4|         4|
    |  b|       2|         2|         3|
    |  a|       1|         1|         0|
    |  a|       2|         0|         1|
    +---+--------+----------+----------+

philantrovert · Accepted Answer · 2018-02-12 10:31:29Z

Doesn't look like something that can be achieved with a single join. Here's a solution involving multiple joins:

    from pyspark.sql.functions import col

    # Distinct (SID, SSection) keys that appear in either data frame
    d1 = df1.unionAll(df2).select("SID", "SSection").distinct()

    # Attach each data frame's SRank to the full key set
    t1 = d1.join(df1, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test1Srank"))
    t2 = d1.join(df2, ["SID", "SSection"], "leftOuter").select(d1.SID, d1.SSection, col("SRank").alias("test2Srank"))

    # Combine both rank columns and replace missing ranks with 0
    t1.join(t2, ["SID", "SSection"]).na.fill(0).show()

    +---+--------+----------+----------+
    |SID|SSection|test1Srank|test2Srank|
    +---+--------+----------+----------+
    |  b|       2|         2|         3|
    |  c|       3|         4|         4|
    |  d|       4|         2|         0|
    |  e|       4|         1|         1|
    |  f|       4|         0|         2|
    |  a|       1|         1|         0|
    |  a|       2|         0|         1|
    +---+--------+----------+----------+
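To see why the key-set approach yields one row per (SID, SSection) pair, it may help to model the same steps in plain Python, without Spark. This is only a sketch of the logic, not how Spark executes the joins. The names `df1_rows`, `df2_rows`, `rank1`, `rank2`, and `combined` are made up for illustration, and the sketch assumes each (SID, SSection) pair occurs at most once per table, which holds for the sample data.

```python
# Sample rows from the question: (SID, SSection, SRank)
df1_rows = [("a", 1, 1), ("b", 2, 2), ("d", 4, 2), ("e", 4, 1), ("c", 3, 4)]
df2_rows = [("a", 2, 1), ("b", 2, 3), ("f", 4, 2), ("e", 4, 1), ("c", 3, 4)]

# Index each table by its (SID, SSection) key.
# Assumption: keys are unique within each table.
rank1 = {(sid, section): rank for sid, section, rank in df1_rows}
rank2 = {(sid, section): rank for sid, section, rank in df2_rows}

# Step 1: distinct union of keys (the role of d1 in the answer).
keys = sorted(set(rank1) | set(rank2))

# Steps 2 and 3: look up each table's rank for every key,
# using 0 where the key is absent (the role of the left joins plus na.fill(0)).
combined = [
    (sid, section, rank1.get((sid, section), 0), rank2.get((sid, section), 0))
    for sid, section in keys
]

for row in combined:
    print(row)
```

Because ("a", 1) and ("a", 2) are different keys, SID a appears twice, once with a rank from each table, which matches the two a rows in the output above.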

Anahcolus · Accepted Answer · 2018-02-12 10:45:55Z

You can simply rename the SRank columns, use an outer join, and then fill the missing values with na.fill:

    df1.withColumnRenamed("SRank", "test1SRank").join(
        df2.withColumnRenamed("SRank", "test2SRank"),
        ["SID", "SSection"],
        "outer"
    ).na.fill(0)
