I have created two data frames by running the commands below. I want to join them in PySpark so that the resulting data frame contains no duplicate rows.
    df1 = sc.parallelize([
        ("a", 1, 1), ("b", 2, 2), ("d", 4, 2), ("e", 4, 1), ("c", 3, 4)
    ]).toDF(['SID', 'SSection', 'SRank'])
    df1.show()

    +---+--------+-----+
    |SID|SSection|SRank|
    +---+--------+-----+
    |  a|       1|    1|
    |  b|       2|    2|
    |  d|       4|    2|
    |  e|       4|    1|
    |  c|       3|    4|
    +---+--------+-----+

df2 is:
    df2 = sc.parallelize([
        ("a", 2, 1), ("b", 2, 3), ("f", 4, 2), ("e", 4, 1), ("c", 3, 4)
    ]).toDF(['SID', 'SSection', 'SRank'])
    df2.show()

    +---+--------+-----+
    |SID|SSection|SRank|
    +---+--------+-----+
    |  a|       2|    1|
    |  b|       2|    3|
    |  f|       4|    2|
    |  e|       4|    1|
    |  c|       3|    4|
    +---+--------+-----+

I want to join the two tables above to get the result below:
    +---+--------+----------+----------+
    |SID|SSection|test1SRank|test2SRank|
    +---+--------+----------+----------+
    |  f|       4|         0|         2|
    |  e|       4|         1|         1|
    |  d|       4|         2|         0|
    |  c|       3|         4|         4|
    |  b|       2|         2|         3|
    |  a|       1|         1|         0|
    |  a|       2|         0|         1|
    +---+--------+----------+----------+
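
To make the expected result concrete: the table above looks like a full outer join on the pair (SID, SSection), with a rank that is missing on one side replaced by 0. The sketch below reproduces that logic in plain Python on the same tuples, so it runs without a Spark session. The helper name `full_outer_join` and the `fill` parameter are made up for illustration, and the PySpark calls in the trailing comment are a guess at the equivalent, not something tested here.

```python
# Same rows as df1 and df2, as (SID, SSection, SRank) tuples.
rows1 = [("a", 1, 1), ("b", 2, 2), ("d", 4, 2), ("e", 4, 1), ("c", 3, 4)]
rows2 = [("a", 2, 1), ("b", 2, 3), ("f", 4, 2), ("e", 4, 1), ("c", 3, 4)]


def full_outer_join(left, right, fill=0):
    """Join on (SID, SSection); a rank missing on one side becomes `fill`.

    Hypothetical helper for illustration only. It assumes each
    (SID, SSection) pair occurs at most once per input.
    """
    left_rank = {(sid, sec): rank for sid, sec, rank in left}
    right_rank = {(sid, sec): rank for sid, sec, rank in right}
    keys = set(left_rank) | set(right_rank)  # every key from either side, once
    joined = [
        (sid, sec, left_rank.get((sid, sec), fill), right_rank.get((sid, sec), fill))
        for sid, sec in keys
    ]
    # Order the rows like the expected table: SID descending, then
    # SSection ascending within the same SID (both sorts are stable).
    joined.sort(key=lambda row: row[1])
    joined.sort(key=lambda row: row[0], reverse=True)
    return joined


result = full_outer_join(rows1, rows2)
for row in result:
    print(row)

# Possible PySpark equivalent (assumption, not run here):
#   left = df1.withColumnRenamed('SRank', 'test1SRank')
#   right = df2.withColumnRenamed('SRank', 'test2SRank')
#   left.join(right, ['SID', 'SSection'], 'full_outer').fillna(0)
```

Row order after a Spark join is not guaranteed, so the sort here is only to match the layout of the expected table.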