I am not sure if the long hours of work are doing this to me, but I am seeing some unexpected behavior in Spark 2.2.0.
I have created a toy example as below:

    toy_df = spark.createDataFrame([
        ['p1','a'], ['p1','b'], ['p1','c'],
        ['p2','a'], ['p2','b'], ['p2','d']],
        schema=['patient','drug'])

I create another dataframe:

    mdf = toy_df.filter(toy_df.drug == 'c')

As you know, mdf would be:

    mdf.show()

    +-------+----+
    |patient|drug|
    +-------+----+
    |     p1|   c|
    +-------+----+

Now if I do this:

    toy_df.join(mdf, ["patient"], "left").select(
        toy_df.patient.alias("P1"),
        toy_df.drug.alias('D1'),
        mdf.patient,
        mdf.drug).show()

Surprisingly, I get:

    +---+---+-------+----+
    | P1| D1|patient|drug|
    +---+---+-------+----+
    | p2|  a|     p2|   a|
    | p2|  b|     p2|   b|
    | p2|  d|     p2|   d|
    | p1|  a|     p1|   a|
    | p1|  b|     p1|   b|
    | p1|  c|     p1|   c|
    +---+---+-------+----+

But if I use:

    toy_df.join(mdf, ["patient"], "left").show()

I do see the expected behavior:

    +-------+----+----+
    |patient|drug|drug|
    +-------+----+----+
    |     p2|   a|null|
    |     p2|   b|null|
    |     p2|   d|null|
    |     p1|   a|   c|
    |     p1|   b|   c|
    |     p1|   c|   c|
    +-------+----+----+

And if I use an alias expression on one of the dataframes, I do get the expected behavior:

    toy_df.join(mdf.alias('D'), on=["patient"], how="left").select(
        toy_df.patient.alias("P1"),
        toy_df.drug.alias("D1"),
        'D.drug').show()

    +---+---+----+
    | P1| D1|drug|
    +---+---+----+
    | p2|  a|null|
    | p2|  b|null|
    | p2|  d|null|
    | p1|  a|   c|
    | p1|  b|   c|
    | p1|  c|   c|
    +---+---+----+

So my question is: what is the best way to select columns after a join, and is this behavior normal?

Edit: as per user8371915, this is the same as the question tagged as
"Spark SQL performing carthesian join instead of inner join",
but my question involves two dataframes that share the same lineage. The join itself is performed correctly when show is invoked directly on it, but selecting columns after the join behaves differently.

df.col or df['col'] is a Column type which is not bound to the dataframe; I believe the result is expected. I'm wondering why you aren't getting an ambiguous column name error while selecting in the erroneous case.

DataFrames sharing the same lineage can result in trivially true / false predicates. This case should be handled automatically, but it looks like things slipped through the cracks here. Honest advice: always use aliases.