0

I have a DataFrame, df3, that results from joining two DataFrames, df1 and df2. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove from the join all the df1 columns whose names are in df2.columns. Is there a way to do this without using a var? Currently I've done this:

var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))

but what I really want is just a shortcut for

df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2))).... 

without using a var.

Passing a list of columns is not an option; I don't know whether that's because I'm using Spark 2.2.

EDIT:

Important note: I don't know in advance the columns of df1 and df2

2
  • Did you try specifying the join column as an array type or string? docs.databricks.com/spark/latest/faq/… Commented Mar 25, 2019 at 12:23
  • @KZapagol thanks, but that's not what I need. Columns with the same name contain different values, which is why I need to remove only the ones from the first DataFrame. Commented Mar 25, 2019 at 12:27

2 Answers 2

3

You can achieve this while performing the join itself. Please try the code below:

 val resultDf = df1.alias("frstdf")
   .join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer")
   .selectExpr("scndf.col1", "scndf.col2"...) // or .selectExpr("scndf.*") to keep all df2 columns

This would contain only the columns from the second DataFrame. Hope this helps.


3 Comments

My question is Scala-oriented: I need to know if there is a shortcut for applying a chain of transformations. Anyway, this is a really good answer; I'll consider accepting it as the final answer if no better one comes along.
It's written in Scala. A chain of transformations is also possible within selectExpr. You can concat, calculate, and cast columns very easily.
Sorry for the late reply, but as I specified in my edit, I don't know in advance the columns of the two dataframes in input
1

A shortcut would be:

val ret = df2.columns.foldLeft(df3)((acc,coln) => acc.drop(df2(coln))) 
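To see why the fold replaces the var, here is a minimal sketch on plain Scala collections, so it runs without Spark. The `Table` alias and the `drop` helper are hypothetical stand-ins for a DataFrame and `DataFrame.drop`; they are not Spark APIs. The accumulator `acc` plays the role that `ret` played in the question.

```scala
object FoldInsteadOfVar {
  // Stand-in for a joined DataFrame: column name -> values (not a Spark type).
  type Table = Map[String, Seq[Int]]

  // Hypothetical stand-in for DataFrame.drop: returns a new table without `name`.
  def drop(table: Table, name: String): Table = table - name

  // The var + foreach version from the question.
  def dropAllWithVar(table: Table, names: Seq[String]): Table = {
    var ret = table
    names.foreach(n => ret = drop(ret, n))
    ret
  }

  // The foldLeft version: each step receives the previous result as `acc`.
  def dropAllWithFold(table: Table, names: Seq[String]): Table =
    names.foldLeft(table)((acc, n) => drop(acc, n))

  def main(args: Array[String]): Unit = {
    val joined: Table = Map("id" -> Seq(1, 2), "a" -> Seq(10, 20), "b" -> Seq(30, 40))
    val toDrop = Seq("a", "b")
    // Both versions apply the same chain of drops, in the same order.
    println(dropAllWithVar(joined, toDrop) == dropAllWithFold(joined, toDrop))
    println(dropAllWithFold(joined, toDrop).keySet)
  }
}
```

The same shape should work for any chain of `DataFrame => DataFrame` steps, not only `drop`.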

I would suggest removing the columns before the join. Alternatively, select only the columns from df3 which come from df2:

val ret = df3.select(df2.columns.map(df2(_)): _*)
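Removing the columns before the join could look roughly like the sketch below. It assumes the join is on a single column that exists under the same name in both DataFrames; the question does not state the join condition, so the `key` parameter, the helper name `joinPreferringRight`, and the sample data are all hypothetical. Only the column names are read at runtime, so nothing has to be known in advance.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropBeforeJoin {
  // Hypothetical helper: `key` is an assumed join column present in both frames.
  def joinPreferringRight(df1: DataFrame, df2: DataFrame, key: String): DataFrame = {
    // Every df2 column except the key would otherwise be duplicated by the join.
    val overlapping = df2.columns.filterNot(_ == key)
    // drop(colNames: String*) takes varargs and ignores names df1 does not have.
    val slimDf1 = df1.drop(overlapping: _*)
    // Joining on Seq(key) keeps a single copy of the key column.
    slimDf1.join(df2, Seq(key), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-before-join")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, only for illustration.
    val df1 = Seq((1, "old", "x"), (2, "old", "y")).toDF("id", "value", "extra")
    val df2 = Seq((1, "new"), (2, "new")).toDF("id", "value")

    joinPreferringRight(df1, df2, "id").show()
    spark.stop()
  }
}
```

Because the drop happens on df1 alone, the column names are unambiguous at that point, so there is no need to fold over Column references.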

