
I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id appears in the same column of the second dataframe:

df1:

 id    a   b
  2    1   1
  3  0.5   1
  4    1   2
  5    2   1

df2:

 id   c   d
  2  fs   a
  5  fa   f

Desired output:

df:

 id   a   b
  2   1   1
  5   2   1

I have tried df1.join(df2("id"), "left"), but it gives me the error: 'DataFrame' object is not callable.

2 Answers


df2("id") is not a valid python syntax for selecting columns, you'd either need df2[["id"]] or use select df2.select("id"); For your example, you can do:

df1.join(df2.select("id"), "id").show() +---+---+---+ | id| a| b| +---+---+---+ | 5|2.0| 1| | 2|1.0| 1| +---+---+---+ 

or:

df1.join(df2[["id"]], "id").show() +---+---+---+ | id| a| b| +---+---+---+ | 5|2.0| 1| | 2|1.0| 1| +---+---+---+ 

1 Comment

Sorry, I copied my actual syntax but made up different dataframes here. Imsi is an ID.

If you only need to check whether id exists in df2 and don't need any columns from df2 in your output, then isin() is a more efficient solution (this is similar to EXISTS and IN in SQL; see the SQL sketch at the end of this answer).

df1 = spark.createDataFrame([(2,1,1), (3,5,1), (4,1,2), (5,2,1)], "id: Int, a: Int, b: Int")
df2 = spark.createDataFrame([(2,'fs','a'), (5,'fa','f')], ['id','c','d'])
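The snippet above assumes an active Spark session named spark (as in a shell or notebook); if you run it as a standalone script, a minimal sketch to create one (the app name is just illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the examples below.
spark = SparkSession.builder.appName("semi-join-example").getOrCreate()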

Create a list from df2.id and pass it to isin() on df1:

from pyspark.sql.functions import col

df2_list = df2.select('id').rdd.map(lambda row: row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id|  a|  b|
#+---+---+---+
#|  2|  1|  1|
#|  5|  2|  1|
#+---+---+---+

It is recommended to use isin() if:

  • You don't need to return data from the reference dataframe/table

  • You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)

  • You just want to check for the existence of a particular value
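For comparison, here is the SQL sketch mentioned above: the same existence check written with a correlated EXISTS in Spark SQL, assuming the df1/df2 created earlier (the view names t1 and t2 are illustrative). Spark executes this as a left semi join.

# Register temporary views so the dataframes can be queried with SQL.
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")

# Correlated EXISTS subquery, equivalent to filtering df1 by ids present in df2.
spark.sql("""
    SELECT *
    FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id)
""").show()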

