
I am new to data science and I am working on a simple personal project using Google Colab. I read data from a something1.csv file and a something2.csv file.

df1 = spark.read.csv('something1.csv', header=True)
df2 = spark.read.csv('something2.csv', header=True)

The data in something1.csv looks like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#|   Andorra|   42.506|    1.5218|    2   |   1   |
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Austria|   41.597|    12.580|    4   |   9   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|   Ireland|    9.281|     9.286|    9   |   8   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

The data in something2.csv looks like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    2   |   6   |
#|   Belgium|   21.782|     1.286|    1   |   6   |
#|     India|   78.389|    12.972|    3   |   5   |
#|       USA|   69.371|    21.819|    2   |   5   |
#+----------+---------+----------+--------+-------+

Now I want to intersect df2 with df1 based on Latitude and Longitude: keep the rows of df1 whose Latitude and Longitude also appear in df2, along with col1 and col2 from df1. My table should look like this:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

I tried the following code, but it didn't work:

new_df = df1.intersect(df2)  #using intersect in PySpark, which gave me an empty table

Then I also tried intersecting based on Latitude and Longitude only:

new_df = df2.select('Latitude','Longitude').intersect(df1.select('Latitude','Longitude'))  #intersecting based on columns

I tried both of the above methods in PySpark, but neither worked.

  • Spark version? Commented May 4, 2020 at 15:21
  • @Srinivas Yes, Spark version and PySpark. Commented May 4, 2020 at 15:28
  • df1.intersect(df2) will give an empty result because you don't have matching rows in both dataframes, which is expected. The second one should give a result because those columns have matching rows in both dfs. Commented May 4, 2020 at 15:29

1 Answer


Intersect only returns rows that are identical across all columns in both dataframes (a quick sketch after the list below shows why that comes back empty here).

  • But in your case you need col1 and col2 from df1 and the other columns from df2, so join the dataframes (left or inner, as the requirement dictates) and select only col1, col2 from df1 and the other columns from df2.

  • (or) As mentioned in the comments by Mohammad Murtaza Hashmi, use a left_semi join.
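
To see why the plain intersect came back empty, here is a quick sketch against the sample data above (intersect is PySpark's built-in DataFrame method; the comments note what each call returns):

#intersect() compares entire rows, so no row matches here: col1/col2 differ,
#e.g. Australia is (..., 1, 6) in df1 but (..., 2, 6) in df2
df1.intersect(df2).count()   #0

#intersecting only the key columns does find the common pairs, but the
#result no longer carries col1/col2 - hence the joins in the examples below
df1.select('Latitude','Longitude').intersect(df2.select('Latitude','Longitude')).show()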

Example:

#using left semi join
df1.join(df2, ['Latitude','Longitude'], 'left_semi').show()

#using left join
df2.alias("t2").join(df1.alias("t1"), ['Latitude','Longitude'], 'left')\
    .select("t2.country", "t2.Latitude", "t2.Longitude", "t1.col1", "t1.col2")\
    .show()

#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
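
Note that the left_semi version returns only df1's columns, filtered to the rows whose Latitude/Longitude also appear in df2, so its output is the requested table as well, with col1 and col2 coming from df1. The left join version builds the same table explicitly by taking country, Latitude, Longitude from df2 and col1, col2 from df1.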

Dynamic way:

#join columns: every column whose name starts with "L"
join_cols = [x for x in df1.columns if x.startswith("L")]
#columns to take from t1 (df1)
t1_cols = ["t1." + x for x in df1.columns if x.startswith("col")]
#columns to take from t2 (df2)
t2_cols = ["t2." + x for x in df2.columns if not x.startswith("col")]

df2.alias("t2").join(df1.alias("t1"), join_cols, 'inner')\
    .select(*t2_cols + t1_cols)\
    .show()

#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
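
If the non-key column names are not known up front (as raised in the comments below, where they are masked with ####), only the join keys need to be hardcoded. A minimal sketch, assuming Latitude and Longitude are the only columns you can name:

#left_semi keeps every df1 column as-is, whatever it is called,
#so 50-100 unnamed columns come along without listing them
join_cols = ['Latitude', 'Longitude']
df1.join(df2, join_cols, 'left_semi').show()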

6 Comments

How would I do the above join if I have many columns, like 50 or more?
@Shu why not use left_semi? df1.join(df2,['Latitude','Longitude'],'left_semi')
@MohammadMurtazaHashmi Sure, we can use that too. Please add it as an answer!
@R0bert Please check the updated answer! We can create variables based on df.columns and use them in the join.
@Shu Thanks for the answer, but my problem is that the column names in the data set are encrypted with ####, so I don't know the column names. I need to get the intersection of the tables based only on Latitude and Longitude and also get the other 50-100 columns.
