I am new to Data Science and I am working on a small self-study project using Google Colab. I read data from a something1.csv file and a something2.csv file.
df1 = spark.read.csv('something1.csv', header=True)
df2 = spark.read.csv('something2.csv', header=True)

The data in something1.csv looks like:
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#|   Andorra|   42.506|    1.5218|    2   |   1   |
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Austria|   41.597|    12.580|    4   |   9   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|   Ireland|    9.281|     9.286|    9   |   8   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

The data in something2.csv looks like:
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    2   |   6   |
#|   Belgium|   21.782|     1.286|    1   |   6   |
#|     India|   78.389|    12.972|    3   |   5   |
#|       USA|   69.371|    21.819|    2   |   5   |
#+----------+---------+----------+--------+-------+

Now I want to intersect df2 with df1 on Latitude and Longitude, and get the rows of df1 that match, with col1 and col2 taken from df1. The result should look like:
#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|  col1  |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

I tried the following code, but it didn't work.
new_df = df1.intersect(df2)  # intersect in PySpark, which gave me an empty table

Then I also tried intersecting on just Latitude and Longitude:
new_df = df2.select('Latitude','Longitude').intersect(df1.select('Latitude','Longitude'))  # intersecting based on columns

I tried both of the above approaches in PySpark, but neither gave me the table I want.