You can achieve this with an outer join.
Wherever the necessary/desired columns in sparkDF1 are null, you can replace them with the corresponding values from sparkDF2.
Data Preparation
input_str1 = """ Location Places Date Date_part Sector Category units USA Place_1 7/1/2021 7/1/2021 Cars Passenger 200 IND Place_2 7/1/2021 7/1/2021 Cars Passenger 180 COL Place_3 7/1/2021 7/1/2021 Trucks Goods 100 """.split() input_value1 = list(map(lambda x:x.strip(),input_str1)) cols1 = input_value1[:7] n = len(input_value1[7:]) input_list1 = [tuple(input_value1[i:i+7]) for i in range(7,n+1,7)] input_str2 = """ Location Places Clicks USA Place_1 320 COL Place_3 730 CAN Place_4 84 """.split() input_value2 = list(map(lambda x:x.strip(),input_str2)) cols2 = input_value2[:3] n = len(input_value2[3:]) input_list2 = [tuple(input_value2[i:i+3]) for i in range(3,n+1,3)] sparkDF1 = sql.createDataFrame(input_list1,cols1) sparkDF2 = sql.createDataFrame(input_list2,cols2) sparkDF1.show() +--------+-------+--------+---------+------+---------+-----+ |Location| Places| Date|Date_part|Sector| Category|units| +--------+-------+--------+---------+------+---------+-----+ | USA|Place_1|7/1/2021| 7/1/2021| Cars|Passenger| 200| | IND|Place_2|7/1/2021| 7/1/2021| Cars|Passenger| 180| | COL|Place_3|7/1/2021| 7/1/2021|Trucks| Goods| 100| +--------+-------+--------+---------+------+---------+-----+ sparkDF2.show() +--------+-------+------+ |Location| Places|Clicks| +--------+-------+------+ | USA|Place_1| 320| | COL|Place_3| 730| | CAN|Place_4| 84| +--------+-------+------+
Note - All columns are Strings
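Since everything is a string, you may want to cast the count columns before doing any arithmetic. A minimal sketch, assuming the DataFrames created above:

# Cast the string count columns to integers (assumes sparkDF1/sparkDF2 from above)
sparkDF1 = sparkDF1.withColumn('units', F.col('units').cast('int'))
sparkDF2 = sparkDF2.withColumn('Clicks', F.col('Clicks').cast('int'))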
Join
finalDF = sparkDF1.join(sparkDF2
                        ,(sparkDF1['Location'] == sparkDF2['Location'])
                         & (sparkDF1['Places'] == sparkDF2['Places'])
                        ,'outer'
          ).select(sparkDF1["*"]
                   ,sparkDF2['Clicks'].alias('ClicksPerDay')
                   ,sparkDF2['Location'].alias('Location_Secondary')
                   ,sparkDF2['Places'].alias('Places_Secondary')
          )

finalDF.show()

+--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
|Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|Location_Secondary|Places_Secondary|
+--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
|    null|   null|    null|     null|  null|     null| null|          84|               CAN|         Place_4|
|     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|              null|            null|
|     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|               USA|         Place_1|
|     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|               COL|         Place_3|
+--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
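As an aside, if you instead pass a list of column names to join, Spark keeps a single copy of each join key and, for an outer join, coalesces the left and right key values automatically, so the secondary columns become unnecessary. A minimal sketch over the same DataFrames (altDF is just an illustrative name):

# Joining on column names merges the keys; 'Location'/'Places' are
# already coalesced across both sides in the outer-join result
altDF = sparkDF1.join(sparkDF2, ['Location', 'Places'], 'outer') \
                .withColumnRenamed('Clicks', 'ClicksPerDay')

The rest of the answer keeps the explicit secondary columns so the null-replacement step stays visible.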
Replacing Null Values
Precedence is given to sparkDF1; you can adjust this to suit your requirement.
finalDF = finalDF.withColumn('Location', F.when(F.col('Location').isNull()
                                                ,F.col('Location_Secondary')).otherwise(F.col('Location')))\
                 .withColumn('Places', F.when(F.col('Places').isNull()
                                              ,F.col('Places_Secondary')).otherwise(F.col('Places')))

final_column_list = ['Location','Places','Date','Date_part','Sector','Category','units','ClicksPerDay']

finalDF = finalDF.select(*final_column_list).orderBy(F.col('Location').desc())

finalDF.show()

+--------+-------+--------+---------+------+---------+-----+------------+
|Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|
+--------+-------+--------+---------+------+---------+-----+------------+
|     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|
|     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|
|     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|
|     CAN|Place_4|    null|     null|  null|     null| null|          84|
+--------+-------+--------+---------+------+---------+-----+------------+
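The same replacement can be written more compactly with F.coalesce, which returns its first non-null argument. A sketch equivalent to the when/otherwise chain above:

# coalesce() picks the first non-null value per row, so this is
# equivalent to the when(...).otherwise(...) pattern used above
finalDF = finalDF.withColumn('Location', F.coalesce('Location', 'Location_Secondary'))\
                 .withColumn('Places', F.coalesce('Places', 'Places_Secondary'))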
Note that this calls for a join, not a union.