
I have 2 dataframes:

    Location Places  Date     Date_part Sector Category  units
    USA      Place_1 7/1/2021 7/1/2021  Cars   Passenger 200
    IND      Place_2 7/1/2021 7/1/2021  Cars   Passenger 180
    COL      Place_3 7/1/2021 7/1/2021  Trucks Goods     100

    Location Places  Clicks
    USA      Place_1 320
    COL      Place_3 730
    CAN      Place_4 84

I need to create a new dataframe like this:

    Location Places  Date     Date_part Sector Category  units ClicksPerDay
    USA      Place_1 7/1/2021 7/1/2021  Cars   Passenger 200   320
    IND      Place_2 7/1/2021 7/1/2021  Cars   Passenger 180
    COL      Place_3 7/1/2021 7/1/2021  Trucks Goods     100   730
    CAN      Place_4 7/1/2021 7/1/2021  Cars   Passenger       84

I need the ClicksPerDay column to be created from Clicks, and its value to be available wherever Sector = Cars and Category = Passenger.

I tried using df_result = df1.unionByName(df2)

But I need to achieve this by joining the 2 dataframes. The result dataframe should contain all rows from both dataframes. Please share any other approaches.


2 Answers


Assuming you want to join the dataframes by 'Location', you can perform df1.join(df2, 'Location', how='outer'). This will give the output you specified; you can read more in the PySpark join documentation.
Regarding the column ClicksPerDay, as far as I can see all that differs is the name of the column, so you can rename it with withColumn:

    import pyspark.sql.functions as F

    df = df.withColumn("ClicksPerDay", F.col("Clicks")).drop("Clicks")
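
Putting both steps together, a minimal end-to-end sketch, assuming df1 and df2 are the question's two dataframes. It joins on both shared key columns, Location and Places, so neither column is duplicated; the composite key is an assumption on my part, since the prose above only mentions 'Location':

    import pyspark.sql.functions as F

    # Sketch: the outer join keeps all rows from both sides, then Clicks is renamed.
    df_result = (
        df1.join(df2, on=['Location', 'Places'], how='outer')
           .withColumn("ClicksPerDay", F.col("Clicks"))
           .drop("Clicks")
    )
    df_result.show()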

If the joined table already matches the desired output where Sector = Cars and Category = Passenger, then no further action is required. If not, you can set the value to null on specific rows with a conditional expression, as sketched below.
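
A minimal sketch of that conditional nulling, assuming df_result is the joined dataframe from the sketch above; when() with no otherwise() clause yields null for non-matching rows:

    import pyspark.sql.functions as F

    # Keep ClicksPerDay only where Sector = Cars and Category = Passenger;
    # every other row falls through to null because no otherwise() is given.
    df_result = df_result.withColumn(
        "ClicksPerDay",
        F.when(
            (F.col("Sector") == "Cars") & (F.col("Category") == "Passenger"),
            F.col("ClicksPerDay"),
        ),
    )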


You can achieve this with an Outer Join.

And where the necessary/desired columns within sparkDF1 are null, you can replace them with the values from sparkDF2.

Data Preparation

    from pyspark.sql import SparkSession

    # Assumed setup: `sql` is the SparkSession handle used below
    # (its creation was not shown in the original answer).
    sql = SparkSession.builder.getOrCreate()

    input_str1 = """
    Location Places Date Date_part Sector Category units
    USA Place_1 7/1/2021 7/1/2021 Cars Passenger 200
    IND Place_2 7/1/2021 7/1/2021 Cars Passenger 180
    COL Place_3 7/1/2021 7/1/2021 Trucks Goods 100
    """.split()

    input_value1 = list(map(lambda x: x.strip(), input_str1))
    cols1 = input_value1[:7]
    n = len(input_value1[7:])
    input_list1 = [tuple(input_value1[i:i + 7]) for i in range(7, n + 1, 7)]

    input_str2 = """
    Location Places Clicks
    USA Place_1 320
    COL Place_3 730
    CAN Place_4 84
    """.split()

    input_value2 = list(map(lambda x: x.strip(), input_str2))
    cols2 = input_value2[:3]
    n = len(input_value2[3:])
    input_list2 = [tuple(input_value2[i:i + 3]) for i in range(3, n + 1, 3)]

    sparkDF1 = sql.createDataFrame(input_list1, cols1)
    sparkDF2 = sql.createDataFrame(input_list2, cols2)

    sparkDF1.show()

    +--------+-------+--------+---------+------+---------+-----+
    |Location| Places|    Date|Date_part|Sector| Category|units|
    +--------+-------+--------+---------+------+---------+-----+
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|
    +--------+-------+--------+---------+------+---------+-----+

    sparkDF2.show()

    +--------+-------+------+
    |Location| Places|Clicks|
    +--------+-------+------+
    |     USA|Place_1|   320|
    |     COL|Place_3|   730|
    |     CAN|Place_4|    84|
    +--------+-------+------+

Note - All columns are Strings
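
If you need numeric types downstream, you could cast the count columns explicitly; a small optional sketch (the IntegerType casts are an assumption, not part of the original answer):

    from pyspark.sql.types import IntegerType

    # Optional: cast the string count columns to integers.
    sparkDF1 = sparkDF1.withColumn('units', sparkDF1['units'].cast(IntegerType()))
    sparkDF2 = sparkDF2.withColumn('Clicks', sparkDF2['Clicks'].cast(IntegerType()))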

Join

    finalDF = sparkDF1.join(
        sparkDF2,
        (sparkDF1['Location'] == sparkDF2['Location'])
        & (sparkDF1['Places'] == sparkDF2['Places']),
        'outer',
    ).select(
        sparkDF1["*"],
        sparkDF2['Clicks'].alias('ClicksPerDay'),
        sparkDF2['Location'].alias('Location_Secondary'),
        sparkDF2['Places'].alias('Places_Secondary'),
    )

    finalDF.show()

    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
    |Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|Location_Secondary|Places_Secondary|
    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
    |    null|   null|    null|     null|  null|     null| null|          84|               CAN|         Place_4|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|              null|            null|
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|               USA|         Place_1|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|               COL|         Place_3|
    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+

Replacing Null Values

Precedence is given to sparkDF1, which you can modify as per your requirement.

    import pyspark.sql.functions as F

    finalDF = finalDF.withColumn(
        'Location',
        F.when(F.col('Location').isNull(), F.col('Location_Secondary'))
         .otherwise(F.col('Location')),
    ).withColumn(
        'Places',
        F.when(F.col('Places').isNull(), F.col('Places_Secondary'))
         .otherwise(F.col('Places')),
    )

    final_column_list = ['Location', 'Places', 'Date', 'Date_part',
                         'Sector', 'Category', 'units', 'ClicksPerDay']

    finalDF = finalDF.select(*final_column_list).orderBy(F.col('Location').desc())

    finalDF.show()

    +--------+-------+--------+---------+------+---------+-----+------------+
    |Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|
    +--------+-------+--------+---------+------+---------+-----+------------+
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|
    |     CAN|Place_4|    null|    null|  null|     null| null|          84|
    +--------+-------+--------+---------+------+---------+-----+------------+
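
As a design note, the two when/otherwise expressions can be collapsed with F.coalesce, which returns its first non-null argument. A sketch of that variant, folding the null replacement into the join's select (equivalent in outcome, not what the answer above literally ran):

    import pyspark.sql.functions as F

    # Sketch: coalesce prefers sparkDF1's key columns, falling back to sparkDF2's.
    finalDF = sparkDF1.join(
        sparkDF2,
        (sparkDF1['Location'] == sparkDF2['Location'])
        & (sparkDF1['Places'] == sparkDF2['Places']),
        'outer',
    ).select(
        F.coalesce(sparkDF1['Location'], sparkDF2['Location']).alias('Location'),
        F.coalesce(sparkDF1['Places'], sparkDF2['Places']).alias('Places'),
        sparkDF1['Date'], sparkDF1['Date_part'], sparkDF1['Sector'],
        sparkDF1['Category'], sparkDF1['units'],
        sparkDF2['Clicks'].alias('ClicksPerDay'),
    )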
