
I have 2 dataframes:

    Location Places  Date     Date_part Sector Category  units
    USA      Place_1 7/1/2021 7/1/2021  Cars   Passenger 200
    IND      Place_2 7/1/2021 7/1/2021  Cars   Passenger 180
    COL      Place_3 7/1/2021 7/1/2021  Trucks Goods     100

    Location Places  Clicks
    USA      Place_1 320
    COL      Place_3 730
    CAN      Place_4 84

I need to create a new dataframe like this:

    Location Places  Date     Date_part Sector Category  units ClicksPerDay
    USA      Place_1 7/1/2021 7/1/2021  Cars   Passenger 200   320
    IND      Place_2 7/1/2021 7/1/2021  Cars   Passenger 180
    COL      Place_3 7/1/2021 7/1/2021  Trucks Goods     100   730
    CAN      Place_4 7/1/2021 7/1/2021  Cars   Passenger       84

I need the ClicksPerDay column to be created from Clicks, and its value to be available wherever Sector = Cars and Category = Passenger.

I tried using df_result = df1.unionByName(df2)

But I need to achieve this by joining the 2 dataframes. The result dataframe should contain all rows from both dataframes. Please share any other approaches.


2 Answers


Assuming you want to join the dataframes by 'Location', you can perform df1.join(df2, 'Location', how='outer'). This will give the output you specified; you can read more in the PySpark join documentation.
Regarding the column ClicksPerDay, as far as I can see all that differs is the name of the column, so you can rename it with withColumn:

    import pyspark.sql.functions as F

    df = df.withColumn("ClicksPerDay", F.col("Clicks")).drop("Clicks")
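
Putting both steps together, a minimal end-to-end sketch, assuming df1 and df2 are the question's two dataframes. It joins on both shared key columns, Location and Places, so neither column is duplicated; the composite key is an assumption on my part, since the prose above only mentions 'Location':

    import pyspark.sql.functions as F

    # Sketch: the outer join keeps all rows from both sides, then Clicks is renamed.
    df_result = (
        df1.join(df2, on=['Location', 'Places'], how='outer')
           .withColumn("ClicksPerDay", F.col("Clicks"))
           .drop("Clicks")
    )
    df_result.show()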

If the joined table already matches the desired output where Sector = Cars and Category = Passenger, then no further action is required. If not, you can set the value to null on specific rows with a conditional expression, as sketched below.
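
A minimal sketch of that conditional nulling, assuming df_result is the joined dataframe from the sketch above; when() with no otherwise() clause yields null for non-matching rows:

    import pyspark.sql.functions as F

    # Keep ClicksPerDay only where Sector = Cars and Category = Passenger;
    # every other row falls through to null because no otherwise() is given.
    df_result = df_result.withColumn(
        "ClicksPerDay",
        F.when(
            (F.col("Sector") == "Cars") & (F.col("Category") == "Passenger"),
            F.col("ClicksPerDay"),
        ),
    )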


You can achieve this with an Outer Join.

And where the necessary/desired columns within sparkDF1 are null, you can replace them with the values from sparkDF2.

Data Preparation

    from pyspark.sql import SparkSession

    # Assumed setup: `sql` is the SparkSession handle used below
    # (its creation was not shown in the original answer).
    sql = SparkSession.builder.getOrCreate()

    input_str1 = """
    Location Places Date Date_part Sector Category units
    USA Place_1 7/1/2021 7/1/2021 Cars Passenger 200
    IND Place_2 7/1/2021 7/1/2021 Cars Passenger 180
    COL Place_3 7/1/2021 7/1/2021 Trucks Goods 100
    """.split()

    input_value1 = list(map(lambda x: x.strip(), input_str1))
    cols1 = input_value1[:7]
    n = len(input_value1[7:])
    input_list1 = [tuple(input_value1[i:i + 7]) for i in range(7, n + 1, 7)]

    input_str2 = """
    Location Places Clicks
    USA Place_1 320
    COL Place_3 730
    CAN Place_4 84
    """.split()

    input_value2 = list(map(lambda x: x.strip(), input_str2))
    cols2 = input_value2[:3]
    n = len(input_value2[3:])
    input_list2 = [tuple(input_value2[i:i + 3]) for i in range(3, n + 1, 3)]

    sparkDF1 = sql.createDataFrame(input_list1, cols1)
    sparkDF2 = sql.createDataFrame(input_list2, cols2)

    sparkDF1.show()

    +--------+-------+--------+---------+------+---------+-----+
    |Location| Places|    Date|Date_part|Sector| Category|units|
    +--------+-------+--------+---------+------+---------+-----+
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|
    +--------+-------+--------+---------+------+---------+-----+

    sparkDF2.show()

    +--------+-------+------+
    |Location| Places|Clicks|
    +--------+-------+------+
    |     USA|Place_1|   320|
    |     COL|Place_3|   730|
    |     CAN|Place_4|    84|
    +--------+-------+------+

Note - All columns are Strings
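
If you need numeric types downstream, you could cast the count columns explicitly; a small optional sketch (the IntegerType casts are an assumption, not part of the original answer):

    from pyspark.sql.types import IntegerType

    # Optional: cast the string count columns to integers.
    sparkDF1 = sparkDF1.withColumn('units', sparkDF1['units'].cast(IntegerType()))
    sparkDF2 = sparkDF2.withColumn('Clicks', sparkDF2['Clicks'].cast(IntegerType()))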

Join

    finalDF = sparkDF1.join(
        sparkDF2,
        (sparkDF1['Location'] == sparkDF2['Location'])
        & (sparkDF1['Places'] == sparkDF2['Places']),
        'outer',
    ).select(
        sparkDF1["*"],
        sparkDF2['Clicks'].alias('ClicksPerDay'),
        sparkDF2['Location'].alias('Location_Secondary'),
        sparkDF2['Places'].alias('Places_Secondary'),
    )

    finalDF.show()

    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
    |Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|Location_Secondary|Places_Secondary|
    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+
    |    null|   null|    null|     null|  null|     null| null|          84|               CAN|         Place_4|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|              null|            null|
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|               USA|         Place_1|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|               COL|         Place_3|
    +--------+-------+--------+---------+------+---------+-----+------------+------------------+----------------+

Replacing Null Values

Precedence is given to sparkDF1, which you can modify as per your requirement.

    import pyspark.sql.functions as F

    finalDF = finalDF.withColumn(
        'Location',
        F.when(F.col('Location').isNull(), F.col('Location_Secondary'))
         .otherwise(F.col('Location')),
    ).withColumn(
        'Places',
        F.when(F.col('Places').isNull(), F.col('Places_Secondary'))
         .otherwise(F.col('Places')),
    )

    final_column_list = ['Location', 'Places', 'Date', 'Date_part',
                         'Sector', 'Category', 'units', 'ClicksPerDay']

    finalDF = finalDF.select(*final_column_list).orderBy(F.col('Location').desc())

    finalDF.show()

    +--------+-------+--------+---------+------+---------+-----+------------+
    |Location| Places|    Date|Date_part|Sector| Category|units|ClicksPerDay|
    +--------+-------+--------+---------+------+---------+-----+------------+
    |     USA|Place_1|7/1/2021| 7/1/2021|  Cars|Passenger|  200|         320|
    |     IND|Place_2|7/1/2021| 7/1/2021|  Cars|Passenger|  180|        null|
    |     COL|Place_3|7/1/2021| 7/1/2021|Trucks|    Goods|  100|         730|
    |     CAN|Place_4|    null|    null|  null|     null| null|          84|
    +--------+-------+--------+---------+------+---------+-----+------------+
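
As a design note, the two when/otherwise expressions can be collapsed with F.coalesce, which returns its first non-null argument. A sketch of that variant, folding the null replacement into the join's select (equivalent in outcome, not what the answer above literally ran):

    import pyspark.sql.functions as F

    # Sketch: coalesce prefers sparkDF1's key columns, falling back to sparkDF2's.
    finalDF = sparkDF1.join(
        sparkDF2,
        (sparkDF1['Location'] == sparkDF2['Location'])
        & (sparkDF1['Places'] == sparkDF2['Places']),
        'outer',
    ).select(
        F.coalesce(sparkDF1['Location'], sparkDF2['Location']).alias('Location'),
        F.coalesce(sparkDF1['Places'], sparkDF2['Places']).alias('Places'),
        sparkDF1['Date'], sparkDF1['Date_part'], sparkDF1['Sector'],
        sparkDF1['Category'], sparkDF1['units'],
        sparkDF2['Clicks'].alias('ClicksPerDay'),
    )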
