
I have two PySpark DataFrames, 'a' and 'b'. I left join them, selecting a few fields from 'a' directly; for the other fields I check a condition (if the field in 'a' is null, select the field from 'b' instead). The code below works fine and gives me the required result.

from pyspark.sql import functions as f  # alias assumed from the f.* calls below

df_final = (
    df1.alias('a')
    .join(df2.alias('b'), on=['name_id_forwarded'], how='left')
    .select(
        'a.name_id', 'a.SUM', 'a.full_name',
        f.when(f.isnull(f.col('a.first_name')), f.col('b.first_name')).otherwise(f.col('a.first_name')).alias('first_name'),
        f.when(f.isnull(f.col('a.last_name')), f.col('b.last_name')).otherwise(f.col('a.last_name')).alias('last_name'),
        f.when(f.isnull(f.col('a.email')), f.col('b.email')).otherwise(f.col('a.email')).alias('email'),
        f.when(f.isnull(f.col('a.phone_number')), f.col('b.phone_number')).otherwise(f.col('a.phone_number')).alias('phone_number'),
        f.when(f.isnull(f.col('a.address')), f.col('b.address')).otherwise(f.col('a.address')).alias('address'),
        f.when(f.isnull(f.col('a.address_2')), f.col('b.address_2')).otherwise(f.col('a.address_2')).alias('address_2'),
        f.when(f.isnull(f.col('a.city')), f.col('b.city')).otherwise(f.col('a.city')).alias('city'),
        f.when(f.isnull(f.col('a.email_alt')), f.col('b.email_alt')).otherwise(f.col('a.email_alt')).alias('email_alt'),
        'a.updated', 'a.date', 'a.client_reference_code', 'a.reservation_status',
        'a.total_cancellations', 'a.total_covers', 'a.total_noshows', 'a.total_spend',
        'a.total_spend_per_cover', 'a.total_spend_per_visit', 'a.total_visits', 'a.id',
    )
)

If the number of fields grows over time, how can I handle this with a loop so that the code is automated?

I tried the code below but got an error. Can anyone help?

col_list = [all required fields]

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
    .select('a.name_id', 'a.SUM', 'a.full_name',
        for x in col_list:
            f.when(f.isnull(f.col('a.x')), f.col('b.x')).otherwise(f.col('a.x')).alias('x'),
    )

I think I can't use a loop inside select. Please suggest another way.

1 Answer

Add the required columns or column expressions to a list, then pass that list to select.

Check the code below.

col_list = [all required fields] 

Using when function

# Build the column names from x; a quoted 'a.x' would refer to a literal column named x.
colExpr = ['a.name_id','a.SUM','a.full_name'] + list(map(lambda x: f.when(f.isnull(f.col('a.' + x)),f.col('b.' + x)).otherwise(f.col('a.' + x)).alias(x),col_list))
df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').select(*colExpr) # select 
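
One thing to watch in the when version: the column name has to be built from the loop variable. A quoted 'a.x' is the literal string a.x, not the current field. A tiny plain-Python illustration (the two field names are just examples borrowed from the question):

```python
col_list = ["first_name", "email"]  # example field names taken from the question

# A quoted 'a.x' never looks at the loop variable:
# every entry is the same literal string.
literal_names = ["a.x" for x in col_list]

# Building the name from x gives one qualified column name per field.
qualified_names = ["a." + x for x in col_list]

print(literal_names)
print(qualified_names)
```

The same applies to alias: pass the variable x, not the string 'x'.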

Using nvl function

colExpr = ['a.name_id','a.SUM','a.full_name'] + list(map(lambda x: "nvl(a.{},b.{}) as {}".format(x,x,x),col_list)) 
df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').selectExpr(*colExpr) # selectExpr 
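
If the field list keeps growing, it may help to wrap the string-building step in a small helper. This is only a sketch in plain Python, so it can be tried without Spark. build_select_exprs and its parameters are hypothetical names, not part of PySpark, and it uses coalesce (the Spark SQL function that returns its first non-null argument) where the snippet above uses nvl.

```python
def build_select_exprs(passthrough, fallback_cols, left="a", right="b"):
    """Return SQL expression strings intended for DataFrame.selectExpr.

    passthrough   -- columns taken as-is from the left alias
    fallback_cols -- columns taken from the left alias, or from the right
                     alias when the left value is null
    """
    exprs = [f"{left}.{c}" for c in passthrough]
    exprs += [f"coalesce({left}.{c}, {right}.{c}) AS {c}" for c in fallback_cols]
    return exprs


# Example field names borrowed from the question.
exprs = build_select_exprs(["name_id", "SUM", "full_name"], ["first_name", "email"])
for e in exprs:
    print(e)
```

The resulting list can then be passed to .selectExpr(*exprs) on the joined DataFrame, the same way colExpr is used above.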