I have two PySpark DataFrames, aliased 'a' and 'b', which I left-join. I select a few fields directly from 'a', while for the other fields I check a condition: if the field in 'a' is null, take the same field from 'b'. The code below works fine and gives me the required result.

    df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
        .select(
            'a.name_id', 'a.SUM', 'a.full_name',
            f.when(f.isnull(f.col('a.first_name')), f.col('b.first_name')).otherwise(f.col('a.first_name')).alias('first_name'),
            f.when(f.isnull(f.col('a.last_name')), f.col('b.last_name')).otherwise(f.col('a.last_name')).alias('last_name'),
            f.when(f.isnull(f.col('a.email')), f.col('b.email')).otherwise(f.col('a.email')).alias('email'),
            f.when(f.isnull(f.col('a.phone_number')), f.col('b.phone_number')).otherwise(f.col('a.phone_number')).alias('phone_number'),
            f.when(f.isnull(f.col('a.address')), f.col('b.address')).otherwise(f.col('a.address')).alias('address'),
            f.when(f.isnull(f.col('a.address_2')), f.col('b.address_2')).otherwise(f.col('a.address_2')).alias('address_2'),
            f.when(f.isnull(f.col('a.city')), f.col('b.city')).otherwise(f.col('a.city')).alias('city'),
            f.when(f.isnull(f.col('a.email_alt')), f.col('b.email_alt')).otherwise(f.col('a.email_alt')).alias('email_alt'),
            'a.updated', 'a.date', 'a.client_reference_code', 'a.reservation_status',
            'a.total_cancellations', 'a.total_covers', 'a.total_noshows', 'a.total_spend',
            'a.total_spend_per_cover', 'a.total_spend_per_visit', 'a.total_visits', 'a.id')

If the number of fields increases over time, how can I handle this with a loop so that the code does not have to be edited by hand each time?
I tried the code below, but I get an error. Can anyone help?

    col_list = [all required fields]
    df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
        .select('a.name_id', 'a.SUM', 'a.full_name',\
            for x in col_list:
                f.when(f.isnull(f.col('a.x')), f.col('b.x')).otherwise(f.col('a.x')).alias('x'),
        )

I think I can't use a loop inside select. Please suggest another way.