PySpark code optimization - handling it in a better way

Question

I have two PySpark dataframes, 'a' and 'b', which I left join. I select a few fields directly from dataframe 'a'; for the other fields I check a condition (if the field in 'a' is null, select the field from dataframe 'b'). The code below works fine and gives the required result.

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
    .select(
        'a.name_id', 'a.SUM', 'a.full_name',
        f.when(f.isnull(f.col('a.first_name')), f.col('b.first_name')).otherwise(f.col('a.first_name')).alias('first_name'),
        f.when(f.isnull(f.col('a.last_name')), f.col('b.last_name')).otherwise(f.col('a.last_name')).alias('last_name'),
        f.when(f.isnull(f.col('a.email')), f.col('b.email')).otherwise(f.col('a.email')).alias('email'),
        f.when(f.isnull(f.col('a.phone_number')), f.col('b.phone_number')).otherwise(f.col('a.phone_number')).alias('phone_number'),
        f.when(f.isnull(f.col('a.address')), f.col('b.address')).otherwise(f.col('a.address')).alias('address'),
        f.when(f.isnull(f.col('a.address_2')), f.col('b.address_2')).otherwise(f.col('a.address_2')).alias('address_2'),
        f.when(f.isnull(f.col('a.city')), f.col('b.city')).otherwise(f.col('a.city')).alias('city'),
        f.when(f.isnull(f.col('a.email_alt')), f.col('b.email_alt')).otherwise(f.col('a.email_alt')).alias('email_alt'),
        'a.updated', 'a.date', 'a.client_reference_code', 'a.reservation_status',
        'a.total_cancellations', 'a.total_covers', 'a.total_noshows', 'a.total_spend',
        'a.total_spend_per_cover', 'a.total_spend_per_visit', 'a.total_visits', 'a.id')

If the number of fields increases over time, how can I handle this with a loop so that it is automated?

I tried the code below but got an error. Can anyone help?

col_list = [all required fields]

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left')\
    .select('a.name_id', 'a.SUM', 'a.full_name',
        for x in col_list:
            f.when(f.isnull(f.col('a.x')), f.col('b.x')).otherwise(f.col('a.x')).alias('x'),
    )

I think I can't use a loop inside select. Please suggest another way.

aamirmalik127 · Accepted Answer · 2020-11-17 07:33:57Z

Add the required columns or column expressions to a list, then pass that list to select.

Check the code below.

col_list = [all required fields]

Using when function

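# Assumes `import pyspark.sql.functions as f`, as in the question.
# Build each qualified name from x; the literal 'a.x' would look for a column named "x".
colExpr = ['a.name_id', 'a.SUM', 'a.full_name'] + list(map(
    lambda x: f.when(f.isnull(f.col('a.' + x)), f.col('b.' + x)).otherwise(f.col('a.' + x)).alias(x),
    col_list))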

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').select(*colExpr) # select

Using nvl function

colExpr = ['a.name_id','a.SUM','a.full_name'] + list(map(lambda x: "nvl(a.{},b.{}) as {}".format(x,x,x),col_list))

df_final = df1.alias('a').join(df2.alias('b'), on=['name_id_forwarded'], how='left').selectExpr(*colExpr) # selectExpr
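
The nvl variant only needs plain strings, so the list can be built and inspected without a Spark session. Below is a minimal sketch of that idea; the helper name `build_select_exprs`, its `left`/`right` parameters, and the three sample fallback columns are assumptions made for illustration, not part of the original answer.

```python
def build_select_exprs(passthrough, fallback_cols, left="a", right="b"):
    """Return Spark SQL expression strings suitable for DataFrame.selectExpr.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Columns taken unchanged from the left-hand alias.
    exprs = [f"{left}.{name}" for name in passthrough]
    # Columns that fall back to the right-hand alias when the left value is null.
    exprs += [
        f"nvl({left}.{name}, {right}.{name}) AS {name}"
        for name in fallback_cols
    ]
    return exprs


passthrough = ["name_id", "SUM", "full_name"]
fallback_cols = ["first_name", "last_name", "email"]  # sample subset of col_list

col_expr = build_select_exprs(passthrough, fallback_cols)
for expr in col_expr:
    print(expr)

# With df1 and df2 from the question available, the list would be unpacked
# into selectExpr in the same way as above:
# df_final = (df1.alias("a")
#             .join(df2.alias("b"), on=["name_id_forwarded"], how="left")
#             .selectExpr(*col_expr))
```

When a new field has to be handled, only `fallback_cols` changes; the join and the selectExpr call stay the same.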
