I have a PySpark data frame like the one below.
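For reference, the example frame can be reproduced with something like the following (a minimal sketch; spark is assumed to be an existing SparkSession):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Example data matching the df.show() output below
    df = spark.createDataFrame(
        [(3, 'mac pro'), (1, 'iphone'), (1, 'android phone'),
         (1, 'windows pc'), (1, 'spy camera'), (2, 'spy camera'),
         (2, 'iphone'), (3, 'spy camera'), (3, 'cctv')],
        ['id', 'device'])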
    df.show()
    +---+-------------+
    | id|       device|
    +---+-------------+
    |  3|      mac pro|
    |  1|       iphone|
    |  1|android phone|
    |  1|   windows pc|
    |  1|   spy camera|
    |  2|   spy camera|
    |  2|       iphone|
    |  3|   spy camera|
    |  3|         cctv|
    +---+-------------+

    phone_list = ['iphone', 'android phone', 'nokia']
    pc_list = ['windows pc', 'mac pro']
    security_list = ['spy camera', 'cctv']

    from pyspark.sql.functions import col

    phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
    phones_df.show()
    +---+------+
    | id|phones|
    +---+------+
    |  1|     2|
    |  2|     1|
    +---+------+

    pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
    pc_df.show()
    +---+---+
    | id| pc|
    +---+---+
    |  1|  1|
    |  3|  1|
    +---+---+

    security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
    security_df.show()
    +---+--------+
    | id|security|
    +---+--------+
    |  1|       1|
    |  2|       1|
    |  3|       2|
    +---+--------+
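Building these three frames repeats the same filter/groupBy/count pattern, so I suspect a loop along these lines would produce all of them in one go (a rough sketch, not tested beyond this example; categories and category_dfs are just names I made up):

    import pyspark.sql.functions as f

    # Map each output column name to its device list
    categories = {'phones': phone_list, 'pc': pc_list, 'security': security_list}

    # One aggregated frame per category, equivalent to phones_df, pc_df, security_df
    category_dfs = [
        df.filter(f.col('device').isin(devices))
          .groupBy('id')
          .agg(f.count('*').alias(name))
        for name, devices in categories.items()
    ]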
Then I want to do a full outer join on all three data frames. I have done it like below.

    import pyspark.sql.functions as f

    full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(
        f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
    final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(
        f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
    final_df.show()
    +---+------+----+--------+
    | id|phones|  pc|security|
    +---+------+----+--------+
    |  1|     2|   1|       1|
    |  2|     1|null|       1|
    |  3|  null|   1|       2|
    +---+------+----+--------+

I am able to get what I want, but I would like to simplify my code:
1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code for each of them and would like to reduce that (is the loop sketch above a reasonable direction?).
2) I want to collapse the join statements into a single statement (my current attempt is sketched below).

How can I do this? Could anyone explain?
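For the join part, the direction I am currently considering is this (a sketch building on the category_dfs list above, not a settled solution): joining on the column name rather than on an expression makes Spark keep a single merged id column, which removes the need for coalesce, and functools.reduce chains the joins into one statement.

    from functools import reduce

    # Chain full outer joins; using the column name 'id' as the join key
    # keeps one merged id column, so no coalesce is needed
    final_df = reduce(
        lambda left, right: left.join(right, 'id', 'full_outer'),
        category_dfs)
    final_df.show()

An alternative that might avoid the joins entirely is a single conditional aggregation (again a sketch): f.sum over a when(...) with no otherwise yields null when nothing matches, which should reproduce the nulls from the outer join.

    # Single-pass alternative: no intermediate frames, no joins
    final_df = df.groupBy('id').agg(
        *[f.sum(f.when(f.col('device').isin(devices), 1)).alias(name)
          for name, devices in categories.items()])
    final_df.show()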