
I have a data frame in PySpark like the one below.

df.show()

+---+-------------+
| id|       device|
+---+-------------+
|  3|      mac pro|
|  1|       iphone|
|  1|android phone|
|  1|   windows pc|
|  1|   spy camera|
|  2|   spy camera|
|  2|       iphone|
|  3|   spy camera|
|  3|         cctv|
+---+-------------+

phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']

from pyspark.sql.functions import col

phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")

phones_df.show()

+---+------+
| id|phones|
+---+------+
|  1|     2|
|  2|     1|
+---+------+

pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")

pc_df.show()

+---+---+
| id| pc|
+---+---+
|  1|  1|
|  3|  1|
+---+---+

security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")

security_df.show()

+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+

Then I want to do a full outer join on all three data frames. I have done it like below.

import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(
    f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)

final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(
    f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)

final_df.show()

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+

I am able to get what I want, but I would like to simplify my code.

1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code for each of these data frames and want to reduce that.
2) I want to simplify the two join statements into one statement (a sketch addressing both points follows below).

How can I do this? Could anyone explain?
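As a rough illustration of both points (a sketch only, not the accepted answer below): the repeated filter/count code can be factored into a small helper, and the full outer joins can be chained with functools.reduce. Joining on the column name 'id' keeps a single id column, so the coalesce calls are no longer needed. The helper name count_devices is made up for this example; phone_list, pc_list, and security_list are the lists defined in the question.

from functools import reduce
from pyspark.sql import functions as F

# Hypothetical helper: one count-per-id frame for a given device list.
def count_devices(df, devices, alias):
    return (df.filter(F.col('device').isin(devices))
              .groupBy('id')
              .count()
              .withColumnRenamed('count', alias))

category_dfs = [
    count_devices(df, phone_list, 'phones'),
    count_devices(df, pc_list, 'pc'),
    count_devices(df, security_list, 'security'),
]

# Joining on the column name 'id' avoids duplicate id columns,
# so reduce can simply chain the full outer joins.
final_df = reduce(lambda left, right: left.join(right, 'id', 'full_outer'), category_dfs)
final_df.show()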

1 Answer


Here is one way using when.otherwise to map the device column to categories, and then pivot to get the desired output:

import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()

+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
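A possible variation on the same idea, assuming the same device lists: the nested otherwise() calls can be flattened into chained when() calls, and listing the pivot values explicitly lets Spark skip the extra job it would otherwise run to infer the distinct categories.

import pyspark.sql.functions as F

# Same mapping as above, written with chained when() calls; devices that
# match none of the lists still end up as null in 'cat'.
cat = (F.when(F.col('device').isin(phone_list), 'phones')
        .when(F.col('device').isin(pc_list), 'pc')
        .when(F.col('device').isin(security_list), 'security'))

(df.withColumn('cat', cat)
   .groupBy('id')
   .pivot('cat', ['phones', 'pc', 'security'])  # explicit values avoid the inference pass
   .agg(F.count('cat'))
   .show())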

1 Comment

Just a small doubt here: if I want to compute an average of devices over the last 10 days for each id, how can I do that? The df has records for the last 10 days.
