
I have a data frame in PySpark like the one below.

df.show()

+---+-------------+
| id|       device|
+---+-------------+
|  3|      mac pro|
|  1|       iphone|
|  1|android phone|
|  1|   windows pc|
|  1|   spy camera|
|  2|   spy camera|
|  2|       iphone|
|  3|   spy camera|
|  3|         cctv|
+---+-------------+

phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']

from pyspark.sql.functions import col

phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")

phones_df.show()

+---+------+
| id|phones|
+---+------+
|  1|     2|
|  2|     1|
+---+------+

pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")

pc_df.show()

+---+---+
| id| pc|
+---+---+
|  1|  1|
|  3|  1|
+---+---+

security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")

security_df.show()

+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+

Then I want to do a full outer join on all three data frames. I have done it like below.

import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(
    f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)

final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(
    f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)

final_df.show()

+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+

I am able to get what I want, but I would like to simplify my code.

1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code for each of these data frames and want to reduce that.
2) I want to simplify the two join statements into one statement (a sketch addressing both points follows below).

How can I do this? Could anyone explain?
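As a rough illustration of both points (a sketch only, not the accepted answer below): the repeated filter/count code can be factored into a small helper, and the full outer joins can be chained with functools.reduce. Joining on the column name 'id' keeps a single id column, so the coalesce calls are no longer needed. The helper name count_devices is made up for this example; phone_list, pc_list, and security_list are the lists defined in the question.

from functools import reduce
from pyspark.sql import functions as F

# Hypothetical helper: one count-per-id frame for a given device list.
def count_devices(df, devices, alias):
    return (df.filter(F.col('device').isin(devices))
              .groupBy('id')
              .count()
              .withColumnRenamed('count', alias))

category_dfs = [
    count_devices(df, phone_list, 'phones'),
    count_devices(df, pc_list, 'pc'),
    count_devices(df, security_list, 'security'),
]

# Joining on the column name 'id' avoids duplicate id columns,
# so reduce can simply chain the full outer joins.
final_df = reduce(lambda left, right: left.join(right, 'id', 'full_outer'), category_dfs)
final_df.show()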

1 Answer


Here is one way using when.otherwise to map the device column to categories, and then pivot to get the desired output:

import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()

+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
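A possible variation on the same idea, assuming the same device lists: the nested otherwise() calls can be flattened into chained when() calls, and listing the pivot values explicitly lets Spark skip the extra job it would otherwise run to infer the distinct categories.

import pyspark.sql.functions as F

# Same mapping as above, written with chained when() calls; devices that
# match none of the lists still end up as null in 'cat'.
cat = (F.when(F.col('device').isin(phone_list), 'phones')
        .when(F.col('device').isin(pc_list), 'pc')
        .when(F.col('device').isin(security_list), 'security'))

(df.withColumn('cat', cat)
   .groupBy('id')
   .pivot('cat', ['phones', 'pc', 'security'])  # explicit values avoid the inference pass
   .agg(F.count('cat'))
   .show())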

1 Comment

Just a small doubt here: if I want to compute an average of devices over the last 10 days for each id, how can I do that? The df has records for the last 10 days.
