
I have region-wise source data for customers, like below:

region,source,consumer_id
APAC,mail,1
APAC,referral,2
APAC,mail,3
APAC,referral,5
APAC,mail,6
APAC,referral,7
APAC,referral,8
US East,mail,9
US East,referral,10
US East,walkIn,11
AUS,walkIn,12
AUS,referral,13

Can someone help me get the region-wise source counts, like below, using PySpark DataFrames?

region,mail_source_cnt,referral_source_cnt,walkIn_source_cnt
APAC,3,4,0
US East,1,1,1
AUS,0,1,1

Thanks for the help.


1 Answer


You can aggregate to get the counts and then pivot the columns:

>>> from pyspark.sql import functions as F
>>> df.show()
+-------+--------+-----------+
| region|  source|consumer_id|
+-------+--------+-----------+
|   APAC|    mail|          1|
|   APAC|referral|          2|
|   APAC|    mail|          3|
|   APAC|referral|          5|
|   APAC|    mail|          6|
|   APAC|referral|          7|
|   APAC|referral|          8|
|US East|    mail|          9|
|US East|referral|         10|
|US East|  walkIn|         11|
|    AUS|  walkIn|         12|
|    AUS|referral|         13|
+-------+--------+-----------+

>>> df1 = df.groupby('region','source').count()
>>> df1.show()
+-------+--------+-----+
| region|  source|count|
+-------+--------+-----+
|US East|  walkIn|    1|
|    AUS|  walkIn|    1|
|   APAC|    mail|    3|
|   APAC|referral|    4|
|US East|    mail|    1|
|US East|referral|    1|
|    AUS|referral|    1|
+-------+--------+-----+

>>> df2 = df1.withColumn('ccol', F.concat(df1['source'], F.lit('_cnt'))).groupby('region').pivot('ccol').agg(F.first('count')).fillna(0)
>>> df2.show()
+-------+--------+------------+----------+
| region|mail_cnt|referral_cnt|walkIn_cnt|
+-------+--------+------------+----------+
|    AUS|       0|           1|         1|
|   APAC|       3|           4|         0|
|US East|       1|           1|         1|
+-------+--------+------------+----------+