
I have region-wise source data for customers, like below:

region,source,consumer_id
APAC,mail,1
APAC,referral,2
APAC,mail,3
APAC,referral,5
APAC,mail,6
APAC,referral,7
APAC,referral,8
US East,mail,9
US East,referral,10
US East,walkIn,11
AUS,walkIn,12
AUS,referral,13

Can someone help me get the region-wise source counts, like below, using PySpark DataFrames?

region,mail_source_cnt,referral_source_cnt,walkIn_source_cnt
APAC,3,4,0
US East,1,1,1
AUS,0,1,1

Thanks for the help.


1 Answer


You can aggregate to get the counts and then pivot the columns:

>>> from pyspark.sql import functions as F
>>> df.show()
+-------+--------+-----------+
| region|  source|consumer_id|
+-------+--------+-----------+
|   APAC|    mail|          1|
|   APAC|referral|          2|
|   APAC|    mail|          3|
|   APAC|referral|          5|
|   APAC|    mail|          6|
|   APAC|referral|          7|
|   APAC|referral|          8|
|US East|    mail|          9|
|US East|referral|         10|
|US East|  walkIn|         11|
|    AUS|  walkIn|         12|
|    AUS|referral|         13|
+-------+--------+-----------+

>>> df1 = df.groupby('region','source').count()
>>> df1.show()
+-------+--------+-----+
| region|  source|count|
+-------+--------+-----+
|US East|  walkIn|    1|
|    AUS|  walkIn|    1|
|   APAC|    mail|    3|
|   APAC|referral|    4|
|US East|    mail|    1|
|US East|referral|    1|
|    AUS|referral|    1|
+-------+--------+-----+

>>> df2 = df1.withColumn('ccol', F.concat(df1['source'], F.lit('_cnt'))).groupby('region').pivot('ccol').agg(F.first('count')).fillna(0)
>>> df2.show()
+-------+--------+------------+----------+
| region|mail_cnt|referral_cnt|walkIn_cnt|
+-------+--------+------------+----------+
|    AUS|       0|           1|         1|
|   APAC|       3|           4|         0|
|US East|       1|           1|         1|
+-------+--------+------------+----------+