
I have a table with three columns:

  • product names
  • product use case/usage
  • user ID

For each product, I want to extract all of its use-cases, and then, for each of these use-cases, the percentage of users who use the product for that purpose. Here is an example of the data:

product-name    use-case      user-ID
A               therapy       X
B               relaxation    X
C               health        Y
A               relaxation    Z
  1. I want to group by the product names.
  2. Then, for each product name, I want to group by the use-cases.
  3. Then, for each use-case (within a product name), I want to see the percentage of users (based on their user-IDs). My desired result is to be able to say that xx% of product A's users are using this product for relaxation...

The output should show, for each product and use-case, the share of that product's users. For example, I can say 50% of product A's users are using it for therapy and the other 50% for relaxation.
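For reference, the sample data above can be loaded into a PySpark DataFrame roughly like this (the Spark session, the exact column casing, and the variable name df are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical reconstruction of the sample table shown above
df = spark.createDataFrame(
    [('A', 'therapy',    'X'),
     ('B', 'relaxation', 'X'),
     ('C', 'health',     'Y'),
     ('A', 'relaxation', 'Z')],
    ['product-name', 'use-case', 'user-ID'],
)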

Thanks a lot.

1 Answer


Aggregate in two steps and then join:

import pyspark.sql.functions as F

(df.groupBy(['product-name', 'Use-case'])
   .count()                                      # users per (product, use-case)
   .withColumnRenamed('count', 'User counts')
   .join(
       df.groupBy('product-name').count(),       # total users per product
       ['product-name']
   )
   .withColumn('User counts', F.col('User counts') / F.col('count'))  # fraction per use-case
   .drop('count')
   .show())

+------------+----------+-----------+
|product-name|  Use-case|User counts|
+------------+----------+-----------+
|           B|Relaxation|        1.0|
|           C|    health|        1.0|
|           A|   therapy|        0.5|
|           A|relaxation|        0.5|
+------------+----------+-----------+
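The same ratio can also be computed without the second aggregation and the join, using a window partitioned by product. A sketch under the same assumptions about df and its column names; the output should match the table above:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('product-name')

(df.groupBy('product-name', 'Use-case')
   .count()
   # divide each (product, use-case) count by the product's total count
   .withColumn('User counts', F.col('count') / F.sum('count').over(w))
   .drop('count')
   .show())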

2 Comments

Hi, what if I want to keep the user-ID column? @Psidom, thanks a lot.
Thanks @Psidom, it worked for me. I hope it will help others.
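Regarding the first comment, one way to keep the user IDs alongside the percentages is to collect them during the first aggregation. A rough sketch, again assuming the df and column names used above:

import pyspark.sql.functions as F

per_use_case = (df.groupBy('product-name', 'Use-case')
                  .agg(F.count('user-ID').alias('User counts'),
                       F.collect_list('user-ID').alias('user IDs')))  # keep the IDs per group

totals = df.groupBy('product-name').count()  # total users per product

result = (per_use_case.join(totals, ['product-name'])
          .withColumn('User counts', F.col('User counts') / F.col('count'))
          .drop('count'))

result.show()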
