
I have a table with three columns:

  • product names
  • product use case/usage
  • user ID

For each product, I want to extract all of its use-cases, and then, for each of these use-cases, the percentage of users who use the product for that purpose. Here is an example of the data:

product-name    use-case      user-ID
A               therapy       X
B               relaxation    X
C               health        Y
A               relaxation    Z
  1. I want to group by the product names.
  2. Then, for each product name, I want to group by the use-cases.
  3. Then, for each use-case (within a product name), I want to see the percentage of users (based on their user-IDs). My desired result is to be able to say that xx% of product A's users are using this product for relaxation...

The output should show, for each product and use-case, the share of that product's users. For example, I can say 50% of product A's users are using it for therapy and the other 50% for relaxation.
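For reference, the sample data above can be loaded into a PySpark DataFrame roughly like this (the Spark session, the exact column casing, and the variable name df are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical reconstruction of the sample table shown above
df = spark.createDataFrame(
    [('A', 'therapy',    'X'),
     ('B', 'relaxation', 'X'),
     ('C', 'health',     'Y'),
     ('A', 'relaxation', 'Z')],
    ['product-name', 'use-case', 'user-ID'],
)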

Thanks a lot.

1 Answer


Aggregate in two steps and then join:

import pyspark.sql.functions as F

(df.groupBy(['product-name', 'Use-case'])
   .count()                                      # users per (product, use-case)
   .withColumnRenamed('count', 'User counts')
   .join(
       df.groupBy('product-name').count(),       # total users per product
       ['product-name']
   )
   .withColumn('User counts', F.col('User counts') / F.col('count'))  # fraction per use-case
   .drop('count')
   .show())

+------------+----------+-----------+
|product-name|  Use-case|User counts|
+------------+----------+-----------+
|           B|Relaxation|        1.0|
|           C|    health|        1.0|
|           A|   therapy|        0.5|
|           A|relaxation|        0.5|
+------------+----------+-----------+
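The same ratio can also be computed without the second aggregation and the join, using a window partitioned by product. A sketch under the same assumptions about df and its column names; the output should match the table above:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('product-name')

(df.groupBy('product-name', 'Use-case')
   .count()
   # divide each (product, use-case) count by the product's total count
   .withColumn('User counts', F.col('count') / F.sum('count').over(w))
   .drop('count')
   .show())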

2 Comments

Hi, what if I want to keep the user-ID column? @Psidom, thanks a lot.
Thanks @Psidom, it worked for me. I hope it will help others.
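Regarding the first comment, one way to keep the user IDs alongside the percentages is to collect them during the first aggregation. A rough sketch, again assuming the df and column names used above:

import pyspark.sql.functions as F

per_use_case = (df.groupBy('product-name', 'Use-case')
                  .agg(F.count('user-ID').alias('User counts'),
                       F.collect_list('user-ID').alias('user IDs')))  # keep the IDs per group

totals = df.groupBy('product-name').count()  # total users per product

result = (per_use_case.join(totals, ['product-name'])
          .withColumn('User counts', F.col('User counts') / F.col('count'))
          .drop('count'))

result.show()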
