0

In this example, I have the following dataframe:

client_id rule_1 rule_2 rule_3 rule_4 rule_5 1 1 0 1 0 0 2 0 1 0 0 0 3 0 1 1 1 0 4 1 0 1 1 1 

It shows the client_id and if he's obeying a certain rule or not.

How would I be able to count the number of clients that obey each rule, in a way that I can show all information in one dataframe?

rule obeys count rule_1 0 23852 rule_1 1 95102 rule_2 0 12942 rule_2 1 45884 rule_3 0 29319 rule_3 1 9238 rule_4 0 55321 rule_4 1 23013 rule_5 0 96842 rule_5 1 86739 

2 Answers 2

1

The operation of moving column names to rows is called unpivoting. In Spark, it is done using stack function.

Input:

from pyspark.sql import functions as F df = spark.createDataFrame( [(1, 1, 0, 1, 0, 0), (2, 0, 1, 0, 0, 0), (3, 0, 1, 1, 1, 0), (4, 1, 0, 1, 1, 1)], ["client_id", "rule_1", "rule_2", "rule_3", "rule_4", "rule_5"]) 

Script:

to_unpivot = [f"\'{c}\', `{c}`" for c in df.columns if c != "client_id"] stack_str = ",".join(to_unpivot) df = (df .select(F.expr(f"stack({len(to_unpivot)}, {stack_str}) as (rule, obeys)")) .groupBy("rule", "obeys") .count() ) df.show() # +------+-----+-----+ # | rule|obeys|count| # +------+-----+-----+ # |rule_1| 1| 2| # |rule_2| 1| 2| # |rule_1| 0| 2| # |rule_3| 1| 3| # |rule_2| 0| 2| # |rule_4| 0| 2| # |rule_3| 0| 1| # |rule_5| 0| 3| # |rule_5| 1| 1| # |rule_4| 1| 2| # +------+-----+-----+ 
Sign up to request clarification or add additional context in comments.

Comments

1

We can transpose down the rule columns and take a count. Here's an example using the sample in your question.

rule_cols = [k for k in data_sdf.columns if 'rule' in k] # ['rule_1', 'rule_2', 'rule_3', 'rule_4', 'rule_5'] data_sdf. \ withColumn('arr_rule_structs', func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) for k in rule_cols]) ). \ selectExpr('id', 'inline(arr_rule_structs)'). \ groupBy('key', 'val'). \ agg(func.count('id').alias('cnt')). \ orderBy('key', 'val'). \ show() # +------+---+---+ # | key|val|cnt| # +------+---+---+ # |rule_1| 0| 2| # |rule_1| 1| 2| # |rule_2| 0| 2| # |rule_2| 1| 2| # |rule_3| 0| 1| # |rule_3| 1| 3| # |rule_4| 0| 2| # |rule_4| 1| 2| # |rule_5| 0| 3| # |rule_5| 1| 1| # +------+---+---+ 

Feel free to use your own field names within the struct instead of key and val.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.