Referring to my previous question here: I am trying to compute and add a total row (the sum of usage) for each brand, parent, and week_num.

Here is a dummy sample:
```python
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
```

This snippet adds a total row per channel:
```python
# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)

# Create a temp variable so totals sort after the detail rows
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))

# Union dataframes, drop temp variable and show
df1 = (
    df0.unionByName(totals)
    .sort(["week_num", "parent", "brand", "sort_id"])
    .drop("sort_id")
)
df1.show()
```

Result:
```
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c27| c27app| 1100|
|       5|     C|  c27|  Total| 1100|
|       5|     C|  c28| c26app| 1200|
|       5|     C|  c28|  Total| 1200|
+--------+------+-----+-------+-----+
```

That works for the channel column. To get something like the output below, I simply repeat the first groupBy + sum step and union the result back.
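As a sanity check, the per-channel total logic above can be reproduced without Spark. Here is a plain-Python sketch over the same dummy rows (itertools.groupby in place of groupBy, appending the total row last instead of using the sort_id trick) — just an illustration of what the union produces, not the Spark code itself:

```python
from itertools import groupby

rows = [
    (2, "A", "A2", "A2web", 2500), (2, "A", "A2", "A2TV", 3500),
    (4, "A", "A1", "A2app", 5500), (4, "A", "AD", "ADapp", 2000),
    (4, "B", "B25", "B25app", 7600), (4, "B", "B26", "B26app", 5600),
    (5, "C", "c25", "c25app", 2658), (5, "C", "c27", "c27app", 1100),
    (5, "C", "c28", "c26app", 1200),
]

key = lambda r: (r[0], r[1], r[2])  # week_num, parent, brand

out = []
for grp, members in groupby(sorted(rows, key=key), key=key):
    members = list(members)
    out.extend(members)                       # detail rows first (sort_id 1)
    total = sum(r[4] for r in members)
    out.append((*grp, "Total", total))        # total row last (sort_id 2)
```

The multi-level output I actually want looks like this: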
```
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|
```

Here it is in two steps:
```python
# Add brand total rows
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])

# Add week_num total rows
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
```

Result:
```
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     A|Total|       | 7500|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|Total|       |13200|
|       4| Total|     |       |20700|
|       5|     C|Total|       | 4958|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c27|  Total| 1100|
+--------+------+-----+-------+-----+
```

First question: is there an alternative or more efficient approach that avoids the repetition? Second question: what if I want the Total row to always appear at the top of each group, regardless of the alphabetical order of the parent/brand/channel names? How can I sort it like this (this is dummy data, but I hope it is clear enough):
```
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2| Total|     |       | 6000|
|       2|     A|Total|       | 6000|
|       2|     A|   A2|  Total| 6000|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       4| Total|     |       |20700|
|       4|     A|Total|       | 7500|
|       4|     B|Total|       |13200|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   AD|  Total| 2000|
|       4|     A|   AD|  ADapp| 2000|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|  B26| B26app| 5600|
```
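For the second question, what I have in mind is a per-level sort key that ranks "Total" before any real name. Here is a plain-Python illustration of that ordering (not Spark code; in Spark I imagine it would be extra sort columns built with f.when(...).otherwise(...), which is part of what I am asking about):

```python
def sort_key(row):
    """Rank 'Total' rows before real names at each level (False sorts before True)."""
    week, parent, brand, channel, usage = row
    return (
        week,
        (parent != "Total", parent),
        (brand != "Total", brand),
        (channel != "Total", channel),
    )

rows = [
    (2, "A", "A2", "A2web", 2500),
    (2, "A", "A2", "A2TV", 3500),
    (2, "A", "A2", "Total", 6000),
    (2, "A", "Total", "", 6000),
    (2, "Total", "", "", 6000),
]

# Week total first, then brand total, then channel total, then channels A-Z
ordered = sorted(rows, key=sort_key)
```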