
Generate a matrix of column sums and row sums (as a new column) in a PySpark DataFrame

colors = spark.createDataFrame(
    [("Red", "Re", 20), ("Blue", "Bl", 30), ("Green", "Gr", 50)]
).toDF("Colors", "Prefix", "Value")

+------+------+-----+
|Colors|Prefix|Value|
+------+------+-----+
|   Red|    Re|   20|
|  Blue|    Bl|   30|
| Green|    Gr|   50|
+------+------+-----+

piv = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
piv.withColumn("total", sum(piv[col] for col in piv.columns[1:])).show()

+------+---+---+---+-----+
|Colors| Bl| Gr| Re|total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
+------+---+---+---+-----+
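For context, the pivot step reshapes the long-format (Colors, Prefix, Value) triples into a wide Colors-by-Prefix matrix, filling missing cells with 0. A minimal pure-Python sketch of that reshaping (an illustration of the logic only, not Spark's implementation):

```python
# Long-format rows, as in the example DataFrame.
rows = [("Red", "Re", 20), ("Blue", "Bl", 30), ("Green", "Gr", 50)]

# Distinct pivot values, sorted (Spark also sorts pivot columns).
prefixes = sorted({p for _, p, _ in rows})  # ['Bl', 'Gr', 'Re']

# One wide row per color: sum values per prefix, fill gaps with 0.
wide = {}
for color, prefix, value in rows:
    cells = wide.setdefault(color, {p: 0 for p in prefixes})
    cells[prefix] += value

print(wide["Green"])  # {'Bl': 0, 'Gr': 50, 'Re': 0}
```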

I also want a TOTAL row of column sums, like below. The code should be dynamic, so it keeps working if the DataFrame has more columns or rows.

       Re  Bl  Gr  TOTAL
Red    20   0   0     20
Blue    0  30   0     30
Green   0   0  50     50
TOTAL  20  30  50    100

1 Answer


Here is one way. I have used a list comprehension inside agg to sum each column, then unioned that totals row onto the pivoted DataFrame.

import pyspark.sql.functions as f

df = colors.groupby("Colors").pivot("Prefix").sum("Value").fillna(0)
cols = df.columns[1:]

df.union(df.agg(f.lit('Total').alias('Colors'),
                *[f.sum(f.col(c)).alias(c) for c in cols])) \
  .withColumn("Total", sum(f.col(c) for c in cols)) \
  .show()

+------+---+---+---+-----+
|Colors| Bl| Gr| Re|Total|
+------+---+---+---+-----+
| Green|  0| 50|  0|   50|
|  Blue| 30|  0|  0|   30|
|   Red|  0|  0| 20|   20|
| Total| 30| 50| 20|  100|
+------+---+---+---+-----+
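The agg/union/withColumn chain boils down to two steps: append a row of per-column sums, then add a per-row total column. A pure-Python sketch of that arithmetic (illustrative only, assuming the pivoted table is held as a dict of column lists):

```python
# Pivoted table as column lists, mirroring the example output.
table = {
    "Colors": ["Green", "Blue", "Red"],
    "Bl": [0, 30, 0],
    "Gr": [50, 0, 0],
    "Re": [0, 0, 20],
}
value_cols = [c for c in table if c != "Colors"]

# Equivalent of df.agg(...) unioned onto df: append a column-totals row.
table["Colors"].append("Total")
for c in value_cols:
    table[c].append(sum(table[c]))

# Equivalent of withColumn("Total", ...): sum across value columns per row.
table["Total"] = [sum(table[c][i] for c in value_cols)
                  for i in range(len(table["Colors"]))]

print(table["Total"])  # [50, 30, 20, 100]
```

Because the column list is derived from the data, both steps stay dynamic as columns and rows are added, which is what the Spark version achieves with df.columns[1:].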

1 Comment

A generator expression, (f.sum(f.col(c)).alias(c) for c in cols), would be more readable and consistent with the generator expression already passed to sum().
