
I wish to group by a column and then find the max of another column, and finally show all the columns for the rows that match this condition. However, when I use my code, it only shows 2 columns and not all of them.

# Normal way of creating DataFrames in PySpark
sdataframe_temp = spark.createDataFrame([
    (2, 2, '0-2'),
    (2, 23, '22-24')
], ['a', 'b', 'c'])

sdataframe_temp2 = spark.createDataFrame([
    (4, 6, '4-6'),
    (5, 7, '6-8')
], ['a', 'b', 'c'])

# Concatenate the two PySpark DataFrames
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b': 'max'})
sdataframe_union_1_2_g.show()

output:

+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+

Expected output:

+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+

1 Answer

You can use a Window function to make it work:

Method 1: Using Window function

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("a").orderBy(F.desc("b"))

(sdataframe_union_1_2
 .withColumn('max_val', F.row_number().over(w) == 1)
 .where("max_val == True")
 .drop("max_val")
 .show())

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  5|  7|  6-8|
|  2| 23|22-24|
|  4|  6|  4-6|
+---+---+-----+

Explanation

  1. Window functions are useful when we want to attach a new column to the existing set of columns.
  2. In this case, the window is partitioned by column a with partitionBy('a') and column b is sorted in descending order with F.desc("b"). This makes the first value of b in each group its maximum.
  3. Then we use F.row_number() to keep only the rows where the row number equals 1, i.e. the max value of each group (a tie-keeping variant is sketched after this list).
  4. Finally, we drop the new column since it is not needed after filtering the data frame.
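Note that F.row_number() keeps exactly one row per group, so if several rows share the same maximum b, only one of them survives. A minimal tie-keeping sketch, assuming the same sdataframe_union_1_2 DataFrame, attaches the group maximum with F.max over an unordered window and filters on it:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Unordered window per group: every row sees its group's max of b
w_max = Window.partitionBy("a")

(sdataframe_union_1_2
 .withColumn("b_max", F.max("b").over(w_max))   # attach the group max to each row
 .where(F.col("b") == F.col("b_max"))           # keep rows equal to the max (ties included)
 .drop("b_max")
 .show())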

Method 2: Using groupby + inner join

f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a', 'b'], how='inner').show()

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  2| 23|22-24|
|  5|  7|  6-8|
|  4|  6|  4-6|
+---+---+-----+
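If you want to keep the dictionary-style agg({'b': 'max'}) from the question, a minimal sketch of the same join approach (assuming the same sdataframe_union_1_2) renames the generated max(b) column so the join keys line up:

# Dictionary-style aggregation produces a column literally named "max(b)",
# so rename it to "b" before joining on ['a', 'b'].
f = (sdataframe_union_1_2
     .groupby('a')
     .agg({'b': 'max'})
     .withColumnRenamed('max(b)', 'b'))

sdataframe_union_1_2.join(f, on=['a', 'b'], how='inner').show()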

6 Comments

Can you explain a little about the method you used? I see that my groupby and agg functions are not used, yet it still gets the correct answer.
Thanks! But would it be possible to use my method and the window method at the same time? I wish to learn different ways of doing it, if that is possible. Thank you again.
That's good, you can do that with a few more steps. After you have the grouped df, you need to inner join it with sdataframe_union_1_2 to get the result.
I am so sorry, but could you include it in your answer above?
Done, please check. Do accept the answer if it helps :)