
I wish to group by a column and then find the max of another column, and finally show all the columns for the rows that match this condition. However, when I use my code, it only shows 2 columns and not all of them.

# Normal way of creating DataFrames in PySpark
sdataframe_temp = spark.createDataFrame([
    (2, 2, '0-2'),
    (2, 23, '22-24')
], ['a', 'b', 'c'])

sdataframe_temp2 = spark.createDataFrame([
    (4, 6, '4-6'),
    (5, 7, '6-8')
], ['a', 'b', 'c'])

# Concatenate the two PySpark DataFrames
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b': 'max'})
sdataframe_union_1_2_g.show()

output:

+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+

Expected output:

+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+

1 Answer

You can use a Window function to make it work:

Method 1: Using Window function

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("a").orderBy(F.desc("b"))

(sdataframe_union_1_2
 .withColumn('max_val', F.row_number().over(w) == 1)
 .where("max_val == True")
 .drop("max_val")
 .show())

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  5|  7|  6-8|
|  2| 23|22-24|
|  4|  6|  4-6|
+---+---+-----+

Explanation

  1. Window functions are useful when we want to attach a new column to the existing set of columns.
  2. In this case, the window is partitioned by column a with partitionBy('a') and column b is sorted in descending order with F.desc("b"). This makes the first value of b in each group its maximum.
  3. Then we use F.row_number() to keep only the rows where the row number equals 1, i.e. the max value of each group (a tie-keeping variant is sketched after this list).
  4. Finally, we drop the new column since it is not needed after filtering the data frame.
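Note that F.row_number() keeps exactly one row per group, so if several rows share the same maximum b, only one of them survives. A minimal tie-keeping sketch, assuming the same sdataframe_union_1_2 DataFrame, attaches the group maximum with F.max over an unordered window and filters on it:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Unordered window per group: every row sees its group's max of b
w_max = Window.partitionBy("a")

(sdataframe_union_1_2
 .withColumn("b_max", F.max("b").over(w_max))   # attach the group max to each row
 .where(F.col("b") == F.col("b_max"))           # keep rows equal to the max (ties included)
 .drop("b_max")
 .show())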

Method 2: Using groupby + inner join

f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a', 'b'], how='inner').show()

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  2| 23|22-24|
|  5|  7|  6-8|
|  4|  6|  4-6|
+---+---+-----+
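If you want to keep the dictionary-style agg({'b': 'max'}) from the question, a minimal sketch of the same join approach (assuming the same sdataframe_union_1_2) renames the generated max(b) column so the join keys line up:

# Dictionary-style aggregation produces a column literally named "max(b)",
# so rename it to "b" before joining on ['a', 'b'].
f = (sdataframe_union_1_2
     .groupby('a')
     .agg({'b': 'max'})
     .withColumnRenamed('max(b)', 'b'))

sdataframe_union_1_2.join(f, on=['a', 'b'], how='inner').show()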

6 Comments

Can you explain a little about the method you used? I see that my groupby and agg functions are not used, yet it still gets the correct answer.
Thanks! But would it be possible to use my method and the window method at the same time? I wish to learn different ways of doing it, if that is possible. Thank you again.
That's good, you can do that with a few more steps. After you have the grouped df, you need to inner join it with sdataframe_union_1_2 to get the result.
I am so sorry, but could you include it in your answer above?
Done, please check. Do accept the answer if it helps :)