I want to group by one column and find the maximum of another column, then show all the columns for the rows that match that maximum. However, my code only returns two columns instead of all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
    (2, 2, '0-2'),
    (2, 23, '22-24')],
    ['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
    (4, 6, '4-6'),
    (5, 7, '6-8')],
    ['a', 'b', 'c']
)

# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()

Output:
+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+

Expected output:

+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+