Using PySpark, I'd like to be able to group a Spark DataFrame, sort within each group, and then assign a row number. So
Group Date
A     2000
A     2002
A     2007
B     1999
B     2015

would become
Group Date row_num
A     2000 0
A     2002 1
A     2007 2
B     1999 0
B     2015 1
Use a window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
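For reference, the same window expression can also be written in Spark SQL. A minimal sketch, assuming a SparkSession named spark; the view name "tbl" is made up here:

# Same partition/order/row_number logic via Spark SQL; the view name
# "tbl" is arbitrary, and `Group` is backticked since GROUP is a SQL keyword.
df.createOrReplaceTempView("tbl")
spark.sql("""
    SELECT `Group`, Date,
           row_number() OVER (PARTITION BY `Group` ORDER BY Date) AS row_num
    FROM tbl
""")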
The accepted solution almost has it right. Here is the solution based on the output requested in the question:

df = spark.createDataFrame(
    [("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)],
    ["Group", "Date"],
)
df.show()

+-----+----+
|Group|Date|
+-----+----+
|    A|2000|
|    A|2002|
|    A|2007|
|    B|1999|
|    B|2015|
+-----+----+

# accepted solution above
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))).show()

# output of the accepted solution
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      1|
|    B|2015|      2|
|    A|2000|      1|
|    A|2002|      2|
|    A|2007|      3|
+-----+----+-------+

As you can see, row_number starts from 1, not 0, whereas the question asks for row_num to start from 0. A simple change fixes that:
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show() Output :
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      0|
|    B|2015|      1|
|    A|2000|      0|
|    A|2002|      1|
|    A|2007|      2|
+-----+----+-------+

You can then sort the "Group" column in whatever order you want. The above solution almost has it, but it is important to remember that row_number begins with 1, not 0.
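As a minimal sketch of that last sorting step (the choice of orderBy columns is an assumption about the desired final ordering, not something the question specifies):

# Sort the final result so the groups come back in order;
# assumes `df`, `Window`, and `row_number` from the snippets above.
result = df.withColumn(
    "row_num",
    row_number().over(Window.partitionBy("Group").orderBy("Date")) - 1,
)
result.orderBy("Group", "Date").show()

+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    A|2000|      0|
|    A|2002|      1|
|    A|2007|      2|
|    B|1999|      0|
|    B|2015|      1|
+-----+----+-------+

This matches the exact table the question asked for.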