26

Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So

Group Date A 2000 A 2002 A 2007 B 1999 B 2015 

Would become

Group Date row_num A 2000 0 A 2002 1 A 2007 2 B 1999 0 B 2015 1 
2
  • 1
    Why all the hate? Commented Aug 4, 2017 at 19:40
  • 4
    Unfortunately, there is the wrong impression that a question must include code tested by yourself (and didn't work), although according to the SO guidelines for asking, this is certainly not the case: stackoverflow.com/help/on-topic Commented Aug 4, 2017 at 22:08

2 Answers 2

38

Use window function:

from pyspark.sql.window import * from pyspark.sql.functions import row_number df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))) 
Sign up to request clarification or add additional context in comments.

3 Comments

Nice! I inserted a missing comma inside withColumn... :)
Welcome to SO and congrats for answering your first question! Keep on and don't get disappointed (it can be a harsh place occasionally...) - check also my edit to see how you can use code highlighting
@desertnaut can we do the ordering reserving the natural order of the dataframe rather than orderby ?
7

The accepted solution almost has it right. Here is the solution based on the output requested in the question:

df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"]) +-----+----+ |Group|Date| +-----+----+ | A|2000| | A|2002| | A|2007| | B|1999| | B|2015| +-----+----+ # accepted solution above from pyspark.sql.window import * from pyspark.sql.functions import row_number df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))) # accepted solution above output +-----+----+-------+ |Group|Date|row_num| +-----+----+-------+ | B|1999| 1| | B|2015| 2| | A|2000| 1| | A|2002| 2| | A|2007| 3| +-----+----+-------+ 

As you can see, the function row_number starts from 1 and not 0 and the requested question wanted to have the row_num starting from 0. Simple change like I have made below:

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show() 

Output :

+-----+----+-------+ |Group|Date|row_num| +-----+----+-------+ | B|1999| 0| | B|2015| 1| | A|2000| 0| | A|2002| 1| | A|2007| 2| +-----+----+-------+ 

Then you can sort the "Group" column in whatever order you want. The above solution almost has it but it is important to remember that row_number begins with 1 and not 0.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.