PySpark - get row number for each row in a group

Question

Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So

Group Date A 2000 A 2002 A 2007 B 1999 B 2015

Would become

Group Date row_num A 2000 0 A 2002 1 A 2007 2 B 1999 0 B 2015 1

Unfortunately, there is the wrong impression that a question must include code tested by yourself (and didn't work), although according to the SO guidelines for asking, this is certainly not the case: stackoverflow.com/help/on-topic — desertnaut
– desertnaut, Commented Aug 4, 2017 at 22:08

desertnaut · Accepted Answer · 2017-08-04 20:05:20Z

38

Use window function:

from pyspark.sql.window import * from pyspark.sql.functions import row_number df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

edited Aug 4, 2017 at 20:05

desertnaut

60.8k32 gold badges155 silver badges183 bronze badges

answered Aug 4, 2017 at 19:17

user8419108

3964 silver badges2 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

desertnaut Over a year ago

Nice! I inserted a missing comma inside withColumn... :)

desertnaut Over a year ago

Welcome to SO and congrats for answering your first question! Keep on and don't get disappointed (it can be a harsh place occasionally...) - check also my edit to see how you can use code highlighting

PolarBear10 Over a year ago

@desertnaut can we do the ordering reserving the natural order of the dataframe rather than orderby ?

Rahul P · Accepted Answer · 2018-07-24 19:34:41Z

The accepted solution almost has it right. Here is the solution based on the output requested in the question:

df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"]) +-----+----+ |Group|Date| +-----+----+ | A|2000| | A|2002| | A|2007| | B|1999| | B|2015| +-----+----+ # accepted solution above from pyspark.sql.window import * from pyspark.sql.functions import row_number df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))) # accepted solution above output +-----+----+-------+ |Group|Date|row_num| +-----+----+-------+ | B|1999| 1| | B|2015| 2| | A|2000| 1| | A|2002| 2| | A|2007| 3| +-----+----+-------+

As you can see, the function row_number starts from 1 and not 0 and the requested question wanted to have the row_num starting from 0. Simple change like I have made below:

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()

Output :

+-----+----+-------+ |Group|Date|row_num| +-----+----+-------+ | B|1999| 0| | B|2015| 1| | A|2000| 0| | A|2002| 1| | A|2007| 2| +-----+----+-------+

Then you can sort the "Group" column in whatever order you want. The above solution almost has it but it is important to remember that row_number begins with 1 and not 0.

Collectives™ on Stack Overflow

PySpark - get row number for each row in a group

2 Answers 2

3 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Linked

Related