
In PySpark, I have a DataFrame in this format:

CODE | TITLE | POSITION
A    | per   | 1
A    | eis   | 3
A    | fon   | 4
A    | dat   | 5
B    | jem   | 2
B    | neu   | 3
B    | tri   | 5
B    | nok   | 6

and I want to get this:

CODE | TITLE | POSITION
A    | per   | 1
A    | eis   | 2
A    | fon   | 3
A    | dat   | 4
B    | jem   | 1
B    | neu   | 2
B    | tri   | 3
B    | nok   | 4

The idea is that the POSITION column should start at 1 within each CODE and increase with no gaps. For example, for CODE A, it starts at 1 but position 2 is missing, so I need to map 3 => 2, 4 => 3, and 5 => 4.

How can I do that in PySpark?

Thank you for your help.


1 Answer


With a slightly simpler DataFrame:

df.show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
|   A|   AA|       1|
|   A|   BB|       3|
|   A|   CC|       4|
|   A|   DD|       5|
|   B|   EE|       2|
|   B|   FF|       3|
|   B|   GG|       5|
|   B|   HH|       6|
+----+-----+--------+

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df.withColumn('POSITION', row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))).show()
+----+-----+--------+
|CODE|TITLE|POSITION|
+----+-----+--------+
|   B|   EE|       1|
|   B|   FF|       2|
|   B|   GG|       3|
|   B|   HH|       4|
|   A|   AA|       1|
|   A|   BB|       2|
|   A|   CC|       3|
|   A|   DD|       4|
+----+-----+--------+
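For reference, the same dense renumbering can be sketched in plain Python, without a Spark session. This is only an illustration of what `row_number().over(Window.partitionBy('CODE').orderBy('POSITION'))` computes: sort by CODE and POSITION, then number the rows of each CODE group from 1 upward. The sample data here is the one from the question.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("A", "per", 1), ("A", "eis", 3), ("A", "fon", 4), ("A", "dat", 5),
    ("B", "jem", 2), ("B", "neu", 3), ("B", "tri", 5), ("B", "nok", 6),
]

# Sort by (CODE, POSITION), then renumber each CODE group starting at 1,
# mirroring row_number() over a window partitioned by CODE, ordered by POSITION.
result = []
for code, group in groupby(sorted(rows, key=itemgetter(0, 2)), key=itemgetter(0)):
    for new_pos, (c, title, _old_pos) in enumerate(group, start=1):
        result.append((c, title, new_pos))

print(result)
# [('A', 'per', 1), ('A', 'eis', 2), ('A', 'fon', 3), ('A', 'dat', 4),
#  ('B', 'jem', 1), ('B', 'neu', 2), ('B', 'tri', 3), ('B', 'nok', 4)]
```

Note that, like `row_number`, this assigns distinct consecutive numbers even when two rows tie on POSITION; if ties should share a number, `dense_rank` would be the Spark equivalent instead.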
