Assume I have the following DataFrame:

    dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)]
    df = sc.parallelize(dummy_data).toDF(['letter','number'])

And I want to create the following DataFrame:

    [('a',0),('b',2),('c',1),('d',3),('e',0)]

What I currently do is convert it to an RDD, apply the zipWithIndex function, and then join the results back:

    convertDF = (df.select('number')
                   .distinct()
                   .rdd
                   .zipWithIndex()
                   .map(lambda x: (x[0].number, x[1]))
                   .toDF(['old', 'new']))

    finalDF = (df
               .join(convertDF, df.number == convertDF.old)
               .select(df.letter, convertDF.new))

Is there a function similar to zipWithIndex for DataFrames? Is there a more efficient way to do this task?
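
To make the intended mapping concrete, here is a minimal plain-Python sketch (no Spark involved) of what the code above is meant to compute: every distinct number gets a unique index, and each row's number is replaced by that index. The helper name `reindex` is hypothetical and only for illustration. It assigns indices in first-seen order, which is an assumption of this sketch; the exact index each value receives may differ from my desired output above, and the only property I care about is that equal numbers share an index and different numbers do not.

```python
def reindex(rows):
    """Replace each row's number with a unique index per distinct number."""
    # Plays the role of distinct() + zipWithIndex(). Indices follow
    # first-seen order here (an assumption of this sketch).
    mapping = {}
    for _, number in rows:
        if number not in mapping:
            mapping[number] = len(mapping)
    # Plays the role of the join followed by select(letter, new).
    return [(letter, mapping[number]) for letter, number in rows]


dummy_data = [('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)]
result = reindex(dummy_data)
print(result)
```

Here 'a' and 'e' end up with the same index because both carry the number 1, while the other three letters each get their own.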