PySpark: Incremental Row Counter

Question

I am having difficulty implementing this existing answer: PySpark - get row number for each row in a group

Consider the following:

    # imports (assumes an active SparkSession `spark` and SparkContext `sc`)
    from pyspark.sql import Window, functions as F

    # create df
    df = spark.createDataFrame(sc.parallelize([
        [1, 'A', 20220722, 1],
        [1, 'A', 20220723, 1],
        [1, 'B', 20220724, 2],
        [2, 'B', 20220722, 1],
        [2, 'C', 20220723, 2],
        [2, 'B', 20220724, 3],
    ]), ['ID', 'State', 'Time', 'Expected'])

    # rank
    w = Window.partitionBy('State').orderBy('ID', 'Time')
    df = df.withColumn('rn', F.row_number().over(w))
    df = df.withColumn('rank', F.rank().over(w))
    df = df.withColumn('dense', F.dense_rank().over(w))

    # view
    df.show()

    +---+-----+--------+--------+---+----+-----+
    | ID|State|    Time|Expected| rn|rank|dense|
    +---+-----+--------+--------+---+----+-----+
    |  1|    A|20220722|       1|  1|   1|    1|
    |  1|    A|20220723|       1|  2|   2|    2|
    |  1|    B|20220724|       2|  1|   1|    1|
    |  2|    B|20220722|       1|  2|   2|    2|
    |  2|    B|20220724|       3|  3|   3|    3|
    |  2|    C|20220723|       2|  1|   1|    1|
    +---+-----+--------+--------+---+----+-----+

How can I get the expected value and also sort the dates correctly such that they are ascending?

How come the row number of A 20220723 is 1? It is supposed to be 2. Also, C 20220723 is supposed to be 1.
– pltc, commented Sep 22, 2022 at 2:33

samkart · Accepted Answer · 2022-09-22 05:53:11Z

You restart your count for each new id value, which means the id field is your partition field, not state.
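To see why the partition key matters, here is a small plain-Python sketch (no Spark needed) that groups the question's sample rows by each candidate key. The helper `partition_rows` is a made-up name for illustration; it only mimics conceptually what `partitionBy` does and is not part of PySpark.

```python
from collections import defaultdict

# Sample rows from the question: (ID, State, Time, Expected)
rows = [
    (1, 'A', 20220722, 1),
    (1, 'A', 20220723, 1),
    (1, 'B', 20220724, 2),
    (2, 'B', 20220722, 1),
    (2, 'C', 20220723, 2),
    (2, 'B', 20220724, 3),
]


def partition_rows(rows, key_index):
    """Group rows by the value at key_index (hypothetical helper, mimics partitionBy)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


by_state = partition_rows(rows, 1)
by_id = partition_rows(rows, 0)

# Partitioning by State puts rows from both IDs into the 'B' group,
# so a counter over that group cannot restart per ID.
print(sorted({r[0] for r in by_state['B']}))   # [1, 2]

# Partitioning by ID keeps each ID's rows together; the Expected
# column restarts at 1 inside each group.
print([r[3] for r in by_id[1]], [r[3] for r in by_id[2]])   # [1, 1, 2] [1, 2, 3]
```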

Here is an approach using the sum window function.

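    import sys
    import pyspark.sql.functions as func
    from pyspark.sql.window import Window as wd

    # data_sdf is the question's DataFrame (column names shown in lowercase).
    # st_notsame: 1 when the state differs from the previous row within the id
    #             (or there is no previous row), else 0.
    # rank: running sum of st_notsame within the id; -sys.maxsize as the start
    #       of the frame is treated as an unbounded lookback.
    data_sdf. \
        withColumn('st_notsame',
                   func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                                 func.lit(True)).cast('int')
                   ). \
        withColumn('rank',
                   func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time', 'state').rowsBetween(-sys.maxsize, 0))
                   ). \
        show()

    # +---+-----+--------+--------+----------+----+
    # | id|state|    time|expected|st_notsame|rank|
    # +---+-----+--------+--------+----------+----+
    # |  1|    A|20220722|       1|         1|   1|
    # |  1|    A|20220723|       1|         0|   1|
    # |  1|    B|20220724|       2|         1|   2|
    # |  2|    B|20220722|       1|         1|   1|
    # |  2|    C|20220723|       2|         1|   2|
    # |  2|    B|20220724|       3|         1|   3|
    # +---+-----+--------+--------+----------+----+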

First, flag each row whose state repeats the previous row's state (within the same id, ordered by time) as 0, and every other row as 1. This lets you compute a running sum.
Then use the sum window with an unbounded lookback for each id to get your desired ranking.
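The two steps above can be checked on the sample rows with a plain-Python sketch that needs no Spark session. It is only an illustration of the logic, not the PySpark code itself; the function name `incremental_rank` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (id, state, time, expected)
rows = [
    (1, 'A', 20220722, 1),
    (1, 'A', 20220723, 1),
    (1, 'B', 20220724, 2),
    (2, 'B', 20220722, 1),
    (2, 'C', 20220723, 2),
    (2, 'B', 20220724, 3),
]


def incremental_rank(rows):
    """Return each row extended with (st_notsame, rank).

    Hypothetical helper that mirrors the two window steps in plain Python.
    """
    out = []
    # Sort by id, then time, so groupby sees each id's rows in time order
    ordered = sorted(rows, key=itemgetter(0, 2))
    for _, group in groupby(ordered, key=itemgetter(0)):
        prev_state = None
        running = 0
        for id_, state, time, expected in group:
            # Step 1: flag is 1 when there is no previous row in this id
            # or the state differs from the previous row, else 0
            flag = int(prev_state is None or state != prev_state)
            # Step 2: running sum of the flags within the id
            running += flag
            out.append((id_, state, time, expected, flag, running))
            prev_state = state
    return out


result = incremental_rank(rows)
for row in result:
    print(row)
```

On the sample data, the last element of each tuple matches the Expected column, and the counter goes back to 1 when the id changes because the running sum is reset per group.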
