Pandas DataFrame GroupBy Rank

Question

DataFrame:

 account_id plan_id policy_group_nbr plan_type split_eff_date splits 0 470804 8739131 Conversion732 Onsite Medical Center 1/19/2022 Bob Smith (28.2) | John Doe (35.9) | A... 1 470804 8739131 Conversion732 Onsite Medical Center 1/21/2022 Bob Smith (19.2) | John Doe (34.6) | A... 2 470809 2644045 801790 401(k) 1/18/2022 Jim Jones (100) 3 470809 2644045 801790 401(k) 1/5/2022 Other Name (50) | Jim Jones (50) 4 470809 2738854 801789 401(k) 1/18/2022 Jim Jones (100) ... ... ... ... ... ... ... 1720 3848482 18026734 24794 Accident 1/20/2022 Bill Underwood (50) | Jim Jones (50) 1721 3848482 18026781 BCSC FSA Admin 1/20/2022 Bill Underwood (50) | Jim Jones (50) 1722 3927880 19602958 Consulting Other 1/20/2022 Bill Brown (50) | Tim Scott (50) 1723 3927880 19863300 Producer Expense 5500 Filing 1/20/2022 Bill Brown (50) | Tim Scott (50) 1724 3927880 19863300 Producer Expense 5500 Filing 1/21/2022 Bill Brown (50) | Tim Scott (50)

I need to group by (account_id, plan_id, policy_group_nbr, plan_type) sorted by split_eff_date (desc), in order to remove all rows for the group but the most recent date while maintaining all columns. I can get a rank however, when attempting to pass an argument to the lambda function, I'm receiving a TypeError.

working as expected:

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank())

TypeError: incompatible index of inserted column with frame index

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank())

passing the axis argument didn't seem to help either... is this a simple syntax issue, or am I not understanding the function properly?

mechanical_meat · Accepted Answer · 2022-02-05 19:04:39Z

easier -- and typically faster -- to do this with .transform().

easier because when you sort descending, the index doesn't match when you try to assign back to original DataFrame. i tried not using an index in the .groupby(), but wasn't able to get that working.

link to documentation about .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html

i recommend using .transform() like this, and be sure to supply ascending=False kwarg to .rank() as well:

df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[ "split_eff_date" ].transform( lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first") )

result with both kinds of ranking -- i took just the first 5 rows from your sample data:

In [93]: df Out[93]: account_id plan_id policy_group_nbr plan_type split_eff_date rank rank2 3 470809 2644045 801790 401(k) 2022-01-05 1.0 2.0 2 470809 2644045 801790 401(k) 2022-01-18 2.0 1.0 4 470809 2738854 801789 401(k) 2022-01-18 1.0 1.0 0 470804 8739131 Conversion732 Onsite Medical Center 2022-01-19 1.0 2.0 1 470804 8739131 Conversion732 Onsite Medical Center 2022-01-21 2.0 1.0

Collectives™ on Stack Overflow

Pandas DataFrame GroupBy Rank

1 Answer 1

Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Related