2

DataFrame:

 account_id plan_id policy_group_nbr plan_type split_eff_date splits 0 470804 8739131 Conversion732 Onsite Medical Center 1/19/2022 Bob Smith (28.2) | John Doe (35.9) | A... 1 470804 8739131 Conversion732 Onsite Medical Center 1/21/2022 Bob Smith (19.2) | John Doe (34.6) | A... 2 470809 2644045 801790 401(k) 1/18/2022 Jim Jones (100) 3 470809 2644045 801790 401(k) 1/5/2022 Other Name (50) | Jim Jones (50) 4 470809 2738854 801789 401(k) 1/18/2022 Jim Jones (100) ... ... ... ... ... ... ... 1720 3848482 18026734 24794 Accident 1/20/2022 Bill Underwood (50) | Jim Jones (50) 1721 3848482 18026781 BCSC FSA Admin 1/20/2022 Bill Underwood (50) | Jim Jones (50) 1722 3927880 19602958 Consulting Other 1/20/2022 Bill Brown (50) | Tim Scott (50) 1723 3927880 19863300 Producer Expense 5500 Filing 1/20/2022 Bill Brown (50) | Tim Scott (50) 1724 3927880 19863300 Producer Expense 5500 Filing 1/21/2022 Bill Brown (50) | Tim Scott (50) 

I need to group by (account_id, plan_id, policy_group_nbr, plan_type) sorted by split_eff_date (desc), in order to remove all rows for the group but the most recent date while maintaining all columns. I can get a rank however, when attempting to pass an argument to the lambda function, I'm receiving a TypeError.

working as expected:

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank()) 

TypeError: incompatible index of inserted column with frame index

splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank()) 

passing the axis argument didn't seem to help either... is this a simple syntax issue, or am I not understanding the function properly?

1 Answer 1

1

easier -- and typically faster -- to do this with .transform().

easier because when you sort descending, the index doesn't match when you try to assign back to original DataFrame. i tried not using an index in the .groupby(), but wasn't able to get that working.

link to documentation about .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html

i recommend using .transform() like this, and be sure to supply ascending=False kwarg to .rank() as well:

df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[ "split_eff_date" ].transform( lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first") ) 

result with both kinds of ranking -- i took just the first 5 rows from your sample data:

In [93]: df Out[93]: account_id plan_id policy_group_nbr plan_type split_eff_date rank rank2 3 470809 2644045 801790 401(k) 2022-01-05 1.0 2.0 2 470809 2644045 801790 401(k) 2022-01-18 2.0 1.0 4 470809 2738854 801789 401(k) 2022-01-18 1.0 1.0 0 470804 8739131 Conversion732 Onsite Medical Center 2022-01-19 1.0 2.0 1 470804 8739131 Conversion732 Onsite Medical Center 2022-01-21 2.0 1.0 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.