62

Given the following dataframe

In [31]: rand = np.random.RandomState(1)
    ...: df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
    ...:                    'B': rand.randn(6),
    ...:                    'C': rand.rand(6) > .5})

In [32]: df
Out[32]:
     A         B      C
0  foo  1.624345  False
1  bar -0.611756   True
2  baz -0.528172  False
3  foo -1.072969   True
4  bar  0.865408  False
5  baz -2.301539   True

I would like to sort it in groups (A) by the aggregated sum of B, and then by the value in C (not aggregated). So basically get the order of the A groups with

In [28]: df.groupby('A').sum().sort('B')
Out[28]:
            B  C
A
baz -2.829710  1
bar  0.253651  1
foo  0.551377  1

And then by True/False, so that it ultimately looks like this:

In [30]: df.ix[[5, 2, 1, 4, 3, 0]]
Out[30]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False

How can this be done?

4 Answers

64

Groupby A:

In [0]: grp = df.groupby('A') 

Within each group, sum over B and broadcast the values using transform. Then sort by B:

In [1]: grp[['B']].transform(sum).sort('B')
Out[1]:
          B
2 -2.829710
5 -2.829710
1  0.253651
4  0.253651
0  0.551377
3  0.551377

Index the original df by passing the index from above. This will re-order the A values by the aggregate sum of the B values:

In [2]: sort1 = df.ix[grp[['B']].transform(sum).sort('B').index]

In [3]: sort1
Out[3]:
     A         B      C
2  baz -0.528172  False
5  baz -2.301539   True
1  bar -0.611756   True
4  bar  0.865408  False
0  foo  1.624345  False
3  foo -1.072969   True

Finally, sort the 'C' values within groups of 'A' using the sort=False option to preserve the A sort order from step 1:

In [4]: f = lambda x: x.sort('C', ascending=False)

In [5]: sort2 = sort1.groupby('A', sort=False).apply(f)

In [6]: sort2
Out[6]:
         A         B      C
A
baz 5  baz -2.301539   True
    2  baz -0.528172  False
bar 1  bar -0.611756   True
    4  bar  0.865408  False
foo 3  foo -1.072969   True
    0  foo  1.624345  False

Clean up the df index by using reset_index with drop=True:

In [7]: sort2.reset_index(0, drop=True)
Out[7]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
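
For readers on a current pandas release, where `DataFrame.sort` and `.ix` are no longer available, here is a minimal sketch of the same idea. It is not the answer's exact code: it swaps the `groupby(...).apply(...)` step for two stable sorts (first by C, then by the broadcast group sum), and it hard-codes the B and C values shown in the question's output instead of regenerating them. The names `by_c`, `group_sum`, and `result` are made up for illustration.

```python
import pandas as pd

# Data copied from the question's printed output (assumption: rounded values).
df = pd.DataFrame({
    "A": ["foo", "bar", "baz"] * 2,
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

# Step 1: order rows by C, True first. A stable sort keeps ties in place.
by_c = df.sort_values("C", ascending=False, kind="mergesort")

# Step 2: broadcast each group's sum of B onto its rows.
group_sum = by_c.groupby("A")["B"].transform("sum")

# Step 3: stable-sort by the group sum. Rows of one group share the same
# sum, so their relative order from step 1 (True before False) survives.
result = by_c.loc[group_sum.sort_values(kind="mergesort").index]

print(result)
```

Because the original row labels are kept throughout, no `reset_index` cleanup is needed at the end.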

4 Comments

Also, I assumed that groupby's sort=False flag would return an arbitrary, not necessarily sorted order (I guess I was associating them with python dictionaries for some reason). But this answer implies that the flag is guaranteed to preserve the original order of the dataframe rows?
I'm 99% sure it preserves the order of the groups as they first appear. I don't have any code to back this up, but some quick testing confirms this intuition.
Thanks @Zelazny7 for this answer. It is exactly what I want. However, it seems that in the latest pandas package, to achieve the same Out[7], inplace=True should be added to the arguments in In [7].
Adding more information: sort() is now deprecated. It is advisable to use DataFrame.sort_values() instead.
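
The group-order question raised in these comments can be checked quickly. The sketch below uses a small made-up frame (`frame`, `key`, and `val` are hypothetical names, not from the thread) and compares the group keys returned with and without `sort=False`.

```python
import pandas as pd

# Hypothetical toy data: keys first appear in the order b, a, c.
frame = pd.DataFrame({
    "key": ["b", "a", "b", "c", "a"],
    "val": [1, 2, 3, 4, 5],
})

# sort=False: groups come back in order of first appearance.
first_seen = list(frame.groupby("key", sort=False)["val"].sum().index)

# Default sort=True: groups come back sorted by key.
sorted_keys = list(frame.groupby("key")["val"].sum().index)

print(first_seen)   # ['b', 'a', 'c']
print(sorted_keys)  # ['a', 'b', 'c']
```

This matches the behaviour described in the second comment: the flag affects the order of the groups, not the order of rows inside each group.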
30

Here's a more concise approach...

df['a_bsum'] = df.groupby('A')['B'].transform(sum)
df.sort(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1)

The first line adds a column to the data frame with the groupwise sum. The second line performs the sort and then removes the extra column.

Result:

     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False

NOTE: sort is deprecated, use sort_values instead
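
A possible `sort_values` version of the two lines above, written as one chain so that `df` itself never gains the helper column. This is a sketch, not the answer's code: it uses `assign` instead of column assignment, and the B and C values are copied from the question's printed output.

```python
import pandas as pd

# Data copied from the question's printed output (assumption: rounded values).
df = pd.DataFrame({
    "A": ["foo", "bar", "baz"] * 2,
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

result = (
    # Temporary column holding each row's group sum of B.
    df.assign(a_bsum=df.groupby("A")["B"].transform("sum"))
      # Group sum ascending, then C descending (True before False).
      .sort_values(["a_bsum", "C"], ascending=[True, False])
      # Remove the helper column from the returned frame.
      .drop(columns="a_bsum")
)

print(result)
```

Each method in the chain returns a new frame, so the sorted data lives in `result` and `df` keeps its original three columns and row order.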

2 Comments

With sort_values, the last operation does not drop the column from df itself. That happens because the default is inplace=False, so specifying inplace=True will also do the job. An alternative is to run df.drop('a_bsum', axis=1, inplace=True) afterwards.
Alternatively, assigning the result back to the variable df works as well: df = df.sort_values(['a_bsum','C'], ascending=[True, False]).drop('a_bsum', axis=1).
9

One way to do this is to insert a dummy column with the sums in order to sort:

In [10]: sum_B_over_A = df.groupby('A').sum().B

In [11]: sum_B_over_A
Out[11]:
A
bar    0.253652
baz   -2.829711
foo    0.551376
Name: B

In [12]: df['sum_B_over_A'] = df.A.apply(sum_B_over_A.get_value)

In [13]: df
Out[13]:
     A         B      C  sum_B_over_A
0  foo  1.624345  False      0.551376
1  bar -0.611756   True      0.253652
2  baz -0.528172  False     -2.829711
3  foo -1.072969   True      0.551376
4  bar  0.865408  False      0.253652
5  baz -2.301539   True     -2.829711

In [14]: df.sort(['sum_B_over_A', 'A', 'B'])
Out[14]:
     A         B      C  sum_B_over_A
5  baz -2.301539   True     -2.829711
2  baz -0.528172  False     -2.829711
1  bar -0.611756   True      0.253652
4  bar  0.865408  False      0.253652
3  foo -1.072969   True      0.551376
0  foo  1.624345  False      0.551376

and maybe you would then drop the dummy column:

In [15]: df.sort(['sum_B_over_A', 'A', 'B']).drop('sum_B_over_A', axis=1)
Out[15]:
     A         B      C
5  baz -2.301539   True
2  baz -0.528172  False
1  bar -0.611756   True
4  bar  0.865408  False
3  foo -1.072969   True
0  foo  1.624345  False
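
On recent pandas releases neither `Series.get_value` nor `DataFrame.sort` is available, so the dummy-column idea needs a small adaptation. The sketch below assumes `Series.map` as a stand-in for the `apply(sum_B_over_A.get_value)` lookup and `sort_values` for the sort; the data is copied from the question's printed output and `with_sum` is a made-up name.

```python
import pandas as pd

# Data copied from the question's printed output (assumption: rounded values).
df = pd.DataFrame({
    "A": ["foo", "bar", "baz"] * 2,
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

# Per-group sum of B, indexed by the values of A.
sum_B_over_A = df.groupby("A")["B"].sum()

# Look up each row's group sum; mapping with a Series matches on its index.
with_sum = df.assign(sum_B_over_A=df["A"].map(sum_B_over_A))

# Same sort keys as In [14], then drop the dummy column as in In [15].
result = (
    with_sum.sort_values(["sum_B_over_A", "A", "B"])
            .drop(columns="sum_B_over_A")
)

print(result)
```

Note that, like the answer, this sorts by B inside each group and not by C; the two orders only happen to coincide for this data.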

5 Comments

I'm sure I've seen some clever way to do this here (essentially allowing a key to sort), but I can't seem to find it.
Glad to know there's a better way to do df.A.map(dict(zip(sum_B_over_A.index, sum_B_over_A))) :) (should be get_value, no?). Also didn't know about column-wise drops, thanks a lot. (though I kinda prefer the version w/out the dummy column for some reason)
@BirdJaguarIV whoops typo :). Yes, it does seem silly using a dummy (tbh I could've been more clever with my apply [12] to do it in one, and it may well be more efficient, but I decided I wouldn't like to be the person reading it...). Like I say, I think there is a clever way to do this kind of complex sort :s
You didn't sort by column C.
@MarkByers you can append 'C' to the list of columns to sort by, so it's: df.sort(['sum_B_over_A', 'A', 'B', 'C'])... I should really add link to the sort docs.
0

The question is difficult to understand. However, you can group by A, sum B, and then sort the sums in descending order, so that the sort order of the A values depends on B. You can then use the sorted A values to filter and reorder the original dataframe into a new one.

import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar', 'baz'] * 2,
                   'B': rand.randn(6),
                   'C': rand.rand(6) > .5})

# Sum B within each A group, largest sum first
grouped = df.groupby('A')['B'].sum().sort_values(ascending=False)
print(grouped)
print(grouped.index.get_level_values(0))

Output:

A
foo    0.551377
bar    0.253651
baz   -2.829710
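
The answer stops at the ordered group labels. One way the reordering step it describes might be finished is sketched below; this is an assumption about the intent, not the answer's code. It turns the sorted index into a rank per label and uses a stable sort on that rank. The names `rank` and `result` are made up, the data is copied from the question's printed output, and the order is descending by group sum, as in this answer, which is the reverse of what the question asks for.

```python
import pandas as pd

# Data copied from the question's printed output (assumption: rounded values).
df = pd.DataFrame({
    "A": ["foo", "bar", "baz"] * 2,
    "B": [1.624345, -0.611756, -0.528172, -1.072969, 0.865408, -2.301539],
    "C": [False, True, False, True, False, True],
})

# Group sums of B, largest first, as in the answer.
grouped = df.groupby("A")["B"].sum().sort_values(ascending=False)

# Position of each A label in that ordering.
rank = {label: pos for pos, label in enumerate(grouped.index)}

# Stable sort on the rank keeps rows of one group in their original order.
result = df.loc[df["A"].map(rank).sort_values(kind="mergesort").index]

print(result)
```

Passing `ascending=True` when building `grouped` would give the baz, bar, foo group order from the question instead.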
