Conditional iteration of key,value in DataFrameGroupBy

Question

I have a pandas (v 0.12) dataframe data in Python (2.7). I groupby() with respect to the A and B columns in data to form the groups object, which is of type <class 'pandas.core.groupby.DataFrameGroupBy'>.

I want to loop through the groups and apply a function to each dataframe that has more than one row. My code is below; each dataframe is the value in a key, value pair:

import pandas as pd
groups = data.groupby(['A','B'])
len(groups)
>> 196320   # too large - will be slow to iterate through all
for key, value in groups:
    if len(value)>1:
        print(value)

Since I am only interested in applying the function to values where len(value)>1, is it possible to save time by building this condition into the iteration, so that the loop visits only the key-value pairs that satisfy it? I can do something like the following to get the size of each value, but I am not sure how to combine this aggregation with the original groups object.

size_values = data.groupby(['A','B']).agg({'C' : [np.size]})

I am hoping the question is clear, please let me know if any clarification is needed.

Primer · Accepted Answer · 2015-02-20 11:52:48Z


You could assign the length of each group back to a column and filter by its value:

data['count'] = data.groupby(['A','B'],as_index=False)['A'].transform(np.size)

After that you could:

data[data['count'] > 1].groupby(['A','B']).apply(your_function)

Or just skip assignment if it is a one time operation:

 data[data.groupby(['A','B'],as_index=False)['A'].transform(np.size) > 1].groupby(['A','B']).apply(your_function)
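For illustration, here is a minimal, self-contained sketch of the same idea on toy data. The frame contents, the column C values, and the body of your_function are assumptions made up for this example; a plain loop over the filtered groups stands in for .apply() so the result is easy to inspect.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): groups (1, 'x') and (2, 'y') have two rows each,
# groups (2, 'z') and (3, 'z') have a single row.
data = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3],
    "B": ["x", "x", "y", "y", "z", "z"],
    "C": [10, 20, 30, 40, 50, 60],
})

# Size of the group each row belongs to, aligned with data's index.
group_size = data.groupby(["A", "B"])["A"].transform(np.size)

# Keep only rows from groups with more than one row, then regroup.
multi = data[group_size > 1]
groups = multi.groupby(["A", "B"])


def your_function(frame):
    # Hypothetical stand-in for the real per-group function.
    return int(frame["C"].sum())


# Only the groups that passed the size filter are visited here.
results = {key: your_function(frame) for key, frame in groups}
print(results)
```

The filter runs once over the whole frame, so the loop afterwards never touches the single-row groups.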

edited Feb 20, 2015 at 11:52

answered Feb 20, 2015 at 11:09


4 Comments

Zhubarb Over a year ago

Thank you, but the first line on its own I think runs longer than my for loop. I did not realise I could use .apply() on the DataFrameGroupBy though, so maybe that would speed things up (as compared to my clumsy for loop)

Primer Over a year ago

.transform is usually pretty fast and combined with np.size it is unlikely it will be slower than your function.

Zhubarb Over a year ago

so it is .transform(np.size), not .transform('count')

Primer Over a year ago

Correct, you can use whatever is faster as long as it returns a scalar to transform.
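One caveat on the size versus count choice discussed in these comments, sketched on a made-up frame: np.size counts every row in a group, while 'count' counts only non-missing values, so the two can differ when the transformed column contains NaN. The column names and values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group (1, 'x') has two rows, one of them NaN in C.
df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["x", "x", "y"],
    "C": [1.0, np.nan, 3.0],
})

grouped_c = df.groupby(["A", "B"])["C"]

by_size = grouped_c.transform(np.size)    # rows per group, NaN included
by_count = grouped_c.transform("count")   # non-NaN values per group

print(by_size.tolist())
print(by_count.tolist())
```

When the goal is to filter on the number of rows in a group, a size-based transform is the safer choice of the two.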
