2

So i have a dataset about customers in a store and the sales of that store on each day.

Which looks something like this -

Store ID Sales Customers 1 250 500 2 276 786 3 124 256 5 164 925 

How do i convert it to grouped data, something like this

Sales Customers 0-100 0 100-200 1181 200-300 1286 

I have searched for a while and found this from the pandas site - http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html

df2.groupby(['X'], sort=True).sum()

But i am unable to understand how to apply the same to my example.

3 Answers 3

1

Use cut for bins and then groupby and aggregate sum:

df = df.groupby(pd.cut(df['Sales'], [0,100,200,300]))['Customers'].sum().fillna(0) print (df) Sales (0, 100] 0.0 (100, 200] 1181.0 (200, 300] 1286.0 Name: Customers, dtype: float64 

Also is possible define labels:

l =['0-100','100-200','200-300'] b = [0,100,200,300] df = df.groupby(pd.cut(df['Sales'], bins=b, labels=l))['Customers'].sum() .fillna(0) .reset_index() print (df) Sales Customers 0 0-100 0.0 1 100-200 1181.0 2 200-300 1286.0 
Sign up to request clarification or add additional context in comments.

Comments

1

Use pd.cut with cusom bins

In [2058]: df.groupby(pd.cut(df.Sales, [0, 100, 200, 300])).Customers.sum().fillna(0) Out[2058]: Sales (0, 100] 0.0 (100, 200] 1181.0 (200, 300] 1286.0 Name: Customers, dtype: float64 

And reset_index() to reshape

In [2061]: (df.groupby(pd.cut(df.Sales, [0, 100, 200, 300])).Customers.sum() .fillna(0).reset_index()) Out[2061]: Sales Customers 0 (0, 100] 0.0 1 (100, 200] 1181.0 2 (200, 300] 1286.0 

And, use labels for custom names

In [2062]: (df.groupby(pd.cut(df.Sales, bins=[0, 100, 200, 300], labels=['0-100', '100-200', '200-300'])) .Customers.sum().fillna(0).reset_index()) Out[2062]: Sales Customers 0 0-100 0.0 1 100-200 1181.0 2 200-300 1286.0 

Comments

1

with np.searchsorted

labels = {0: '0-100', 1: '100-200', 2: '200-300'} s = np.searchsorted([100, 200], df.Sales) df.groupby(s).Customers.sum().reindex( pd.Index([0, 1, 2], name='Sales'), fill_value=0 ).rename(labels).reset_index() Sales Customers 0 0-100 0 1 100-200 1181 2 200-300 1286 

2 Comments

Can I ask something? I use your solution for timings and it seems wrong output here. And I have no idea what is wrong. Can you check it?
Your functions don't accept a parameter. Therefore data is always the same. Use this for stmt parameter: '{}(df)'.format(j). And place data in the signature of each function you define. dropna(data) and notnull(data) and query(data)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.