
I am looking to compare a group of data to the rolled-up aggregate of that data.

In the example below, I want to know how much money each restaurant makes as compared to the total for all restaurants. I want to know this by day. If a restaurant is closed that day, I still want to return the name of the day and the name of the restaurant with a NaN (or a zero) for the total bill column.

(I know that there are other ways to do this, but please consider this a question on merging, as there is a reason it needs to be done with two different dataframes)

So, the output I would want is:

day   total_bill_x  restaurant  total_bill_y
Fri         651.76     DINER A        325.88
Sat       3,556.80     DINER A      1,778.40
Sun       1,627.16     DINER A           NaN
Thur      2,192.66     DINER A      1,096.33

But the furthest I can get returns something like the following, due to the nature of how outer joins work:

day   total_bill_x  restaurant  total_bill_y
Fri         651.76     DINER A        325.88
Sat       3,556.80     DINER A      1,778.40
Sun       1,627.16         NaN           NaN
Thur      2,192.66     DINER A      1,096.33

The problem is that there is no merge I can think of that would support this type of output: DINER A just disappears when there isn't a match.

It gets worse with my real data set, which will consist of dozens of restaurants.

Is it possible to do an outer join that would grab fields from the table being joined to if there isn't a match? How can I see all of the records from the larger table for the records in the smaller table where there isn't a match?

I think it is an interesting problem and I am definitely interested in how others would handle it. Thanks!

Sample Code Follows:

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv", sep=',')
df2 = pd.read_csv("https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv", sep=',')

df = df[df['day'] != "Sun"]
df['restaurant'] = "DINER A"

# df.append was removed in pandas 2.0; pd.concat gives the same result
df3 = pd.concat([df, df2])

df_output = df.groupby(['restaurant', 'day'])[['total_bill']].sum().reset_index()
df_output2 = df3.groupby(['day'])[['total_bill']].sum().reset_index()

pd.merge(df_output2, df_output, on='day', how="outer")
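For reference, a minimal sketch of one workaround I could imagine, assuming the df_output and df_output2 frames built above: build the full set of restaurant/day pairs first, then left-merge both aggregates onto that scaffold ("scaffold" is just an illustrative name, and I am not claiming this is the only way):

# Sketch only, assuming df_output (per restaurant/day) and df_output2
# (per day) from the sample code above; 'scaffold' is an illustrative name.
scaffold = pd.MultiIndex.from_product(
    [df_output['restaurant'].unique(), df_output2['day'].unique()],
    names=['restaurant', 'day']
).to_frame(index=False)

(scaffold
 .merge(df_output2, on='day', how='left')                             # per-day totals
 .merge(df_output, on=['restaurant', 'day'], how='left',
        suffixes=('_x', '_y')))                                       # per-restaurant totals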

2 Answers


setup

df = pd.read_csv(
    "https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv",
    sep=',')

d1 = pd.concat([df.query('day != "Sun"'), df], keys=['DINER A', 'DINER B']) \
       .rename_axis(['restaurant', None]).reset_index('restaurant')

build a pd.MultiIndex with all restaurants and days

mux = pd.MultiIndex.from_product([
    d1.restaurant.unique(),
    d1.day.unique()
], names=['restaurant', 'day'])

do groupby + join + reindex

d2 = d1.groupby(['day']).total_bill.sum()
d3 = d1.groupby(['restaurant', 'day'])[['total_bill']].sum()

d3.reindex(mux).join(d2, lsuffix='_x', rsuffix='_y').reset_index()

  restaurant   day  total_bill_x  total_bill_y
0    DINER A   Sat       1778.40       3556.80
1    DINER A  Thur       1096.33       2192.66
2    DINER A   Fri        325.88        651.76
3    DINER A   Sun           NaN       1627.16
4    DINER B   Sat       1778.40       3556.80
5    DINER B  Thur       1096.33       2192.66
6    DINER B   Fri        325.88        651.76
7    DINER B   Sun       1627.16       1627.16
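If a zero is preferred over NaN for closed days (the question allows either), reindex accepts a fill_value; a small variation on the line above, sketched under the same d2/d3/mux setup:

# Variation (sketch): fill missing restaurant/day pairs with 0 instead of NaN
d3.reindex(mux, fill_value=0).join(d2, lsuffix='_x', rsuffix='_y').reset_index()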

1 Comment

This is interesting - thank you, but one question - total_bill_y should always be total per day, but it is NaN in this instance for Diner A, Sunday. Any thoughts?

You can do this in two steps. First, map the total bill values for restaurant A onto df_output2:

df_output2['DINER_A'] = df_output2['day'].map(df_output.set_index('day')['total_bill']) 

gives you

    day  total_bill  DINER_A
0   Fri      651.76   325.88
1   Sat     3556.80  1778.40
2   Sun     1627.16      NaN
3  Thur     2192.66  1096.33

Now melt the DINER_A column to create two columns, restaurant and total_bill_A:

df_output2 = pd.melt(df_output2, id_vars=['day', 'total_bill'],
                     var_name="restaurant", value_name="total_bill_A")

You get

    day  total_bill restaurant  total_bill_A
0   Fri      651.76    DINER_A        325.88
1   Sat     3556.80    DINER_A       1778.40
2   Sun     1627.16    DINER_A           NaN
3  Thur     2192.66    DINER_A       1096.33
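With dozens of restaurants (as in the question's real data set), the same map/melt idea can be sketched with a pivot instead of one map per restaurant. This assumes a tidy frame like d1 from the first answer's setup, with restaurant, day, and total_bill columns; it is an illustration, not this answer's own code:

import pandas as pd

# Sketch only: generalizing the map + melt idea to many restaurants.
# Assumes a tidy frame `d1` with 'restaurant', 'day' and 'total_bill'
# columns, e.g. the `d1` built in the first answer's setup.
wide = d1.pivot_table(index='day', columns='restaurant',
                      values='total_bill', aggfunc='sum')
wide['total_bill'] = wide.sum(axis=1)  # per-day total across all restaurants

out = wide.reset_index().melt(id_vars=['day', 'total_bill'],
                              var_name='restaurant',
                              value_name='total_bill_rest')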

