4

I am still quite new to joining/merging data in Pandas, so would therefore very much appreciate any help to do the following operation. I have the following three SQL tables (converted to DataFrames) of data:

df1
Out[14]:
---- fruit price qty
2010 apple 1.0 2.0
2011 apple 3.0 4.0
2010 banana 0.5 1.5
2011 banana 7.0 8.0

df2
Out[15]:
---- fruit weight
2010 apple 10
2010 banana 12

df3
Out[16]:
-- fruit colour
0 apple red
1 banana yellow

Where df2 has the same fruits as df1, but not the same years (I'm almost completely sure that df2's years are a subset of df1, although it would be nice to find a method, that allows for years in df2 that aren't included in df1). Df3 is a table with characters for all the fruits contained in df2 and df1. I would like to merge the three tables together, so each row in the new combined DataFrame has year, fruit, price, qty, weight (possibly NaN) and colour. I am not sure if such a data structure would be best contained in a Panel or a DataFrame - inputs on this are also very welcome. Thanks!

2 Answers 2

3

To ensure there isn't an issue with the years, I would first reset_index:

In [11]: df1.index.name = 'year' In [12]: df2.index.name = 'year' In [13]: df1.reset_index(inplace=True) In [14]: df2.reset_index(inplace=True) In [15]: df1 Out[15]: year fruit price qty 0 2010 apple 1.0 2.0 1 2011 apple 3.0 4.0 2 2010 banana 0.5 1.5 3 2011 banana 7.0 8.0 [4 rows x 4 columns] In [16]: df2 Out[16]: year fruit weight 0 2010 apple 10 1 2010 banana 12 [2 rows x 3 columns] 

Now you can get your result by merging (twice):

In [17]: df1.merge(df2, how='left').merge(df3, how='left') Out[17]: year fruit price qty weight colour 0 2010 apple 1.0 2.0 10 red 1 2011 apple 3.0 4.0 NaN red 2 2010 banana 0.5 1.5 12 yellow 3 2011 banana 7.0 8.0 NaN yellow [4 rows x 6 columns] 

If you were confident that there was only one weight of fruit (i.e. independent of the year) you could just drop the year column from df2:

In [18]: del df2['year'] In [19]: df1.merge(df2, how='left').merge(df3, how='left') Out[19]: year fruit price qty weight colour 0 2010 apple 1.0 2.0 10 red 1 2011 apple 3.0 4.0 10 red 2 2010 banana 0.5 1.5 12 yellow 3 2011 banana 7.0 8.0 12 yellow [4 rows x 6 columns] 

Otherwise you could do a groupby and ffill.

Sign up to request clarification or add additional context in comments.

Comments

0

At first all JOIN may be performed in SQL - and it will be faster.

If you still want do it only in python use pandas.join:

import pandas as pd df_1_2_joined = pd.join(df1,df2, on='fruit', how='inner') joined = pd.join(df_1_2_joined,df3, on='fruit', how='inner') 

OR

joined = df1.join(df2, on='fruit').join(df3, on='fruit') 

parameter how here is full analogue of SQL-JOINs types INNER|OUTER|LEFT|RIGHT

1 Comment

Hm, I can't get your first (or second) suggestion to work. Isn't the syntax for join, df.join(..)?