pandas - merging multiple DataFrames

Question

This is a multi-part question. I just can't seem to combine everything together. The goal is to to create one DataFrame (guessing using MultiIndex) that I can access as follows:

ticker = 'GOLD' date = pd.to_datetime('1978/03/31') current_bar = df.ix[ticker].ix[date]

Can I then just say: current_bar.Last ?

Anyway, here are the files, and how I load them.

In [108]: df = pd.read_csv('GOLD.csv', parse_dates='Date', index_col='Date') In [109]: df Out[109]: Exp Last Volume Date 1978-03-30 198002 995.6 54 1978-03-31 198002 999.5 78 In [110]: df2 = pd.read_csv('SPX.csv', parse_dates='Date', index_col='Date') In [111]: df2 Out[111]: Exp Last Volume Date 1978-03-30 198003 215.5 25 1978-03-31 198003 214.1 99

Ideally, I want it to look like this (I think):

ticker GOLD SPX values Exp Last Volume Exp Last Volume Date 1978-03-30 198002 995.6 54 198003 215.5 25 1978-03-31 198002 999.5 78 198003 214.1 99

I guess my questions are:
- How do I make this Hierarchical (the actual data has 20+ identical columns for each file)
- How do I then combine the files (I have about 100 that need to all go in 1 DataFrame)
- Is my assumption correct that I can then just do: current_bar.Last to get values?

Thanks so much.

unutbu · Accepted Answer · 2014-07-24 02:02:21Z

You can use pd.concat to concatenate DataFrames. (Concatenating smushes DataFrames together, while merging joins DataFrames based on common indices or columns). When you supply the keys parameter, you get a hierarchical index:

import pandas as pd df = pd.read_csv('GOLD.csv', parse_dates='Date', index_col='Date', sep='\s+') df2 = pd.read_csv('SPX.csv', parse_dates='Date', index_col='Date', sep='\s+') result = pd.concat([df, df2], keys=['GOLD', 'SPX'], names=['ticker']).unstack('ticker') result = result.reorder_levels([1, 0], axis=1).sortlevel(level=0, axis=1) print(result)

yields

ticker GOLD SPX Exp Last Volume Exp Last Volume Date 1978-03-30 198002 995.6 54 198003 215.5 25 1978-03-31 198002 999.5 78 198003 214.1 99

result['Last'] yields the DataFrame:

In [147]: result['Last'] Out[147]: ticker GOLD SPX Date 1978-03-30 995.6 215.5 1978-03-31 999.5 214.1

I'd recommend avoiding the syntax result.Last because it is too close to result.last, which returns a DataFrame method.

To handle more files, you might use code like this:

import pandas as pd dfs = list() for filename in filenames: df = pd.read_csv(filename, parse_dates='Date', index_col='Date') # compute moving_mean dfs.append(df) keys = [filename[:-4] for filename in filenames] result = pd.concat(dfs, keys=keys, names=['ticker']).unstack('ticker')

Note that this does require enough memory to hold a list of all the DataFrames in memory plus enough memory to hold result.

Got it. What if I want to add another df to 'result' (ie 'NDX'). what would I use for the keys, given that result already has 2? (I just need to do this for 100 files, so need a way to do it iteratively. Thanks
The first argument to pd.concat can be a list of 100 DataFrames, and keys can be a list of 100 ticker names. For example, to concatenate 3 DataFrames: result = pd.concat([df, df2, df3], keys=['GOLD', 'SPX', 'NDX'], names=['ticker']).unstack('ticker')
Is there a way to do something like this: result = pd.concat([result, df3], keys =[result.keys, 'NDX'], names=['ticker']).unstack('ticker')?
The reason for my above question is that I need to perform some functions (moving_mean) on the files before adding them to the 'result' df. As far as I know I can't do those when in MultiIndex, correct?
I've edited the code (above) to show how you might compute the moving_mean before concatening.

Collectives™ on Stack Overflow

pandas - merging multiple DataFrames

1 Answer 1

5 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Linked

Related