pandas concat DataFrame on different Index

Question

General Issue

I have an arbitrary list of pandas.DataFrame's (let's use 2 to keep the example clear), and I want to concat them on an Index that:

is neither the inner nor the outer join of the existing DataFrames
is a different, separate Index, but only has dates within all the DataFrame's

For example, take the following 2 DataFrame's (note the difference in Index shapes):

In [01]: d1 = pandas.DataFrame( numpy.random.randn(15, 4), columns = ['a', 'b', 'c', 'd'], index = pandas.DatetimeIndex(start = '01/01/2001', freq = 'b', periods = 15) ) In [02]: d2 = pandas.DataFrame( numpy.random.randn(17, 4), columns = ['e', 'f', 'g', 'h'], index = pandas.DatetimeIndex(start = '01/05/2001', freq = 'b', periods = 17) )

I would like to join these two DataFrame's on an intersecting Index, such my_index, constructed here:

In [03]: ind = range(0, 10, 2) In [04]: my_index = d2.index[ind].copy()

So the following result should have the same results as:

In [05]: d1.loc[my_index, :].join(d2.loc[my_index, :] ) Out[65]: a b c d e f \ 2001-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680 2001-01-09 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177 2001-01-11 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204 2001-01-15 0.430000 0.502100 0.194092 0.588685 -0.507332 1.404635 2001-01-17 1.005721 0.604771 -2.296667 0.157201 1.583537 1.359332 g h 2001-01-05 -1.183528 1.260880 2001-01-09 0.352487 0.700853 2001-01-11 1.060694 0.040667 2001-01-15 -0.044510 0.565152 2001-01-17 -0.731624 -0.331027

Personal Considerations

Because this is for a larger application, and I will have an arbitrary number of DataFrame's I'd like to:

Use existing pandas functionality instead of building my own hack, i.e. reduce( map ( ) ) etc.
Return views of the intersection of the DataFrame's instead of creating copies of the DataFrame's

unutbu · Accepted Answer · 2015-01-14 21:11:48Z

I don't think there is an out-of-the-box Pandas function for doing this. However, it's not hard to build your own:

def select_join(dfs, index): result = dfs[0].reindex(index) for df in dfs[1:]: result = result.join(df, how='inner') return result

For example,

import numpy as np import pandas as pd import string import itertools as IT columns = iter(string.letters) dfs = [] for i in range(3): d1 = pd.DataFrame( np.random.randn(15, 4), columns = list(IT.islice(columns, 4)), index = pd.DatetimeIndex(start = '01/01/2001', freq = 'b', periods = 15)) dfs.append(d1) ind = range(0, 10, 2) my_index = d1.index[ind].copy() print(select_join(dfs, my_index))

yields

 a b c d e f \ 2001-01-01 0.228430 -1.154375 -0.612703 -2.760826 -0.877355 -0.071581 2001-01-03 1.452750 1.341027 0.051486 1.231563 0.428353 1.320172 2001-01-05 -0.966979 -1.997200 -0.376060 -0.692346 -1.689897 0.549653 2001-01-09 -0.117443 -0.888103 2.092829 -0.467220 -1.083004 -1.443015 2001-01-11 -0.168980 -0.152663 0.365618 0.444175 -1.472091 -0.578182 g h i j k l 2001-01-01 -0.098758 0.920457 -1.072377 -0.627720 0.223060 0.903130 2001-01-03 1.962124 1.134501 -0.209813 -2.309090 0.358121 0.655156 2001-01-05 1.088195 -1.705393 -0.161167 -0.339617 0.945495 0.220701 2001-01-09 0.970829 1.931192 0.943150 -1.895580 0.815188 -1.485206 2001-01-11 0.747193 -1.221069 -0.164531 -0.395197 -0.754051 0.922090

Regarding the second consideration: It is impossible to return a view if index is arbitrary. The DataFrame stores data (of like dtype) in a NumPy array. When you select arbitrary rows from a NumPy array, space for a new array is allocated and the rows are copied from the original array into the new array. Only when the selection can be expressed as a basic slice is a view returned. This limitation of NumPy -- a very hard limitation to remove! -- bubbles up into Pandas, causing DataFrames to return copies when the index is not expressible as a basic slice.

I'm really surprised there isn't out of the box functionality for this (because it seems like trivial use case). Before I RTD I thought concat( [df_1, df_2], join_axes = my_axis) was the specific functionality I was looking for, however, you most certainly would know! Thanks for the response @unutbu!
For completeness, the fastest implementation I could come up with was: def join_on_index(df_list, index): return pandas.concat( map( lambda x: x.reindex(index), df_list), axis = 1)
Interesting! Feel free to post that as an answer (and accept it if you find that's the best solution.) The reason why I chose to avoid concat here is because it can raise an error if the index contains duplicates while join does not.
Then maybe I'll post it as an answer for completeness, but I will certainly only accept yours as the correct answer :-)

benjaminmgross · Accepted Answer · 2015-01-15 14:57:15Z

Different Methods & Their Times (for Completeness)

I've accepted @unutbu's answer, but I thought it might be valuable to show the two functions I created (and @unutbu's) and their different %timeitvalues in case anyone wants to use it:

Create the `df_list` and `my_index`:

dfs = [] for i in range(5): tmp = pandas.DataFrame( numpy.random.randn(1000, 4), columns = list(itertools.islice(columns, 4)), index = pandas.DatetimeIndex(start = '01/01/2000', freq = 'b', periods = 1000) ) dfs.append(tmp) ind = range(0, 1000, 2) my_index = tmp.index[ind].copy()

3 Different Implementations

def join_on_index_a(df_list, index): return pandas.concat( map( lambda x: x.reindex(index), df_list), axis = 1 ) #@unutbu's implementation def join_on_index_b(df_list, index): result = dfs[0].reindex(index) for df in dfs[1:]: result = result.join(df, how='inner') return result def join_on_index_c(df_list, index): return pandas.concat( map( lambda x: x.loc[index, :], df_list), axis = 1)

The Results Using iPython `%timeit`

In [49]: %timeit join_on_index_a(dfs, my_index) 1000 loops, best of 3: 1.85 ms per loop In [50]: %timeit join_on_index_b(dfs, my_index) 100 loops, best of 3: 1.94 ms per loop In [51]: %timeit join_on_index_c(dfs, my_index) 100 loops, best of 3: 21.5 ms per loop

Collectives™ on Stack Overflow

pandas concat DataFrame on different Index

General Issue

Personal Considerations

2 Answers 2

4 Comments

Different Methods & Their Times (for Completeness)

Create the `df_list` and `my_index`:

3 Different Implementations

The Results Using iPython `%timeit`

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

General Issue

Personal Considerations

2 Answers 2

4 Comments

Different Methods & Their Times (for Completeness)

Create the df_list and my_index:

3 Different Implementations

The Results Using iPython %timeit

Comments

Linked

Related

Create the `df_list` and `my_index`:

The Results Using iPython `%timeit`