1

General Issue

I have an arbitrary list of pandas.DataFrame's (let's use 2 to keep the example clear), and I want to concat them on an Index that:

  1. is neither the inner nor the outer join of the existing DataFrames
  2. is a different, separate Index, but only has dates within all the DataFrame's

For example, take the following 2 DataFrame's (note the difference in Index shapes):

In [01]: d1 = pandas.DataFrame( numpy.random.randn(15, 4), columns = ['a', 'b', 'c', 'd'], index = pandas.DatetimeIndex(start = '01/01/2001', freq = 'b', periods = 15) ) In [02]: d2 = pandas.DataFrame( numpy.random.randn(17, 4), columns = ['e', 'f', 'g', 'h'], index = pandas.DatetimeIndex(start = '01/05/2001', freq = 'b', periods = 17) ) 

I would like to join these two DataFrame's on an intersecting Index, such my_index, constructed here:

In [03]: ind = range(0, 10, 2) In [04]: my_index = d2.index[ind].copy() 

So the following result should have the same results as:

In [05]: d1.loc[my_index, :].join(d2.loc[my_index, :] ) Out[65]: a b c d e f \ 2001-01-05 1.702556 -0.885554 0.766257 -0.731700 -1.071232 1.806680 2001-01-09 -0.968689 -0.700311 1.024988 -0.705764 0.804285 -0.337177 2001-01-11 1.249893 -0.613356 1.975736 -0.093838 0.428004 0.634204 2001-01-15 0.430000 0.502100 0.194092 0.588685 -0.507332 1.404635 2001-01-17 1.005721 0.604771 -2.296667 0.157201 1.583537 1.359332 g h 2001-01-05 -1.183528 1.260880 2001-01-09 0.352487 0.700853 2001-01-11 1.060694 0.040667 2001-01-15 -0.044510 0.565152 2001-01-17 -0.731624 -0.331027 

Personal Considerations

Because this is for a larger application, and I will have an arbitrary number of DataFrame's I'd like to:

  1. Use existing pandas functionality instead of building my own hack, i.e. reduce( map ( ) ) etc.
  2. Return views of the intersection of the DataFrame's instead of creating copies of the DataFrame's

2 Answers 2

2

I don't think there is an out-of-the-box Pandas function for doing this. However, it's not hard to build your own:

def select_join(dfs, index): result = dfs[0].reindex(index) for df in dfs[1:]: result = result.join(df, how='inner') return result 

For example,

import numpy as np import pandas as pd import string import itertools as IT columns = iter(string.letters) dfs = [] for i in range(3): d1 = pd.DataFrame( np.random.randn(15, 4), columns = list(IT.islice(columns, 4)), index = pd.DatetimeIndex(start = '01/01/2001', freq = 'b', periods = 15)) dfs.append(d1) ind = range(0, 10, 2) my_index = d1.index[ind].copy() print(select_join(dfs, my_index)) 

yields

 a b c d e f \ 2001-01-01 0.228430 -1.154375 -0.612703 -2.760826 -0.877355 -0.071581 2001-01-03 1.452750 1.341027 0.051486 1.231563 0.428353 1.320172 2001-01-05 -0.966979 -1.997200 -0.376060 -0.692346 -1.689897 0.549653 2001-01-09 -0.117443 -0.888103 2.092829 -0.467220 -1.083004 -1.443015 2001-01-11 -0.168980 -0.152663 0.365618 0.444175 -1.472091 -0.578182 g h i j k l 2001-01-01 -0.098758 0.920457 -1.072377 -0.627720 0.223060 0.903130 2001-01-03 1.962124 1.134501 -0.209813 -2.309090 0.358121 0.655156 2001-01-05 1.088195 -1.705393 -0.161167 -0.339617 0.945495 0.220701 2001-01-09 0.970829 1.931192 0.943150 -1.895580 0.815188 -1.485206 2001-01-11 0.747193 -1.221069 -0.164531 -0.395197 -0.754051 0.922090 

Regarding the second consideration: It is impossible to return a view if index is arbitrary. The DataFrame stores data (of like dtype) in a NumPy array. When you select arbitrary rows from a NumPy array, space for a new array is allocated and the rows are copied from the original array into the new array. Only when the selection can be expressed as a basic slice is a view returned. This limitation of NumPy -- a very hard limitation to remove! -- bubbles up into Pandas, causing DataFrames to return copies when the index is not expressible as a basic slice.

Sign up to request clarification or add additional context in comments.

4 Comments

I'm really surprised there isn't out of the box functionality for this (because it seems like trivial use case). Before I RTD I thought concat( [df_1, df_2], join_axes = my_axis) was the specific functionality I was looking for, however, you most certainly would know! Thanks for the response @unutbu!
For completeness, the fastest implementation I could come up with was: def join_on_index(df_list, index): return pandas.concat( map( lambda x: x.reindex(index), df_list), axis = 1)
Interesting! Feel free to post that as an answer (and accept it if you find that's the best solution.) The reason why I chose to avoid concat here is because it can raise an error if the index contains duplicates while join does not.
Then maybe I'll post it as an answer for completeness, but I will certainly only accept yours as the correct answer :-)
1

Different Methods & Their Times (for Completeness)

I've accepted @unutbu's answer, but I thought it might be valuable to show the two functions I created (and @unutbu's) and their different %timeitvalues in case anyone wants to use it:

Create the df_list and my_index:

dfs = [] for i in range(5): tmp = pandas.DataFrame( numpy.random.randn(1000, 4), columns = list(itertools.islice(columns, 4)), index = pandas.DatetimeIndex(start = '01/01/2000', freq = 'b', periods = 1000) ) dfs.append(tmp) ind = range(0, 1000, 2) my_index = tmp.index[ind].copy() 

3 Different Implementations

def join_on_index_a(df_list, index): return pandas.concat( map( lambda x: x.reindex(index), df_list), axis = 1 ) #@unutbu's implementation def join_on_index_b(df_list, index): result = dfs[0].reindex(index) for df in dfs[1:]: result = result.join(df, how='inner') return result def join_on_index_c(df_list, index): return pandas.concat( map( lambda x: x.loc[index, :], df_list), axis = 1) 

The Results Using iPython %timeit

In [49]: %timeit join_on_index_a(dfs, my_index) 1000 loops, best of 3: 1.85 ms per loop In [50]: %timeit join_on_index_b(dfs, my_index) 100 loops, best of 3: 1.94 ms per loop In [51]: %timeit join_on_index_c(dfs, my_index) 100 loops, best of 3: 21.5 ms per loop 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.