Calculate correlation between all columns of a DataFrame and all columns of another DataFrame?

Question

I have a DataFrame, stocks, filled with stock returns, and another DataFrame, industries, filled with industry returns (df1 and df2 in the example below). I want to find each stock's correlation with each industry.

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
    df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

The expensive way to do this is to merge the two DataFrame objects, calculate the full correlation matrix, and then throw out all the stock-to-stock and industry-to-industry correlations. Is there a more efficient way to do this?
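For reference, the merge-and-discard approach described above might look like the following minimal sketch. The names stocks and industries follow the question text, and the smaller sample size of 1000 rows is an assumption made only to keep the example quick.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
stocks = pd.DataFrame({'s1': np.random.randn(1000), 's2': np.random.randn(1000)})
industries = pd.DataFrame({'i1': np.random.randn(1000), 'i2': np.random.randn(1000)})

# Join the two frames side by side on their shared index,
# then compute the full 4x4 correlation matrix.
combined = pd.concat([stocks, industries], axis=1)
full_corr = combined.corr()

# Keep only the stock-by-industry block; the stock-to-stock and
# industry-to-industry blocks were computed for nothing.
cross_corr = full_corr.loc[stocks.columns, industries.columns]
print(cross_corr)
```

With m stock columns and k industry columns, this computes (m + k)^2 correlations to keep only m * k of them, which is the waste the question is asking how to avoid.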

ytsaig · Accepted Answer · 2016-03-30 07:27:33Z

And here's a one-liner that uses apply on the columns and avoids the nested for loops. The main benefit is that apply builds the result in a DataFrame.

df1.apply(lambda s: df2.corrwith(s))

JohnE · Accepted Answer · 2021-05-14 15:00:41Z

Here's a slightly simpler answer than @JohnE's that uses pandas natively instead of using numpy.corrcoef. As an added bonus, you don't have to retrieve the correlation value out of a silly 2x2 correlation matrix, because pandas's series-to-series correlation function simply returns a number, not a matrix.

    for s in ['s1', 's2']:
        for i in ['i1', 'i2']:
            print(df1[s].corr(df2[i]))

This is not as simple as @ytsaig's but is approx 5x faster based on some quick timings I did, so you should consider this answer if you need a faster solution.
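To check the speed claim on your own machine, a rough timing sketch along these lines may help. The helper names with_apply and with_loops are hypothetical, and the loop version collects its values into a DataFrame (rather than printing them) so that the two results are comparable. Timings depend heavily on the pandas version and the data size, so treat any ratio you see as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

def with_apply():
    # The apply/corrwith one-liner.
    return df1.apply(lambda s: df2.corrwith(s))

def with_loops():
    # Nested loops over Series.corr, collected into a DataFrame
    # with df2's columns as rows and df1's columns as columns.
    return pd.DataFrame(
        {s: {i: df1[s].corr(df2[i]) for i in df2.columns} for s in df1.columns}
    )

# Both approaches should produce the same table of correlations.
same = np.allclose(with_apply().values, with_loops().values)
print('results match:', same)

runs = 20
for fn in (with_apply, with_loops):
    seconds = timeit.timeit(fn, number=runs)
    print(fn.__name__, round(seconds / runs * 1000, 3), 'ms per call')
```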

JohnE · Accepted Answer · 2021-05-14 15:06:45Z

Edit to add: I'll leave this answer for posterity but would recommend the later answers. In particular, use @ytsaig's if you want the simplest answer but use @failwhales's if you want a faster answer (seems to be about 5x faster than @ytsaig's in some quick timings I did using the data in the OP and about the same speed as mine).

Original answer: You could go with numpy.corrcoef() which is basically the same as corr in pandas, but the syntax may be more amenable to what you want.

    for s in ['s1', 's2']:
        for i in ['i1', 'i2']:
            print('corrcoef', s, i, np.corrcoef(df1[s], df2[i])[0, 1])

That prints:

    corrcoef s1 i1 -0.00416977553597
    corrcoef s1 i2 -0.0096393047035
    corrcoef s2 i1 -0.026278689352
    corrcoef s2 i2 -0.00402030582064

Alternatively you could load the results into a dataframe with appropriate labels:

    # DataFrame.append was removed in pandas 2.0, so collect the values first.
    cc = {}
    for s in ['s1', 's2']:
        for i in ['i1', 'i2']:
            cc[s + '_' + i] = np.corrcoef(df1[s], df2[i])[0, 1]
    cc = pd.DataFrame({'corrcoef': pd.Series(cc)})

Which looks like this:

           corrcoef
    s1_i1 -0.004170
    s1_i2 -0.009639
    s2_i1 -0.026279
    s2_i2 -0.004020

jarekj71 · Accepted Answer · 2020-07-17 17:13:56Z

Quite late, but here is a more general solution:

    def corrmatrix(df1, df2):
        s = df1.values.shape[1]
        cr = np.corrcoef(df1.values.T, df2.values.T)[s:, :s]
        return pd.DataFrame(cr, index=df2.columns, columns=df1.columns)
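As a usage sketch, the function can be run on the data from the question and compared with the apply/corrwith one-liner. The function is repeated here so the block runs on its own; the variables result and expected are illustrative names only.

```python
import numpy as np
import pandas as pd

def corrmatrix(df1, df2):
    s = df1.values.shape[1]
    # np.corrcoef treats each row as a variable, so stacking the two
    # transposed frames gives an (s + k) x (s + k) matrix; the lower-left
    # block holds the df2-versus-df1 correlations.
    cr = np.corrcoef(df1.values.T, df2.values.T)[s:, :s]
    return pd.DataFrame(cr, index=df2.columns, columns=df1.columns)

np.random.seed(123)
df1 = pd.DataFrame({'s1': np.random.randn(10000), 's2': np.random.randn(10000)})
df2 = pd.DataFrame({'i1': np.random.randn(10000), 'i2': np.random.randn(10000)})

# Rows are df2's columns (i1, i2); columns are df1's columns (s1, s2).
result = corrmatrix(df1, df2)
print(result)

# Same orientation as the apply/corrwith one-liner, so the two can be compared directly.
expected = df1.apply(lambda s: df2.corrwith(s))
print(np.allclose(result.values, expected.values))
```

One caveat worth keeping in mind: np.corrcoef does not skip missing values, whereas the pandas corr methods drop NaN pairs, so the two approaches can disagree on data with gaps.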
