7

I have large data frame, and I need to calculate efficiently correlation between the data frame rows and given value list. for example:

dfa= DataFrame(np.zeros((1,4)) ,columns=['a','b','c','d']) dfa.ix[0] = [2,6,8,12] a b c d 2.0 6.0 8.0 12.0 dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d']) a b c d 0 2 6 8 12 1 1 3 4 6 2 -1 -3 -4 -6 

I expect to get:

0 1 1 0.5 2 -0.5 

I tried many version, for example:

dfb.T.corrwith(dfa.T, axis=0) 

But ll I get is a lot of Nan's

2
  • Can you please make clear the rows you are trying to find correlation between? Commented Nov 2, 2017 at 12:12
  • Be sure to check out NumPy based one for performance - stackoverflow.com/a/47075690. Commented Nov 2, 2017 at 12:52

3 Answers 3

6

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1) 

Results

0 1.0 1 1.0 2 -1.0 dtype: float64 
Sign up to request clarification or add additional context in comments.

3 Comments

No need to use for loop, you can dataframe.corrwith(series). I.e.: dfb.corrwith(dfa.iloc[0], axis=1)
remove second solution, first is best
of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution.
4

I think the number that you are trying to get is not correlation coefficient actually. The correlation between 1st and second row is 1 not 0.5. Correlation is a measure of linear relationship between variables. Here the two lists are strongly correlated with pearson's coefficient 1. If you plot row0 [2,6,8,12] against row1 [1,3,4,6] they all lie on a single line. Mean while if you want to find correlation between rows this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

1 Comment

Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations.
2

Here's one using the correlation defintion with NumPy tools meant for performance with corr2_coeff_rowwise -

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values)) 

Sample run -

In [74]: dfa Out[74]: a b c d 0 2.0 6.0 8.0 12.0 In [75]: dfb Out[75]: a b c d 0 2 6 8 12 1 1 3 4 6 2 -1 -3 -4 -6 In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values)) Out[76]: 0 1.0 1 1.0 2 -1.0 dtype: float64 

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4))) In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4))) # @sera's soln In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1) 1 loop, best of 3: 4.09 s per loop In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values)) 1000 loops, best of 3: 1.53 ms per loop 

Case #2 : Decent number of rows in dfb and 400 columns -

In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400))) In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400))) In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1) 10 loops, best of 3: 44.8 ms per loop In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values)) 1000 loops, best of 3: 635 µs per loop 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.