how to calculate correlation between rows in python pandas data frame

Question

I have a large data frame, and I need to efficiently calculate the correlation between the data frame's rows and a given list of values. For example:

    import numpy as np
    from pandas import DataFrame

    dfa = DataFrame(np.zeros((1,4)), columns=['a','b','c','d'])
    dfa.iloc[0] = [2,6,8,12]  # .ix is no longer available in current pandas

         a    b    c     d
       2.0  6.0  8.0  12.0

    dfb = DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])

       a  b  c   d
    0  2  6  8  12
    1  1  3  4   6
    2 -1 -3 -4  -6

I expect to get:

    0    1
    1    0.5
    2   -0.5

I tried many versions, for example:

dfb.T.corrwith(dfa.T, axis=0)

But all I get is a lot of NaNs.

Can you please make clear the rows you are trying to find correlation between? (Yogesh, commented Nov 2, 2017 at 12:12)
Be sure to check out the NumPy-based one for performance: stackoverflow.com/a/47075690. (Divakar, commented Nov 2, 2017 at 12:52)

seralouk · Accepted Answer · 2017-11-02 12:31:57Z

6

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

    0    1.0
    1    1.0
    2   -1.0
    dtype: float64
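For readers who want to try this, here is a minimal self-contained sketch of the call above next to the attempt from the question. The variable names `row_corr` and `attempt` are only illustrative. The reading of why the original attempt returns NaN (pandas matches columns by label when both arguments are DataFrames) reflects how `corrwith` is documented to align its inputs, and is worth checking against your pandas version.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame([[2.0, 6.0, 8.0, 12.0]], columns=["a", "b", "c", "d"])
dfb = pd.DataFrame(
    [[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]],
    columns=["a", "b", "c", "d"],
)

# dfa.iloc[0] is a Series indexed by the column labels a..d.
# With axis=1, corrwith correlates that Series with each row of dfb,
# aligning on those labels.
row_corr = dfb.corrwith(dfa.iloc[0], axis=1)
print(row_corr)

# The attempt from the question passes a DataFrame instead of a Series.
# dfb.T has columns 0, 1, 2 while dfa.T only has column 0, so only
# column 0 finds a partner and the other two come back as NaN.
attempt = dfb.T.corrwith(dfa.T, axis=0)
print(attempt)
```

Passing a Series rather than a one-row DataFrame is what lets the same reference values be compared against every row.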

edited Nov 2, 2017 at 12:31

answered Nov 2, 2017 at 12:20


3 Comments

FcoRodr Over a year ago

No need to use a for loop; you can call DataFrame.corrwith with a Series, i.e.: dfb.corrwith(dfa.iloc[0], axis=1)

jezrael Over a year ago

remove second solution, first is best

Naomi Fridman Over a year ago

Of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution.
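As a side note on the slope mentioned in this comment: the values expected in the question (1, 0.5, -0.5) match the least-squares slope of each row of dfb regressed on the dfa row. Below is a rough vectorized sketch of that calculation. The helper name `rowwise_slope` is hypothetical, and it assumes the dfa row is the independent variable x and each dfb row is the dependent variable y.

```python
import numpy as np

def rowwise_slope(x, Y):
    """Least-squares slope of each row of Y regressed on x (hypothetical helper).

    x: 1-D array of length n (assumed to be the independent variable).
    Y: 2-D array of shape (m, n), one dependent series per row.
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_c = x - x.mean()                        # centered x
    Y_c = Y - Y.mean(axis=1, keepdims=True)   # each row centered on its own mean
    # slope = cov(x, y) / var(x), computed for all rows with one matrix product
    return (Y_c @ x_c) / (x_c @ x_c)

x = np.array([2, 6, 8, 12])
Y = np.array([[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]])
slopes = rowwise_slope(x, Y)
print(slopes)
```

Because the whole computation is a single matrix-vector product, it avoids fitting a separate regression per row, which is likely where the time went.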

Yogesh · Accepted Answer · 2018-12-05 05:05:06Z

I think the number you are trying to get is not actually a correlation coefficient. The correlation between the first and second rows is 1, not 0.5. Correlation is a measure of the linear relationship between variables. Here the two lists are perfectly correlated, with a Pearson coefficient of 1: if you plot row 0 [2,6,8,12] against row 1 [1,3,4,6], the points all lie on a single line. Meanwhile, if you want to find the correlation between rows, this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations.

Divakar · Accepted Answer · 2017-11-02 13:04:14Z

Here's one using the correlation definition with NumPy tools meant for performance, with corr2_coeff_rowwise:

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
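The body of `corr2_coeff_rowwise` is not reproduced in this answer. The following is only a sketch of what such a function might look like, written from the Pearson correlation definition; it is an assumption, not necessarily the original implementation. It relies on NumPy broadcasting so that a single-row `A` is compared against every row of `B`.

```python
import numpy as np
import pandas as pd

def corr2_coeff_rowwise(A, B):
    """Row-wise Pearson correlation between A and B (assumed sketch).

    A and B are 2-D arrays with the same number of columns. If one of them
    has a single row, it is broadcast against every row of the other.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Center every row on its own mean
    A_c = A - A.mean(axis=1, keepdims=True)
    B_c = B - B.mean(axis=1, keepdims=True)
    # Numerator: sum of products of deviations, one value per row
    num = (A_c * B_c).sum(axis=1)
    # Denominator: square root of the product of the sums of squares
    den = np.sqrt((A_c ** 2).sum(axis=1) * (B_c ** 2).sum(axis=1))
    return num / den

a_vals = np.array([[2.0, 6.0, 8.0, 12.0]])
b_vals = np.array([[2, 6, 8, 12], [1, 3, 4, 6], [-1, -3, -4, -6]])
result = pd.Series(corr2_coeff_rowwise(a_vals, b_vals))
print(result)
```

Note that a constant row has zero variance, so the division yields NaN (with a runtime warning) for that row.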

Sample run -

    In [74]: dfa
    Out[74]:
         a    b    c     d
    0  2.0  6.0  8.0  12.0

    In [75]: dfb
    Out[75]:
       a  b  c   d
    0  2  6  8  12
    1  1  3  4   6
    2 -1 -3 -4  -6

    In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
    Out[76]:
    0    1.0
    1    1.0
    2   -1.0
    dtype: float64

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

    In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4)))

    In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4)))

    # @sera's soln
    In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
    1 loop, best of 3: 4.09 s per loop

    In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
    1000 loops, best of 3: 1.53 ms per loop

Case #2 : Decent number of rows in dfb and 400 columns -

    In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400)))

    In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400)))

    In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
    10 loops, best of 3: 44.8 ms per loop

    In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
    1000 loops, best of 3: 635 µs per loop
