
I would like to perform a correlation test using Python (equivalent to corr.test(x, y) in R).

My input is a Pandas dataframe. Looks something like the following:

df1:

      Column1 Column2     Column3  Column4  Column5  Column6
    0     ab1     bc1    6.843147      NaN     5.12      NaN
    1     ab2     ab5         NaN   5.6789    6.666    54.72
    2     ab3     bc4       11.45      NaN   12.765     5.12
    3     ab4     ab5  328.880123      NaN     0.50    88.44
    4     ab5     ab1   72.142790    55.89      NaN    18.12

How do I compute correlations for the data (Column3 to Column6)?

Note: There are more than 50 columns for correlation in the original data.

1 Answer


You can correlate all the numeric columns at once with DataFrame.corr():

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html


Or correlate any pair of columns (remembering that each column is a Series) with Series.corr():

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html

For example, given your data above, the correlation between columns 5 and 6 is given by:

    In [10]: df
    Out[10]:
      Column1 Column2     Column3  Column4  Column5  Column6
    0     ab1     bc1    6.843147      NaN    5.120      NaN
    1     ab2     ab5         NaN   5.6789    6.666    54.72
    2     ab3     bc4   11.450000      NaN   12.765     5.12
    3     ab4     ab5  328.880123      NaN    0.500    88.44
    4     ab5     ab1   72.142790  55.8900      NaN    18.12

    In [11]: df.loc[:, 'Column5'].corr(df.loc[:, 'Column6'])
    Out[11]: -0.9936504010065057

Or, to loop through all pairs of columns (not the most elegant, but it works):

    In [12]: for c1 in df.columns[0:-1]:
        ...:     for c2 in df.loc[:, c1:].columns:
        ...:         if c2 != c1:
        ...:             print('Correlation', c1, c2, '=', df.loc[:, c1].corr(df.loc[:, c2]))
        ...:
    ...function_base.py:2551: RuntimeWarning: Degrees of freedom <= 0 for slice
      c = cov(x, y, rowvar)
    ...function_base.py:2480: RuntimeWarning: divide by zero encountered in true_divide
      c *= np.true_divide(1, fact)
    Correlation Column3 Column4 = nan
    Correlation Column3 Column5 = -0.779129
    Correlation Column3 Column6 = 0.999368
    Correlation Column4 Column5 = nan
    Correlation Column4 Column6 = -1.000000
    Correlation Column5 Column6 = -0.993650

For an entire correlation matrix:

    In [36]: df
    Out[36]:
      Column1 Column2     Column3  Column4  Column5  Column6
    0     ab1     bc1    6.843147      NaN    5.120      NaN
    1     ab2     ab5         NaN   5.6789    6.666    54.72
    2     ab3     bc4   11.450000      NaN   12.765     5.12
    3     ab4     ab5  328.880123      NaN    0.500    88.44
    4     ab5     ab1   72.142790  55.8900      NaN    18.12

    In [37]: df.corr()
    Out[37]:
              Column3  Column4   Column5   Column6
    Column3  1.000000      NaN -0.779129  0.999368
    Column4       NaN      1.0       NaN -1.000000
    Column5 -0.779129      NaN  1.000000 -0.993650
    Column6  0.999368     -1.0 -0.993650  1.000000

Notice that with DataFrame.corr(), which gives a correlation matrix, the intersection of any two columns shows the same correlation that Series.corr() produced while looping through the columns. The DataFrame.corr() approach is therefore simpler code-wise, because you don't have to write your own loops.
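Since the question mentions more than 50 columns, it may also help to restrict the matrix to a range of columns before calling .corr(). A minimal sketch, assuming the column names from the sample data above (label-based .loc slicing is inclusive on both ends):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column1': ['ab1', 'ab2', 'ab3', 'ab4', 'ab5'],
    'Column2': ['bc1', 'ab5', 'bc4', 'ab5', 'ab1'],
    'Column3': [6.843147, np.nan, 11.45, 328.880123, 72.142790],
    'Column4': [np.nan, 5.6789, np.nan, np.nan, 55.89],
    'Column5': [5.12, 6.666, 12.765, 0.50, np.nan],
    'Column6': [np.nan, 54.72, 5.12, 88.44, 18.12],
})

# Keep only Column3..Column6, then build the correlation matrix
corr = df.loc[:, 'Column3':'Column6'].corr()
print(corr.shape)  # (4, 4)
```

The same pattern scales to 50+ columns; only the slice labels change.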

P.S. I just realized you also want the p-value (not just the correlation coefficient), since the R function cor.test() returns both the coefficient and its significance. I'm not sure how to do that with Pandas alone. I poked around and found a page which states, about half-way down, "Pandas does not have a function that calculates p-values, so it is better to use SciPy to calculate correlation as it will give you both p-value and correlation coefficient," and then shows how to do that.
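Along those lines, a minimal sketch using scipy.stats.pearsonr, which returns both the coefficient and the p-value (pearsonr does not skip NaN by itself, so the pair of columns is dropna'd first; data values are taken from the sample above):

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'Column5': [5.12, 6.666, 12.765, 0.50, np.nan],
    'Column6': [np.nan, 54.72, 5.12, 88.44, 18.12],
})

# Keep only rows where both columns are present, mirroring
# pandas' pairwise-complete behaviour in Series.corr()
pair = df[['Column5', 'Column6']].dropna()
r, p = stats.pearsonr(pair['Column5'], pair['Column6'])
print(r)  # matches the -0.99365... from Series.corr() above
print(p)
```

The r here agrees with the Series.corr() result; the p-value is the extra piece that Pandas alone doesn't give you.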


3 Comments

Thanks for the information. Could you please give a detailed or specific example? Thanks again!
@SucharitaMuthuswamy - I have edited the answer to include some examples using your data. Hope that helps.
Thank you very much! I was trying to use Pingouin (github.com/raphaelvallat/pingouin), since its output is similar to R's. But if I divide the df into x and y, I always get an error: AttributeError: 'str' object has no attribute '_get_numeric_data'.
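That AttributeError usually means a string (e.g. a column name) was passed where a DataFrame was expected, since _get_numeric_data() is a method of DataFrames. One way to avoid mixing the string columns into a correlation, as a sketch with assumed column names, is to select only the numeric columns first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column1': ['ab1', 'ab2', 'ab3'],
    'Column2': ['bc1', 'ab5', 'bc4'],
    'Column3': [6.84, np.nan, 11.45],
    'Column4': [np.nan, 5.68, np.nan],
})

# Restrict to numeric columns before correlating, so that
# string columns never reach the correlation routine
numeric = df.select_dtypes(include='number')
corr = numeric.corr()
print(numeric.columns.tolist())  # ['Column3', 'Column4']
```

The resulting numeric DataFrame can then be handed to whichever correlation routine you prefer.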