how to compute correlation coefficient for multi-variable 1 column

Question

Category SubCategory Month Value A A1 Jan 1 A A1 Feb 2 A A1 Mar 3 A A2 Jan 2 A A2 Feb 3 A A2 Mar 5 B B1 Jan 1 B B1 Feb 6 B B1 Mar 7 B B2 Jan 3 B B2 Feb 6 B B2 Mar 7

I have a sample pandas df like this. I would like to compute the correlation coefficient between Subgroup Categories A1 and A2, B1 and B2, but not A1 and B1 etc. My end goal is to have a table such as this:

 A1 A2 B1 B2 A1 1.0000 0.9820 A2 0.9820 1.0000 B1 1.0000 0.9963 B2 0.9963 1.0000

Can anyone help me with python code?

Obviously this one gives me a corr value of 1 for each SubCategory

df.groupby('SubCategory').corr()

OP's data is not normalized, so it has the potential to be inconsistent. So solutions would rely on the fact that data is valid, e.g., if one of the "Feb" be replaced by a "Apr", the solution by W-B would break, while that by indominus would give a result without detecting the error (also potential dangerous) — Indominus
– Indominus, Commented Dec 13, 2018 at 6:26

BENY · Accepted Answer · 2018-12-13 04:43:43Z

3

1st it is pivot problem , then just using corr

pd.concat([x.pivot('Month','SubCategory','Value').corr() for _,x in df.groupby('Category')]) A1 A2 B1 B2 SubCategory A1 1.000000 0.981981 NaN NaN A2 0.981981 1.000000 NaN NaN B1 NaN NaN 1.000000 0.996271 B2 NaN NaN 0.996271 1.000000

answered Dec 13, 2018 at 4:43

BENY

324k22 gold badges176 silver badges250 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Pyd Over a year ago

may I know why Month is index? and x[month].values are not displayed in index

rookiescientist Over a year ago

thank you for the answer, it works, however for the real-time application, I have 4700x 4700 matrix with mostly empty/NAN values. Is there any way to group these (create new dfs by category?)

tel Over a year ago

Just skip the final pd.concat, and you'll get 2 separate dataframes, one for each category.

BENY Over a year ago

@rookiescientist just do not do concat

Indominus · Accepted Answer · 2018-12-14 00:23:31Z

Data

import pandas as pd df = pd.DataFrame({"Category" : ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"], "SubCategory": ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B1", "B2", "B2", "B2"], "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7]})

Solution

import scipy as sp # this will contain a list of DataFrames storing the correlation matrices correlations = [] for g in df.groupby("Category"): sub_df = g[1][["SubCategory", "Value"]] data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list) correlation = pd.DataFrame(sp.corrcoef(data.values.tolist()[0]), columns=data.columns.values.tolist(), index=data.columns.values.tolist()) correlations.append(correlation) pd.concat(correlations, sort=False)

Output

 A1 A2 B1 B2 ________________________________________________ A1 1.000000 0.996271 NaN NaN A2 0.996271 1.000000 NaN NaN B1 NaN NaN 1.000000 0.996271 B2 NaN NaN 0.996271 1.000000

Update

This solution was tested on python and pandas versions as shown below, older versions may not work:

from platform import python_version print('python version:', python_version()) import pandas as pd print('pandas version:', pd.__version__) python version: 3.7.0 pandas version: 0.23.4

i copy pasted both of your code and it gives me this error: ValueError: Shape of passed values is (2, 2), indices imply (2, 1)
on this line: data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list)
hmm, I just got the same error on my phone, but I swear I tested on my desktop yesterday and it worked. the only difference might be my desktop has python 3.7 and my phone has 3.6, possibly different versions of pandad too. I'll check when I get home in the afternoon
I just restarted my desktop and tried the code again from fresh, and it still works. I suspect it is the versions of packages. May I ask what versions of python and pandas? Mine is 3.7.0 and 0.23.4. I just did a quick check on github, seems pivot_table was added in pandas 0.20.

Collectives™ on Stack Overflow

how to compute correlation coefficient for multi-variable 1 column

2 Answers 2

4 Comments

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

4 Comments

Linked

Related