2
Category SubCategory Month Value A A1 Jan 1 A A1 Feb 2 A A1 Mar 3 A A2 Jan 2 A A2 Feb 3 A A2 Mar 5 B B1 Jan 1 B B1 Feb 6 B B1 Mar 7 B B2 Jan 3 B B2 Feb 6 B B2 Mar 7 

I have a sample pandas df like this. I would like to compute the correlation coefficient between Subgroup Categories A1 and A2, B1 and B2, but not A1 and B1 etc. My end goal is to have a table such as this:

 A1 A2 B1 B2 A1 1.0000 0.9820 A2 0.9820 1.0000 B1 1.0000 0.9963 B2 0.9963 1.0000 

Can anyone help me with python code?

Obviously this one gives me a corr value of 1 for each SubCategory

df.groupby('SubCategory').corr() 
1
  • OP's data is not normalized, so it has the potential to be inconsistent. So solutions would rely on the fact that data is valid, e.g., if one of the "Feb" be replaced by a "Apr", the solution by W-B would break, while that by indominus would give a result without detecting the error (also potential dangerous) Commented Dec 13, 2018 at 6:26

2 Answers 2

3

1st it is pivot problem , then just using corr

pd.concat([x.pivot('Month','SubCategory','Value').corr() for _,x in df.groupby('Category')]) A1 A2 B1 B2 SubCategory A1 1.000000 0.981981 NaN NaN A2 0.981981 1.000000 NaN NaN B1 NaN NaN 1.000000 0.996271 B2 NaN NaN 0.996271 1.000000 
Sign up to request clarification or add additional context in comments.

4 Comments

may I know why Month is index? and x[month].values are not displayed in index
thank you for the answer, it works, however for the real-time application, I have 4700x 4700 matrix with mostly empty/NAN values. Is there any way to group these (create new dfs by category?)
Just skip the final pd.concat, and you'll get 2 separate dataframes, one for each category.
@rookiescientist just do not do concat
0

Data

import pandas as pd df = pd.DataFrame({"Category" : ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"], "SubCategory": ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B1", "B2", "B2", "B2"], "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7]}) 

Solution

import scipy as sp # this will contain a list of DataFrames storing the correlation matrices correlations = [] for g in df.groupby("Category"): sub_df = g[1][["SubCategory", "Value"]] data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list) correlation = pd.DataFrame(sp.corrcoef(data.values.tolist()[0]), columns=data.columns.values.tolist(), index=data.columns.values.tolist()) correlations.append(correlation) pd.concat(correlations, sort=False) 

Output

 A1 A2 B1 B2 ________________________________________________ A1 1.000000 0.996271 NaN NaN A2 0.996271 1.000000 NaN NaN B1 NaN NaN 1.000000 0.996271 B2 NaN NaN 0.996271 1.000000 

Update

This solution was tested on python and pandas versions as shown below, older versions may not work:

from platform import python_version print('python version:', python_version()) import pandas as pd print('pandas version:', pd.__version__) python version: 3.7.0 pandas version: 0.23.4 

4 Comments

i copy pasted both of your code and it gives me this error: ValueError: Shape of passed values is (2, 2), indices imply (2, 1)
on this line: data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list)
hmm, I just got the same error on my phone, but I swear I tested on my desktop yesterday and it worked. the only difference might be my desktop has python 3.7 and my phone has 3.6, possibly different versions of pandad too. I'll check when I get home in the afternoon
I just restarted my desktop and tried the code again from fresh, and it still works. I suspect it is the versions of packages. May I ask what versions of python and pandas? Mine is 3.7.0 and 0.23.4. I just did a quick check on github, seems pivot_table was added in pandas 0.20.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.