I have many dataframes. They all share the same column structure: "date", "open_position_profit", and more columns.

             date  open_position_profit  col2   col3
    0  2008-04-01                -260.0     1  290.0
    1  2008-04-02                -340.0     1  -60.0
    2  2008-04-03                 100.0     1   40.0
    3  2008-04-04                 180.0     1  -90.0
    4  2008-04-05                   0.0     0    0.0

Although "date" is present in all dataframes, they might not have the same number of rows: some dates might appear in one dataframe but not in another.

I want to compute a correlation matrix of the columns "open_position_profit" of all these dataframes.

I've tried this

    dfs = [df1[["date", "open_position_profit"]],
           df2[["date", "open_position_profit"]],
           ...]
    pd.concat(dfs).groupby('date', as_index=False).corr()

But this gives me a separate correlation for each date, each computed from a single cell:

                              open_position_profit
    0  open_position_profit                    1.0
    1  open_position_profit                    1.0
    2  open_position_profit                    1.0
    3  open_position_profit                    1.0
    4  open_position_profit                    NaN

I want the correlation for the entire time series, not each single cell. How can I do this?

1 Answer

If I understand your intention correctly, you need to do an outer join first. The following code performs an outer join on the "date" key; dates missing from one dataframe show up as NaN.

    df = pd.merge(df1, df2, on='date', how='outer')

             date  open_position_profit_x  open_position_profit_y  ...
    0  2019-01-01                     ...                     ...
    1  2019-01-02                     ...                     ...
    2  2019-01-03                     ...                     ...
    3  2019-01-04                     ...                     ...
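As an aside, the _x/_y names come from pd.merge's suffixes parameter (whose default is ('_x', '_y')), so you can pass clearer names if you prefer. The '_df1'/'_df2' suffixes below are only an illustration; the rest of this answer keeps the defaults:

    df = pd.merge(df1, df2, on='date', how='outer', suffixes=('_df1', '_df2'))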

Then you can calculate the correlation with the new DataFrame.

    df.corr()

                             open_position_profit_x  open_position_profit_y  ...
    open_position_profit_x                 1.000000                0.866025  ...
    open_position_profit_y                 0.866025                1.000000  ...
    ...                                         ...                     ...  ...
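The merge above handles two dataframes, while the question mentions many. One way to generalize (a sketch, not part of the original answer; df_list and the profit_<i> column names are my own illustration) is to rename each "open_position_profit" column so they don't collide, then chain outer merges with functools.reduce:

    from functools import reduce

    import pandas as pd

    # assumed: df_list holds all of your dataframes (df1, df2, ...)
    df_list = [df1, df2]

    # keep only the relevant columns and give each profit column a
    # distinct name so the merged frame has one column per dataframe
    renamed = [
        df[["date", "open_position_profit"]]
          .rename(columns={"open_position_profit": f"profit_{i}"})
        for i, df in enumerate(df_list)
    ]

    # chain outer joins on "date"; dates missing from a frame become NaN
    merged = reduce(
        lambda left, right: pd.merge(left, right, on="date", how="outer"),
        renamed,
    )

    # pairwise correlation of every profit column over the whole series
    corr_matrix = merged.drop(columns="date").corr()

Note that DataFrame.corr() computes each pairwise correlation from the rows where both columns are non-NaN, so dates missing from one dataframe simply drop out of that pair's calculation.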

See: pd.merge
