
I've identified the following operation on a given Pandas DataFrame df as the bottleneck of my code:

df.corr() 

I was wondering whether there exist some drop-in replacements to speed this step up?

Thank you!

  • Maybe try numpy.corrcoef, e.g. pd.DataFrame(np.corrcoef(df.to_numpy(), rowvar=False)) Commented Jun 17, 2019 at 9:27
  • Pandas is already nicely optimized. The only possible speedup is to directly use the underlying numpy arrays (a possible small optimization) or to completely change the storage organization if relevant. Hard to say more with so little context... Commented Jun 17, 2019 at 9:34

1 Answer

You might try numpy.corrcoef instead:

pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns) 

Example Timings

# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 1000))

df.corr()
# 15 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)
# 24.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
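As a sketch, the replacement can be wrapped in a small helper that also restores the index (the function name fast_corr is my own; note the assumption that, unlike DataFrame.corr(), np.corrcoef does no pairwise NaN handling, so the results only match when the frame is all-numeric and NaN-free):

```python
import numpy as np
import pandas as pd

def fast_corr(df):
    """Approximate drop-in for df.corr() via np.corrcoef.

    Assumes all columns are numeric and contain no NaNs;
    np.corrcoef does not do the pairwise NaN exclusion that
    DataFrame.corr() performs.
    """
    return pd.DataFrame(
        np.corrcoef(df.to_numpy(), rowvar=False),  # correlate columns, not rows
        index=df.columns,
        columns=df.columns,
    )

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 5))
# Results agree with pandas up to floating-point error:
print(np.allclose(fast_corr(df).to_numpy(), df.corr().to_numpy()))
```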

1 Comment

Yes, this indeed offers much faster computation. Thanks!
