
I've identified the following operation on a given Pandas DataFrame df as the bottleneck of my code:

df.corr() 

I was wondering whether there exist some drop-in replacements to speed this step up?

Thank you!

  • Maybe try numpy.corrcoef, e.g. pd.DataFrame(np.corrcoef(df.to_numpy(), rowvar=False)) Commented Jun 17, 2019 at 9:27
  • Pandas is already nicely optimized. The only possible speedup is to directly use the underlying numpy arrays (a possible small optimization) or to completely change the storage organization if relevant. Hard to say more with so little context... Commented Jun 17, 2019 at 9:34

1 Answer

You might try numpy.corrcoef instead:

pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns) 

Example Timings

# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 1000))

df.corr()
# 15 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)
# 24.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
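As a sketch, the replacement can be wrapped in a small helper that also restores the index (the function name fast_corr is my own; note the assumption that, unlike DataFrame.corr(), np.corrcoef does no pairwise NaN handling, so the results only match when the frame is all-numeric and NaN-free):

```python
import numpy as np
import pandas as pd

def fast_corr(df):
    """Approximate drop-in for df.corr() via np.corrcoef.

    Assumes all columns are numeric and contain no NaNs;
    np.corrcoef does not do the pairwise NaN exclusion that
    DataFrame.corr() performs.
    """
    return pd.DataFrame(
        np.corrcoef(df.to_numpy(), rowvar=False),  # correlate columns, not rows
        index=df.columns,
        columns=df.columns,
    )

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 5))
# Results agree with pandas up to floating-point error:
print(np.allclose(fast_corr(df).to_numpy(), df.corr().to_numpy()))
```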

1 Comment

Yes, this indeed offers much faster computation. Thanks!
