I've identified the following operation on a given Pandas DataFrame df as the bottleneck of my code:
df.corr()
Are there any drop-in replacements that would speed this step up?
Thank you!
You might try numpy.corrcoef instead:
pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)

# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 1000))

df.corr()
# 15 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pd.DataFrame(np.corrcoef(df.values, rowvar=False), columns=df.columns)
# 24.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
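One caveat if you want this to behave like a true drop-in: df.corr() labels both the rows and the columns of the result and skips missing values pairwise, whereas np.corrcoef returns a bare array and propagates NaN. Below is a minimal sketch of a wrapper that accounts for both. The name fast_corr is hypothetical (it is not part of pandas or NumPy), and the sketch assumes every column is numeric and that you want the default Pearson correlation.

```python
import numpy as np
import pandas as pd


def fast_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between columns, computed with np.corrcoef.

    Hypothetical helper for illustration; assumes all columns are numeric.
    """
    # np.corrcoef does not skip missing values the way df.corr() does,
    # so fall back to pandas whenever the data contains any NaN.
    if df.isna().to_numpy().any():
        return df.corr()

    # rowvar=False tells NumPy that each column is a variable.
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)

    # Restore labels on both axes so the result lines up with df.corr().
    return pd.DataFrame(corr, index=df.columns, columns=df.columns)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = pd.DataFrame(rng.standard_normal((500, 4)), columns=list("abcd"))
    # Both approaches should agree up to floating-point error.
    print(np.allclose(fast_corr(demo), demo.corr()))
```

The fallback keeps the results consistent with df.corr() when data is missing, at the cost of losing the speedup in that case.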
What about numpy.corrcoef? E.g. pd.DataFrame(np.corrcoef(df.to_numpy(), rowvar=False))