I have a data frame with 49 variables and 4M rows. I want to calculate its 49 x 49 correlation matrix. All columns are of class numeric.

Here's a sample:

df <- data.frame(replicate(49,sample(0:50,4000000,rep=TRUE))) 

I used the standard cor function.

cor_matrix <- cor(df, use = "pairwise.complete.obs") 

This is taking a really long time. I have 16 GB of RAM and a single-core i5 at 2.60 GHz.

Is there a way to make this calculation faster on my desktop?

  • You may check here Commented Mar 21, 2016 at 16:14
  • Your main problem is use = "pairwise.complete.obs". On my system (tested with 12 columns) that takes five times as long as use = "everything". Commented Mar 21, 2016 at 16:26
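
The comment above suggests that the pairwise NA handling is where the time goes. A minimal sketch of acting on that, assuming the data may or may not contain missing values: check for NAs once and only request pairwise deletion when it is needed. The names `df_small`, `use_mode`, `cor_fast` and `cor_pairwise` are made up for this illustration, and the data is a much smaller stand-in for the 4M-row frame so that it runs quickly.

```r
# Small stand-in for the 4M x 49 data frame from the question
# (hypothetical sizes, reduced so the example runs in well under a second).
set.seed(1)
df_small <- data.frame(replicate(5, sample(0:50, 10000, replace = TRUE)))

# Pairwise deletion is only needed when values are actually missing.
m <- as.matrix(df_small)
use_mode <- if (anyNA(m)) "pairwise.complete.obs" else "everything"
cor_fast <- cor(m, use = use_mode)

# For comparison: the setting used in the question.
cor_pairwise <- cor(m, use = "pairwise.complete.obs")
print(all.equal(cor_fast, cor_pairwise))
```

If the real data does contain NAs, `use_mode` falls back to "pairwise.complete.obs" and nothing is gained, so this only helps for complete data.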

1 Answer

There's a faster version of the cor function in the WGCNA package (used for inferring gene networks based on correlations). On my 3.1 GHz i7 with 16 GB of RAM, it computes the same 49 x 49 matrix about 20x faster:

mat <- replicate(49, as.numeric(sample(0:50,4000000,rep=TRUE)))

system.time(cor_matrix <- cor(mat, use = "pairwise.complete.obs"))
#   user  system elapsed
# 40.391   0.017  40.396

system.time(cor_matrix_w <- WGCNA::cor(mat, use = "pairwise.complete.obs"))
#   user  system elapsed
#  1.822   0.468   2.290

all.equal(cor_matrix, cor_matrix_w)
# [1] TRUE

Check the help file for the function (?WGCNA::cor) for details on the differences between the two versions when your data contains more missing observations.
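
Since the benchmark above uses data without missing values, it may be worth repeating the comparison on data that does have some. A rough sketch, assuming WGCNA may not be installed (hence the `requireNamespace` guard); `mat_na`, `cor_base` and `cor_w` are illustrative names, and the matrix is deliberately small:

```r
# Small test matrix with some values knocked out at random
# (hypothetical sizes; scale up to match your own data).
set.seed(42)
mat_na <- replicate(6, as.numeric(sample(0:50, 20000, replace = TRUE)))
mat_na[sample(length(mat_na), 500)] <- NA

# Reference result from the base implementation.
cor_base <- stats::cor(mat_na, use = "pairwise.complete.obs")

# Compare against WGCNA only if the package is available.
if (requireNamespace("WGCNA", quietly = TRUE)) {
  cor_w <- WGCNA::cor(mat_na, use = "pairwise.complete.obs")
  print(all.equal(cor_base, cor_w))
} else {
  message("WGCNA is not installed; skipping the comparison.")
}
```

If `all.equal` reports differences on your own data, the help file is the place to look for which arguments control the handling of missing values.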
