Optimizing correlation calculation between (biological) replicates in R

Question

I have a dataframe with the following structure:

df <- data.frame(
  Replicate = c(rep("N1", 50), rep("N2", 50)),
  feature1 = rnorm(100, 0, 1),
  feature2 = rnorm(100, 0, 3),
  feature3 = rnorm(100, 0.1, 1)
)

I am calculating the correlation between my (biological) replicates for each of my data columns (here "feature1" to "feature3") with the following code:

library(dplyr)

# Start with a placeholder row; it is dropped after the loop
results_table <- data.frame(feature = NA, correlation = NA)

for (i in colnames(df)[2:4]) {
  # Correlate the N1 values of column i with the N2 values of the same column
  cor_i <- cor(
    df %>% filter(Replicate == "N1") %>% pull(i),
    df %>% filter(Replicate == "N2") %>% pull(i),
    use = "pairwise.complete"
  )
  results_table_temp <- data.frame(feature = i, correlation = cor_i)
  results_table <- rbind(results_table, results_table_temp)
}

# Remove the placeholder row
results_table <- results_table[2:nrow(results_table), ]
results_table

I basically filter my initial dataframe for the respective replicate and calculate the correlation between these replicates for each column (using a for loop with cor(), storing the output in a dataframe).

For my dataset (240 rows with >7000 columns), the computing time is quite long! Is there a more efficient way to calculate this? Maybe a specific function or some preprocessing of the data that makes the computation more efficient?

langtang · Accepted Answer · 2022-07-19 12:24:40Z

There are a few ways to do this:

Here is a simple way using summarize(across()) from dplyr:

df %>%
  summarize(across(
    -Replicate,
    ~cor(.x[Replicate == "N1"], .x[Replicate == "N2"], use = "pairwise.complete")
  )) %>%
  t()

Output:

                 [,1]
feature1 -0.035869831
feature2 -0.007740304
feature3 -0.051250907

However, because it uses dplyr, it is unlikely to be very fast on your 7000-column dataset. At 7000 x 240, the data amounts to only ~1.7 million rows in long format, so it is better to reshape the table to long format and group by feature. Use data.table for this rather than dplyr. Here is an example with 7000 features:

melt(dt, id = "Replicate", variable.name = "feature")[
  , .(correlation = cor(value[Replicate == "N1"], value[Replicate == "N2"],
                        use = "pairwise.complete")),
  by = feature
]

Output:

      feature  correlation
       <fctr>        <num>
   1:      V1  0.130108203
   2:      V2 -0.060648735
   3:      V3  0.109366966
   4:      V4 -0.279476904
   5:      V5  0.007624332
  ---
6996:   V6996 -0.112960673
6997:   V6997  0.022128826
6998:   V6998 -0.041648409
6999:   V6999 -0.052700939
7000:   V7000 -0.194483569

Input:

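library(data.table)
library(dplyr)  # only needed here for the %>% pipe

set.seed(123)
# 7000 feature columns (V1..V7000) with 240 rows each
dt = lapply(1:7000, \(x) rnorm(n = 240)) %>% as.data.table()
# First 120 rows are replicate N1, last 120 rows are replicate N2
dt[, Replicate := c(rep("N1", 120), rep("N2", 120))]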
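
As a further option that is not part of the answer above, the per-feature correlation can also be computed for all columns at once with base R matrix arithmetic. The following is only a sketch under stated assumptions: the data contain no missing values (so it does not reproduce use = "pairwise.complete"), both replicates have the same number of rows, and the rows of N1 and N2 are in matching order. The names m, replicate_id, a, b, vectorized and reference are made up for this illustration, and the matrix is kept small so the check runs quickly.

```r
set.seed(123)
n_per_rep <- 120   # rows per replicate (assumed equal for N1 and N2)
n_features <- 50   # small number of features for the illustration

# Simulated numeric matrix: rows are samples, columns are features
m <- matrix(rnorm(2 * n_per_rep * n_features), ncol = n_features)
colnames(m) <- paste0("V", seq_len(n_features))
replicate_id <- rep(c("N1", "N2"), each = n_per_rep)

# Split into one matrix per replicate
a <- m[replicate_id == "N1", , drop = FALSE]
b <- m[replicate_id == "N2", , drop = FALSE]

# Center each column, then compute the Pearson correlation column by column:
# sum(a_c * b_c) / sqrt(sum(a_c^2) * sum(b_c^2))
# NOTE: assumes there are no NA values anywhere in a or b
a_c <- scale(a, center = TRUE, scale = FALSE)
b_c <- scale(b, center = TRUE, scale = FALSE)
vectorized <- colSums(a_c * b_c) / sqrt(colSums(a_c^2) * colSums(b_c^2))

# Reference: one cor() call per column, as in the question
reference <- sapply(seq_len(n_features), function(j) cor(a[, j], b[, j]))

# Both approaches should agree up to floating point error
all.equal(unname(vectorized), reference)
```

Whether this is actually faster than the data.table version on a 240 x 7000 table would need to be benchmarked on the real data. If the data contain missing values, the grouped cor() call with use = "pairwise.complete" remains the safer choice.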
