0

I have a dataframe with following structure:

df <- data.frame( Replicate = c(rep("N1", 50), rep("N2", 50)), feature1 = rnorm(100, 0, 1), feature2 = rnorm(100, 0, 3), feature3 = rnorm(100, 0.1, 1) ) 

I am calculating the correlation between my (biological) replicates for each of my data columns (here "feature 1-3") with following code:

results_table <- data.frame(feature = NA, correlation = NA) for(i in colnames(df)[2:4]){ cor_i <- cor(df %>% filter(Replicate == "N1") %>% pull(i), df %>% filter(Replicate == "N2") %>% pull(i), use = "pairwise.complete") results_table_temp <- data.frame(feature = i, correlation = cor_i) results_table <- rbind(results_table, results_table_temp) } results_table <- results_table[2:nrow(results_table),] results_table 

I basically filter my initial dataframe for the respective replicate and calculate correlation between these replicate for each column (using a a for loop with cor() and store the output in dataframe).

For my dataset (240 rows with >7000 colums), the computing time is quite long! Is there a more efficient way to calculate this? Maybe a specific function or preprocessing of data to make the computation more efficient?

1 Answer 1

1

There are a few ways to do this:

Here is a simple way using summarize(across()) from dplyr

df %>% summarize(across( -Replicate, ~cor(.x[Replicate=="N1"], .x[Replicate=="N2"], use ="pairwise.complete") )) %>% t() 

Output:

 [,1] feature1 -0.035869831 feature2 -0.007740304 feature3 -0.051250907 

However, because it uses dplyr, it is unlikely to be very fast on your 7000 column dataset. At 7000x240, this is only ~1.7 million rows. Better to swing that table long, and group by feature.. Of course, use data.table for this, not dplyr. Here is an example with 7000 features

melt(dt, id="Replicate",variable.name = "feature")[ , .(correlation = cor(value[Replicate=="N1"], value[Replicate=="N2"], use="pairwise.complete")), by = feature] 

Output:

 feature correlation <fctr> <num> 1: V1 0.130108203 2: V2 -0.060648735 3: V3 0.109366966 4: V4 -0.279476904 5: V5 0.007624332 --- 6996: V6996 -0.112960673 6997: V6997 0.022128826 6998: V6998 -0.041648409 6999: V6999 -0.052700939 7000: V7000 -0.194483569 

Input:

set.seed(123) dt = lapply(1:7000, \(x) rnorm(n=240)) %>% as.data.table() dt[,Replicate:=c(rep("N1", 120), rep("N2", 120))] 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.