
This is an optimization problem that I'm hoping you creative SO users may have an answer to.

I have a large matrix (5 million x 2) with two columns: time and type. In essence, each "type" is its own time series -- the data below represents three different time series (one each for A, B, and C). There are 2000 different "types".

    mat
         time type
    [1,]   50    A
    [2,]   50    A
    [3,]   12    B
    [4,]   24    B
    [5,]   80    B
    [6,]   92    B
    [7,]   43    C
    [8,]   69    C

What is the most efficient way for me to find the correlations between these 2000 time series? My current approach builds a matrix with one row per time bin in which an event could have occurred, populates it with the number of events of each "type" that fall in each bin, and then loops over every pair of "type"s to compute their correlation. This is extremely inefficient (~5 hours); see the sketch below.
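To be concrete, my current approach looks roughly like this (a simplified sketch, not my exact code; the data frame df, the bin width of 1, and all variable names are illustrative):

    # df holds the same data as mat above, with columns time and type
    bins  <- seq(min(df$time), max(df$time) + 1, by = 1)  # one bin per time unit
    types <- unique(df$type)
    # counts[i, j] = number of events of types[j] whose time falls in bin i
    counts <- sapply(types, function(tp)
      tabulate(findInterval(df$time[df$type == tp], bins), nbins = length(bins)))
    # the slow part: an explicit loop over every pair of types
    res <- matrix(NA_real_, length(types), length(types),
                  dimnames = list(types, types))
    for (i in seq_along(types))
      for (j in seq_along(types))
        res[i, j] <- cor(counts[, i], counts[, j])

With 2000 types the double loop makes millions of separate cor calls, which is where the hours go.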

My whole problem would be solved if there were a way to add a by = 'type' argument to R's cor function.

Thanks for any insight.

1 Answer


You can try something like this

    set.seed(1)
    df <- data.frame(time = rnorm(15), type = rep(c("a", "b", "c"), each = 5))
    cor(do.call(cbind, split(df$time, df$type)))

             a        b        c
    a  1.00000  0.27890 -0.61497
    b  0.27890  1.00000 -0.78641
    c -0.61497 -0.78641  1.00000

This approach assumes that the number of observations per type is balanced.
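If your counts per type are not balanced, one possible workaround (just a sketch, not benchmarked) is to count events per time bin and type first, so that every type contributes a column of the same length; the number of bins (100 here) is an arbitrary choice:

    # Bin the times, then count events per (bin, type) cell; every type now
    # has exactly one count per bin, however many raw observations it has
    counts <- table(cut(df$time, breaks = 100), df$type)
    cor(unclass(counts))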

Now, back to the balanced case: we can do a real test with 5 million rows and 2000 different types.

    set.seed(1)
    df <- data.frame(time = rnorm(5e6), type = sample(rep(1:2000, each = 2500)))
    system.time(cor(do.call(cbind, split(df$time, df$type))))
    ##    user  system elapsed
    ##   6.387   0.000   6.391
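The reason this is fast: split groups the five million times into 2000 vectors of 2500 each, cbind assembles them into a single 2500 x 2000 matrix, and one call to cor then computes all pairwise correlations at once instead of looping over the ~2 million pairs explicitly.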

1 Comment

This is a great solution, but unfortunately my data set does not contain a balanced number of observations. Do you know whether there is a way to tweak it?
