R: Rolling calculation of column values (avoid loop)

Question

I want to incrementally grow a new column, where each value depends on the previous row's value in that same column. You could do it with a loop, like so:

    df <- data.frame(a = 2000:2010, b = 10:20, c = seq(1000, 11000, 1000), x = 1000)

    for (i in 2:nrow(df)) df$x[i] <- (df$c[i]) * df$a[i-1] / df$x[i-1] + df$b[i] * df$a[i]

    df
          a  b     c        x
    1  2000 10  1000  1000.00
    2  2001 11  2000 26011.00
    3  2002 12  3000 24254.79
    4  2003 13  4000 26369.16
    5  2004 14  5000 28435.80
    6  2005 15  6000 30497.85
    7  2006 16  7000 32556.20
    8  2007 17  8000 34611.93
    9  2008 18  9000 36665.87
    10 2009 19 10000 38718.65
    11 2010 20 11000 40770.76

(As you see, new values in column x use values of column x of the previous row.)

However, as I do this for a Shiny app, I need to have fast calculation, thus using loops is not optimal. Is there a way of doing this which avoids loops, ideally making use of dplyr's piping? This reply (Referring to previous row in calculation) suggests a way using sapply - however, I am unable to do this mathematically...

Moving parts of your calculation outside of the loop should speed things up quite a bit, regardless of the looping construct you choose.
– user10191355, Commented Jan 1, 2020 at 10:28

Community · Accepted Answer · 2020-06-20 09:12:55Z

There are a few options.

Use vectors

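On each iteration, assigning to df$x[i] is expensive because modifying a data frame column can trigger a copy, which costs both time and memory. Instead, you can extract the columns into plain vectors before the loop and index those vectors.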

    # easiest - extract the vectors before the loop
    C <- df[['c']]  # used big C because c() is a function
    a <- df[['a']]
    b <- df[['b']]
    x <- df[['x']]

    for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i-1] / x[i-1L] + b[i] * a[i]
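Going one step further, in the spirit of the comment on the question, the terms that do not depend on x can be computed once, vectorised, before the loop. Only the division by the previous x has to stay sequential. This is a minimal sketch using the question's data; `num` and `add` are names introduced here for illustration.

```r
df <- data.frame(a = 2000:2010, b = 10:20,
                 c = seq(1000, 11000, 1000), x = 1000)

n <- nrow(df)

# Loop-invariant parts, computed once (illustrative names).
num <- df$c * c(NA, df$a[-n])   # c[i] * a[i-1]; NA for the first row
add <- df$b * df$a              # b[i] * a[i]

x <- df$x
for (i in seq_len(n)[-1]) {
  # Only the dependence on the previous x remains inside the loop.
  x[i] <- num[i] / x[i - 1] + add[i]
}

# Write the result back to the data frame once, after the loop.
df$x <- x
```

Writing back with a single `df$x <- x` keeps the data frame out of the loop entirely.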

Use a function

Turning your loop into a function will improve performance, likely because R byte-compiles functions and the compiled code runs faster.

    f_recurse = function(a, b, C, x){
      for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i-1] / x[i-1L] + b[i] * a[i]
      x
    }

    f_recurse(df$a, df$b, df$c, df$x)
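Since the question asked for a formulation without an explicit loop, one base R possibility is `Reduce()` with `accumulate = TRUE`, which carries the previous x forward. This is only a sketch and not part of the original answer: `f_reduce` and its argument `x0` are hypothetical names, and because `Reduce()` itself iterates in R, it is typically no faster than the function above.

```r
df <- data.frame(a = 2000:2010, b = 10:20,
                 c = seq(1000, 11000, 1000), x = 1000)

# Hypothetical helper: same recurrence, expressed as a fold over row indices.
f_reduce <- function(a, b, C, x0) {
  n <- length(a)
  if (n < 2L) return(rep(x0, n))
  num <- C[-1] * a[-n]   # C[i] * a[i-1] for i = 2..n
  add <- b[-1] * a[-1]   # b[i] * a[i]   for i = 2..n
  step <- function(prev, k) num[k] / prev + add[k]
  # The third argument is the initial value; accumulate = TRUE keeps every step.
  Reduce(step, seq_len(n - 1L), x0, accumulate = TRUE)
}

x_reduce <- f_reduce(df$a, df$b, df$c, df$x[1])
```

The result can be assigned back with `df$x <- x_reduce`, which also makes it usable inside a dplyr `mutate()` call.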

Use Rcpp

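Finally, if the response is still too laggy, you can try Rcpp. Note that Rcpp updates x in place here: because df$x is already a double vector, the NumericVector x refers to the same memory rather than a copy. So while I return a vector, there's really no need, as df$x has also been updated.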

    library(Rcpp)

    cppFunction('
    NumericVector f_recurse_rcpp(IntegerVector a, IntegerVector b, NumericVector C, NumericVector x){
      for (int i = 1; i < x.size(); ++i){
        x[i] = C[i] * a[i-1] / x[i - 1] + b[i] * a[i];
      }
      return(x);
    }
    ')

    f_recurse_rcpp(df$a, df$b, df$c, df$x)

Performance

In all, we get close to a 1,000 times performance increase (a median of 8.8ms for the original loop versus 10us for Rcpp). The table below is from bench::mark, which also checks that the expressions return equal results.

    # A tibble: 4 x 13
      expression                                 min  median `itr/sec` mem_alloc
      <bch:expr>                             <bch:t> <bch:t>     <dbl> <bch:byt>
    1 OP                                      8.27ms   8.8ms      106.   62.04KB
    2 extract                                 6.21ms  7.49ms      126.   46.16KB
    3 f_recurse(df$a, df$b, df$c, df$x)       13.1us  28.8us    33295.        0B
    4 f_recurse_rcpp(df$a, df$b, df$c, df$x)   8.6us    10us    98240.    2.49KB
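The equality check that bench::mark performs can also be done by hand in base R. The following is a minimal sketch on the question's example data; `x_loop`, `x_fun`, `results_match` and `timing` are names introduced here, and `system.time()` is only a rough substitute for a proper benchmark.

```r
df <- data.frame(a = 2000:2010, b = 10:20,
                 c = seq(1000, 11000, 1000), x = 1000)

# Original approach: update the data frame column inside the loop.
df_loop <- df
for (i in 2:nrow(df_loop)) {
  df_loop$x[i] <- df_loop$c[i] * df_loop$a[i - 1] / df_loop$x[i - 1] +
    df_loop$b[i] * df_loop$a[i]
}
x_loop <- df_loop$x

# Function approach from the answer.
f_recurse <- function(a, b, C, x) {
  for (i in seq_along(x)[-1]) x[i] <- C[i] * a[i - 1] / x[i - 1L] + b[i] * a[i]
  x
}
x_fun <- f_recurse(df$a, df$b, df$c, df$x)

# Both approaches should agree up to floating-point tolerance.
results_match <- isTRUE(all.equal(x_loop, x_fun))

# Rough timing of 1,000 repeated calls; the numbers vary by machine.
timing <- system.time(for (k in 1:1000) f_recurse(df$a, df$b, df$c, df$x))
```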

And here are examples with a 1,000-row data.frame and then a 10,000-row one:

    df <- data.frame(a = sample(1000L), b = sample(1001:2000), c = seq(1000, 11000, length.out = 1000), x = rep(3, 1000L))

    # A tibble: 4 x 13
      expression                                 min   median `itr/sec` mem_alloc
      <bch:expr>                             <bch:t> <bch:tm>     <dbl> <bch:byt>
    1 OP                                      23.9ms  24.38ms      39.4    7.73MB
    2 extract                                  6.5ms   7.71ms     123.    69.84KB
    3 f_recurse(df$a, df$b, df$c, df$x)      265.7us  271.9us    3596.    23.68KB
    4 f_recurse_rcpp(df$a, df$b, df$c, df$x)  17.4us   18.9us   51845.     2.49KB

    df <- data.frame(a = sample(10000L), b = sample(10001:20000), c = seq(1000, 11000, length.out = 10000), x = rep(3, 10000L))

    # A tibble: 4 x 13
      expression                                  min   median `itr/sec` mem_alloc
      <bch:expr>                             <bch:tm> <bch:tm>     <dbl> <bch:byt>
    1 OP                                     353.17ms 412.62ms      2.42  763.38MB
    2 extract                                  8.75ms   8.95ms    107.    280.77KB
    3 f_recurse(df$a, df$b, df$c, df$x)        2.58ms   2.61ms    376.    234.62KB
    4 f_recurse_rcpp(df$a, df$b, df$c, df$x)   98.6us  112.7us   8169.      2.49KB

Great answer! I racked my brain but could not solve this question. Could you tell me how you learned that turning a `loop into a function will improve performance`, and about the seq_along method? From a book or other resources?
The seq_along isn't anything special, but I have seen enough posts here to know users sometimes prefer it to 2:length(x) for cases where the length of x is 0, since seq_along(x)[-1] wouldn't produce an error. As for the second, see this classic post: stackoverflow.com/questions/2908822/…
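To illustrate the point about seq_along, here is a small sketch of how the two index constructions behave on short vectors. The helper `count_iterations` is a name made up for this example.

```r
empty <- numeric(0)
single <- 1000

# 2:length(x) counts DOWN when x has fewer than two elements.
idx_colon_empty  <- 2:length(empty)    # 2 1 0
idx_colon_single <- 2:length(single)   # 2 1

# seq_along(x)[-1] is empty in both cases, so a loop over it does nothing.
idx_seq_empty  <- seq_along(empty)[-1]   # integer(0)
idx_seq_single <- seq_along(single)[-1]  # integer(0)

# Hypothetical helper: how many times would a for loop run over idx?
count_iterations <- function(idx) {
  n <- 0L
  for (i in idx) n <- n + 1L
  n
}

runs_colon <- count_iterations(idx_colon_empty)  # loop body would run 3 times
runs_seq   <- count_iterations(idx_seq_empty)    # loop body would run 0 times
```

With `2:length(x)`, the loop body would run with indices such as 0 or beyond the end of the vector, which is where errors or silent wrong results come from.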
