
Toy data.table

Consider this data.table:

library(pacman)
p_load(data.table, magrittr, dplyr, glue)

dt <- data.table(x = c(1,3,4,5,8,12,13,20,21,25),
                 y = c(1,1,2,2,8,2,4,6,5,5),
                 keep.rownames = T)
dt[, newval := NA_real_]
dt[, rn := as.integer(rownames(dt))]
dt[1, newval := y]
dt[, x_pre := shift(x, n = 1)]
dt[, x_nxt := shift(x, n = -1)]
setcolorder(dt, "rn")
dt[]
#>     rn  x y newval x_pre x_nxt
#>  1:  1  1 1      1    NA     3
#>  2:  2  3 1     NA     1     4
#>  3:  3  4 2     NA     3     5
#>  4:  4  5 2     NA     4     8
#>  5:  5  8 8     NA     5    12
#>  6:  6 12 2     NA     8    13
#>  7:  7 13 4     NA    12    20
#>  8:  8 20 6     NA    13    21
#>  9:  9 21 5     NA    20    25
#> 10: 10 25 5     NA    21    NA
# note: the last two columns are simply the shifted values of x

Use of a for loop

The following is an inefficient function that uses an R for loop to bootstrap a data.table column. In effect, each interior newval[i] is a linear interpolation at x[i] between the previously computed newval[i-1] (at x[i-1]) and the raw y[i+1] (at x[i+1]), which is why each row depends on the result of the previous one.

# function using a for loop on each observation
func_loop <- function(dt){
  # for loop updating the newval column iteratively
  for(i in seq_len(nrow(dt))[-c((nrow(dt) - c(0:1)))]){
    dt[i + 2, newval := y]  # temporary value to be erased later
    dt[, new_pre := shift(newval, n = 1)]
    dt[, new_nxt := shift(newval, n = -1)]
    # the following line uses the previously computed value (new_pre)
    dt[rn > 1, newval := ifelse(rn == i + 1,
                                new_pre + (new_nxt - new_pre) * (x - x_pre) / (x_nxt - x_pre),
                                newval)]
    dt[rn == i + 2, newval := NA_real_]
  }
  dt
}
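To make the dependency concrete, here is the recurrence checked by hand for the first interior row of the toy data (a quick sanity check against the output shown further below):

# newval for row 2 is an interpolation at x = 3 between
# (x = 1, newval = 1) and (x = 4, y = 2)
1 + (2 - 1) * (3 - 1) / (4 - 1)
#> [1] 1.666667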

Call the for-loop function

# call the function
func_loop(dt)[]
#>     rn  x y   newval x_pre x_nxt  new_pre  new_nxt
#>  1:  1  1 1 1.000000    NA     3       NA 1.666667
#>  2:  2  3 1 1.666667     1     4 1.000000 1.833333
#>  3:  3  4 2 1.833333     3     5 1.666667 3.375000
#>  4:  4  5 2 3.375000     4     8 1.833333 2.785714
#>  5:  5  8 8 2.785714     5    12 3.375000 3.757143
#>  6:  6 12 2 3.757143     8    13 2.785714 4.037500
#>  7:  7 13 4 4.037500    12    20 3.757143 4.879688
#>  8:  8 20 6 4.879688    13    21 4.037500       NA
#>  9:  9 21 5 4.903750    20    25 4.879688 5.000000
#> 10: 10 25 5       NA    21    NA       NA       NA

# benchmark the speed
microbenchmark::microbenchmark(func_loop(dt))
#> Unit: milliseconds
#>           expr      min       lq     mean   median       uq      max neval
#>  func_loop(dt) 23.00165 24.24735 26.19917 25.11379 27.11327 39.43801   100

Created on 2022-07-19 by the reprex package (v2.0.1)

As expected, this is terribly inefficient: about 30 msec for just 10 rows, which extrapolates to roughly 50 minutes for a million rows, and I have several million rows to compute.

I am aware of the froll* family and use it extensively, but here I am unable to apply frollapply, since this algorithm depends on previously computed values.

I have also tried data.table::set, and that doesn't drastically reduce the time, because the loop still calls dt[...] repeatedly, which is an expensive call. See Henrik's comments below.
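To illustrate where the time goes, here is a sketch of the same recurrence written against plain vectors pulled out of the data.table (this is my reading of the loop above, not production code); it is still an R-level loop, but it removes the repeated dt[...] calls:

# same recurrence, looping over plain vectors and writing the
# result back once with set(); assumes at least 3 rows
func_vec <- function(dt){
  x <- dt$x
  y <- dt$y
  newval <- rep(NA_real_, length(x))
  newval[1] <- y[1]
  for(i in 2:(length(x) - 1)){
    newval[i] <- newval[i - 1] +
      (y[i + 1] - newval[i - 1]) * (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1])
  }
  set(dt, j = "newval", value = newval)
  dt
}

Even so, a compiled solution should still be considerably faster on millions of rows.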

I am looking to improve the performance by orders of magnitude, not just 20 or 40%. With a good vectorised algorithm I would expect 1/10th to 1/50th of the current run time.

  • In your code any benefit of using set is lost because you call [.data.table (dt[...]) in each iteration, which has a large overhead. I think your post would benefit from an explanation in words of what you try to achieve and a demonstration on a small (10-ish rows) data set. Then it's much easier for people to try out alternatives, which can be applied to larger data. That said, it seems like you want to do calculations on a moving window (rn + 1:10). If so, shift may be an alternative. E.g. sum the two next (leading) values in a vector: Reduce(`+`, shift(v, 1:2, type = "lead")) (a runnable version of this snippet is shown after these comments). Commented Jul 16, 2022 at 10:24
  • I just saw that what I wrote above is also described in ?set: "[with set] overhead of [.data.table is avoided ... However, normally, we call [.data.table once on large data, not many times on small data.", which unfortunately happens in your loop (and in the various *apply "loops" in the current answer). Commented Jul 16, 2022 at 11:17
  • Now that @Henrik explained the limitations of set(), it sounds a bit like a (possible) XY-problem. You discard the frollsum/frollapply family, but this is exactly what gives you the major performance boost you are looking for (at least in this toy example). On my system, the above for loop takes 1.2 seconds, and the frollsum alternative takes 400 microseconds. Can you elaborate more on your production data, and why a froll-approach is not feasible? Commented Jul 16, 2022 at 11:17
  • What I mean is: why should dt[, newval2 := shift(frollsum(value < 90, n = 10, align = "left"), n = 1, type = "lead")] not work? Runtime: 0.000478 seconds. Looks like there is a lot to be gained when able to code towards the froll-approach. Commented Jul 16, 2022 at 11:27
  • I have made changes to the code to make it closer to the production code. You will see why frollapply is not feasible here (or at least I am unable to utilise it). I need the values to be computed iteratively, using the previous values. In frollapply, or any vector solution such as shift, all values are computed simultaneously. I am posting the updated question now. That said, frollapply / shift based solutions are also welcome, as ideally they will be the fastest. Commented Jul 16, 2022 at 15:42
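For reference, here is a minimal, runnable version of the shift snippet from Henrik's first comment, on a small vector:

library(data.table)
v <- c(1, 1, 2, 2, 8, 2, 4, 6, 5, 5)
# sum of the two next (leading) values at each position
Reduce(`+`, shift(v, 1:2, type = "lead"))
#> [1]  3  4 10 10  6 10 11 10 NA NA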

1 Answer


A simple Rcpp function will be much faster.

library(data.table)

Rcpp::cppFunction(
  "NumericVector iterInterp(const NumericVector& x, const NumericVector& y) {
    const int n = x.size();
    NumericVector newval(n);
    newval(0) = y(0);
    newval(n - 1) = NA_REAL;
    for (int i = 1; i < n - 1; i++) {
      newval(i) = newval(i - 1) + (y(i + 1) - newval(i - 1))*(x(i) - x(i - 1))/(x(i + 1) - x(i - 1));
    }
    return newval;
  }"
)

dt <- data.table(
  x = c(1,3,4,5,8,12,13,20,21,25),
  y = c(1,1,2,2,8,2,4,6,5,5)
)

microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: microseconds
#>        expr   min    lq    mean median    uq   max neval
#>  iterInterp 153.5 156.9 164.894  159.7 163.8 391.8   100

dt
#>     x y   newval
#> 1   1 1 1.000000
#> 2   3 1 1.666667
#> 3   4 2 1.833333
#> 4   5 2 3.375000
#> 5   8 8 2.785714
#> 6  12 2 3.757143
#> 7  13 4 4.037500
#> 8  20 6 4.879688
#> 9  21 5 4.903750
#> 10 25 5       NA

Naively extrapolating the ~160 microsecond median, that comes out to under 3 minutes for 10M rows, but most of that time is fixed overhead that does not scale with the size of the data.table, as shown by benchmarking:

dt <- data.table(
  x = rep(c(1,3,4,5,8,12,13,20,21,25), 1e6) + 25*rep(0:(1e6 - 1L), each = 10),
  y = rep(c(1,1,2,2,8,2,4,6,5,5), 1e6)
)

microbenchmark::microbenchmark(iterInterp = dt[, newval := iterInterp(x, y)])
#> Unit: milliseconds
#>        expr     min       lq     mean   median       uq     max neval
#>  iterInterp 157.585 159.0541 178.3298 168.0882 172.2245 274.102   100

That's a fraction of a second for 10M rows.
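If the production data has to be processed per group, the same function should drop straight into a grouped := call. Here grp is a hypothetical grouping column added purely for illustration (it splits the 10M-row example into blocks of 10 rows):

# hypothetical grouping column, then interpolate independently within each group
dt[, grp := rep(seq_len(1e6), each = 10)]
dt[, newval := iterInterp(x, y), by = grp]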


7 Comments

Rcpp would also have been my next recommendation when solving this with data.table isn't straightforward. +1
Terrific speed! This is enough motivation to learn C++. By the way, my production code needs R's approxfun(). By any chance is there a C++ replacement for that function? Anyway, I have accepted your solution @jblood94. Also thanks @Waldi for your efforts and thanks @Henrik for the immediate red flag on the use of an inefficient set.
@LazarusThurston, an example of integrating approxfun in Rcpp: stackoverflow.com/q/29463754/13513328
@LazarusThurston I'm curious as to why approxfun would be required, since it is simply performing linear interpolation, for which the formula (which you used in your question) is straightforward.
You are right, I think I can live with the existing code instead of approxfun. However, I do need to replace y(i+1) with the k-point rolling median of y(i+n) (the reprex used a simplistic value). I searched and found qsort as the C function for sorting, from which one can take the centered value as the median, but I can't make it work in my production code with Rcpp. Any help would be appreciated.
