Toy data.table
Consider this data.table
library(pacman) p_load(data.table,magrittr,dplyr,glue) dt <- data.table(x = c(1,3,4,5,8,12,13,20,21,25), y = c(1,1,2,2,8,2,4,6,5,5),keep.rownames = T) dt[,newval:=NA_real_] dt[,rn:=as.integer(rownames(dt))] dt[1,newval:=y] dt[,x_pre := shift(x,n = 1)] dt[,x_nxt := shift(x,n = -1)] setcolorder(dt,"rn") dt[] #> rn x y newval x_pre x_nxt #> 1: 1 1 1 1 NA 3 #> 2: 2 3 1 NA 1 4 #> 3: 3 4 2 NA 3 5 #> 4: 4 5 2 NA 4 8 #> 5: 5 8 8 NA 5 12 #> 6: 6 12 2 NA 8 13 #> 7: 7 13 4 NA 12 20 #> 8: 8 20 6 NA 13 21 #> 9: 9 21 5 NA 20 25 #> 10: 10 25 5 NA 21 NA # note the last 2 columns are simply the shifted values of x Use of for loop
The following is an inefficient function using for loop in R for bootstrapping a data.tale column.
# function using a for loop on each observation func_loop <- function(dt){ # create a for loop for updating the newval column iteratively for(i in seq_len(nrow(dt))[-c((nrow(dt) - c(0:1)))]){ dt[i + 2,newval:=y] # temporary value to be erased later dt[,new_pre:=shift(newval, n = 1)] dt[,new_nxt:=shift(newval, n = -1)] # the following line of code uses the previously computed value (new_pre) dt[rn > 1,newval:=ifelse(rn==i+1, new_pre + (new_nxt - new_pre)* (x - x_pre) /((x_nxt - x_pre)),newval) ] dt[rn==i+2,newval:=NA_real_] } dt } Call the for - loop function
# call the function func_loop(dt)[] #> rn x y newval x_pre x_nxt new_pre new_nxt #> 1: 1 1 1 1.000000 NA 3 NA 1.666667 #> 2: 2 3 1 1.666667 1 4 1.000000 1.833333 #> 3: 3 4 2 1.833333 3 5 1.666667 3.375000 #> 4: 4 5 2 3.375000 4 8 1.833333 2.785714 #> 5: 5 8 8 2.785714 5 12 3.375000 3.757143 #> 6: 6 12 2 3.757143 8 13 2.785714 4.037500 #> 7: 7 13 4 4.037500 12 20 3.757143 4.879688 #> 8: 8 20 6 4.879688 13 21 4.037500 NA #> 9: 9 21 5 4.903750 20 25 4.879688 5.000000 #> 10: 10 25 5 NA 21 NA NA NA # benchmark the speed microbenchmark::microbenchmark(func_loop(dt)) #> Unit: milliseconds #> expr min lq mean median uq max neval #> func_loop(dt) 23.00165 24.24735 26.19917 25.11379 27.11327 39.43801 100 Created on 2022-07-19 by the reprex package (v2.0.1)
Expectedly this gives a terrible efficiency of 30 msec for 10 rows, which means for a million rows it will take 50 minutes. I have several million rows to be computed.
I am aware of froll* series and use them extensively but here I am unable to apply frollapply, since this algo has a dependency on previous computation.
I have tried data.table::set also and that does'nt reduce drastically the time due to the fact that we have to call dt[] repeatedly which is an expensive call. See Henrik's comments below.
I am looking to improve the performance by several orders of magnitude and not just 20 or 40%. I would expect a 1/10th or 1/50th of the current response times with a good vector algorithm.
setis lost because you call[data.table(dt[) in each iteration, which has a large overhead. I think your post would benefit from an explanation in words what you try to achieve and a demonstration on a small (10-ish rows) data set. Then it's much easier for people to try out alternatives, which can be applied to larger data. That said, it seems like you want to do calculations on a moving window (rn + 1:10). If so,shiftmay be an alternative. E.g. sum the two next (leading) values in a vector:Reduce(`+`, shift(v, 1:2, type = "lead"))?set: "[withset] overhead of[.data.tableis avoided ... However, normally, we call[.data.tableonce on large data, not many times on small data.", which unfortunately happens in your loop (and in the various*apply"loops" in the current answer)set(), it sounds a bit like an (possible) XY-problem. You discard thefrollsum/-apply-family, but this is exactly what gives you the major performance boost you are looking for (at least in this toy-example). Om my system, the aboveforloop takes 1.2 seconds, and thefrollsum-alternative takes 400 micro-seconds. Can you eleborate more on your production-data, and why an froll-approach is not feasible?dt[, newval2 := shift(frollsum(value < 90, n = 10, align = "left"), n = 1, type = "lead")])not work? Runtime:0.000478seconds.. Looks like there is a lot to be gained when able to code towards thefroll-approach.frollapplyis not feasible here (or atleast I am unable to utilise it). I need the values to be iteratively using the previous values. In anfrollapplyor any vector solution e.g.shiftall values are computed simultaneously. I am posting the updated question now. Andfrollapply/` shift` based solutions are also welcome, as ideally they will be the fastest.