For some reason, this still strikes me as overthinking things. I'm not sure whether there's a big disadvantage to grouping with by = 1:nrow(row.references), but it gives me good performance.
My suggestion for "data.table" would simply be:
row.references[, list(V1 = start.number:end.number), by = 1:nrow(row.references)]$V1
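To see what the per-row grouping does, here is a minimal sketch on a two-row table. The name `toy` and its values are made up for illustration; only the data.table idiom itself comes from the line above.

```r
library(data.table)

# Hypothetical two-row input: ranges 1..3 and 5..6
toy <- data.table(start.number = c(1, 5), end.number = c(3, 6))

# Grouping by 1:nrow(toy) puts every row in its own group, so
# start.number:end.number is evaluated once per row on scalar values.
expanded <- toy[, list(V1 = start.number:end.number), by = 1:nrow(toy)]$V1
expanded
# [1] 1 2 3 5 6
```

Because each group holds exactly one row, `:` never sees a vector longer than one, which is what makes the one-liner safe.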
And for base R, it would be:
unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE)
The latter is similar to Roland's approach, but uses : and unlist instead of seq and do.call(c, ...).
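One edge case worth knowing with the base R version: mapply simplifies its result to a matrix when every range happens to have the same length, and unlist leaves a matrix untouched. The sketch below uses made-up vectors (`starts`, `ends`) to show this; passing SIMPLIFY = FALSE is one possible guard, not something the benchmark data below needs.

```r
# Hypothetical inputs: ranges of different lengths (1..3 and 5..6)
starts <- c(1, 5)
ends   <- c(3, 6)

# SIMPLIFY = FALSE always returns a list, so unlist() flattens it
flat <- unlist(mapply(":", starts, ends, SIMPLIFY = FALSE), use.names = FALSE)
flat
# [1] 1 2 3 5 6

# With equal-length ranges (1..3 and 5..7) the default simplifies to a
# 3 x 2 matrix, and unlist() returns that matrix unchanged
simplified <- mapply(":", c(1, 5), c(3, 7))
is.matrix(unlist(simplified))
# [1] TRUE
```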
Benchmarks
Here's your sample data:
    library(data.table)
    set.seed(1)
    n.rows <- 1000
    row.references <- data.table(start.number=floor(runif(n=n.rows, min=1, max=100)),
                                 steps=floor(runif(n=n.rows, min=1, max=50)))
    row.references[, end.number:=start.number+steps]
    row.references[, steps:=NULL]
Here are a few functions to try out:
    AM1 <- function() {
      unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE)
    }

    AM2 <- function() {
      row.references[, list(V1 = start.number:end.number), by = 1:nrow(row.references)]$V1
    }

    roland1 <- function() {
      do.call(c, mapply(seq, row.references[["start.number"]], row.references[["end.number"]], MoreArgs = list(by = 1)))
    }

    roland2 <- function(indt = copy(row.references)) {
      indt[, lengths := end.number - start.number + 1]
      res <- indt[, .(V1 = rep(as.integer(start.number) - 1L, times = lengths))]
      res[, V1 := V1 + seq_along(V1), by = rep(seq_len(nrow(indt)), indt[["lengths"]])]$V1
    }

    jaap <- function(indt = copy(row.references)) {
      indt[, `:=` (idx=.I)][, .(var = seq(start.number,end.number)), by = idx]$var
    }
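roland2 is the least obvious of these, so here is a rough base R sketch of the idea as I read it: repeat each start (minus one) once per element of its range, then add a counter that restarts at 1 within each range. The vectors `starts` and `ends` are made up, and base `sequence()` stands in for the grouped `seq_along` used in the data.table version.

```r
# Hypothetical inputs: ranges 1..3 and 5..6
starts <- c(1L, 5L)
ends   <- c(3L, 6L)

lens    <- ends - starts + 1L              # 3 2
offsets <- rep(starts - 1L, times = lens)  # 0 0 0 4 4
within  <- sequence(lens)                  # 1 2 3 1 2 (restarts per range)

result <- offsets + within
result
# [1] 1 2 3 5 6
```

Everything here is a single vectorised operation over the whole input, which is presumably why this approach stays competitive on large data.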
Check that they are all equal:
    sapply(c(quote(AM2()), quote(roland1()), quote(roland2()), quote(jaap())),
           function(x) all.equal(AM1(), eval(x)))
    # [1] TRUE TRUE TRUE TRUE
Now, make some bigger data:
    # Make the data bigger -- 2.5 million rows
    row.references <- rbindlist(replicate(2500, row.references, FALSE))
    dim(row.references)
Test out the timings:
    system.time(AM1())
    #    user  system elapsed
    #   6.936   0.000   6.845

    system.time(AM2())
    #    user  system elapsed
    #   2.480   0.212   2.800

    system.time(roland1())
    #    user  system elapsed
    #  64.932   0.000  63.525

    system.time(roland2())
    #    user  system elapsed
    #   3.488   0.000   2.434

    system.time(jaap())
    #    user  system elapsed
    #  14.068   0.000  13.643
roland2 and AM2 look like the viable alternatives. Even if this microbenchmark is a bit off, I feel AM2 wins on readability:
    library(microbenchmark)
    microbenchmark(AM2(), roland2(), times = 20)
    # Unit: seconds
    #       expr      min       lq     mean   median       uq      max neval
    #      AM2() 2.202286 2.236027 2.323602 2.320230 2.394856 2.477074    20
    #  roland2() 2.314997 2.428790 2.502338 2.477764 2.589151 2.700195    20