2

I need to create a vector with all numbers within ranges defined in a table. For example the rows 23:25 and 34:39 shall become the single vector c(23, 24, 25, 34, 35, 36, 37, 38, 39)

The MWE below does this, but is way too slow. I need to do it for n.rows of 15,000,000 or higher.

row.references is the input. row.references.long is the wanted output.

What is a better code to do this?

library(data.table) # Create example data n.rows <- 1000 row.references <- data.table(start.number=floor(runif(n=n.rows, min=1, max=100)), steps=floor(runif(n=n.rows, min=1, max=50))) row.references[, end.number:=start.number+steps] row.references[, steps:=NULL] row.references.long <- NULL # The too-slow code for (i in 1:nrow(row.references)) { row.references.long <- rbind(row.references.long, data.table(row.references[i, start.number]:row.references[i, end.number])) } 

I suppose data.table is the way to go.

4 Answers 4

9

For some reason, this still strikes me as overthinking things. I'm not sure if there's a big disadvantage to using by = 1:nrow(indt), but that gives me good performance.

My suggestion for "data.table" would simply be:

row.references[, list(V1 = start.number:end.number), by = 1:nrow(row.references)]$V1 

And for base R, would be:

unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE) 

This latter one is similar to Roland's approach, but just uses : and unlist instead of do.call(c, ...)


Benchmarks

Here's your sample data:

library(data.table) set.seed(1) n.rows <- 1000 row.references <- data.table(start.number=floor(runif(n=n.rows, min=1, max=100)), steps=floor(runif(n=n.rows, min=1, max=50))) row.references[, end.number:=start.number+steps] row.references[, steps:=NULL] 

Here are a few functions to try out:

AM1 <- function() { unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE) } AM2 <- function() { row.references[, list(V1 = start.number:end.number), by = 1:nrow(row.references)]$V1 } roland1 <- function() { do.call(c, mapply(seq, row.references[["start.number"]], row.references[["end.number"]], MoreArgs = list(by = 1))) } roland2 <- function(indt = copy(row.references)) { indt[, lengths := end.number - start.number + 1] res <- indt[, .(V1 = rep(as.integer(start.number) - 1L, times = lengths))] res[, V1 := V1 + seq_along(V1), by = rep(seq_len(nrow(indt)), indt[["lengths"]])]$V1 } jaap <- function(indt = copy(row.references)) { indt[, `:=` (idx=.I)][, .(var = seq(start.number,end.number)), by = idx]$var } 

Check that they are all equal:

sapply(c(quote(AM2()), quote(roland1()), quote(roland2()), quote(jaap())), function(x) all.equal(AM1(), eval(x))) # [1] TRUE TRUE TRUE TRUE 

Now, make some bigger data:

# Make the data bigger -- 2.5 million rows row.references <- rbindlist(replicate(2500, row.references, FALSE)) dim(row.references) 

Test out the timings:

system.time(AM1()) # user system elapsed # 6.936 0.000 6.845 system.time(AM2()) # user system elapsed # 2.480 0.212 2.800 system.time(roland1()) # user system elapsed # 64.932 0.000 63.525 system.time(roland2()) # user system elapsed # 3.488 0.000 2.434 system.time(jaap()) # user system elapsed # 14.068 0.000 13.643 

It seems like roland2 and AM2 are viable alternatives. Even if this "microbenchmark" is a bit off, I feel AM2 trumps in readability:

library(microbenchmark) microbenchmark(AM2(), roland2(), times = 20) # Unit: seconds # expr min lq mean median uq max neval # AM2() 2.202286 2.236027 2.323602 2.320230 2.394856 2.477074 20 # roland2() 2.314997 2.428790 2.502338 2.477764 2.589151 2.700195 20 
Sign up to request clarification or add additional context in comments.

2 Comments

Nice! I should have been able to come up with the AM2-solution as well :-( (but obviously didn't)
@Ananda Congrats for getting 100K :-) Welcome to the 100kers list.
4

Do not grow an object in a loop. Pre-allocate. Here is a more efficient version of your loop:

res <- do.call(c, mapply(seq, row.references[["start.number"]], row.references[["end.number"]], MoreArgs = list(by = 1))) all.equal(res, row.references.long[[1]]) #[1] TRUE 

Here is another option. Benchmark to see if it is faster.

row.references[, lengths := end.number - start.number + 1] res <- row.references[, .(V1 = rep(as.integer(start.number) - 1L, times = lengths))] res[, V1 := V1 + seq_along(V1), by = rep(seq_len(nrow(row.references)), row.references[["lengths"]])] all.equal(res, row.references.long) #[1] TRUE 

However, I would do this in compiled code, i.e., with Rcpp.

5 Comments

Why not unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE)?
@AnandaMahto Sure, why not.
I suspect that unlist would be faster than do.call(c, ...).
@AnandaMahto it is indeed faster, see the benchmark in my answer
@Jaap, it won't scale as well. See my answer.
4

As @Roland mentioned, there is no need for a for-loop. You can do this entirely in data.table with the help of an index (idx) column:

set.seed(12) row.ref <- data.table(start.number=floor(runif(n=n.rows, min=1, max=100)), steps=floor(runif(n=n.rows, min=1, max=50))) row.ref[, `:=` (end.number=start.number+steps, idx=.I)] row.ref.l <- row.ref[, .(var = seq(start.number,end.number)), by = idx][, idx:=NULL] 

which results in:

> head(row.ref,3) start.number steps end.number idx 1: 7 3 10 1 2: 81 8 89 2 3: 94 40 134 3 > head(row.ref.l,15) var 1: 7 2: 8 3: 9 4: 10 5: 81 6: 82 7: 83 8: 84 9: 85 10: 86 11: 87 12: 88 13: 89 14: 94 15: 95 

A benchmark of the several proposed solutions:

microbenchmark(jaap = row.references[, .(var = seq(start.number,end.number)), by = idx], roland1 = do.call(c, mapply(seq,row.references[["start.number"]],row.references[["end.number"]],MoreArgs = list(by = 1))), roland2 = row.references[, lengths := end.number - start.number + 1][, .(V1 = rep(as.integer(start.number) - 1L, times = lengths))][, V1 := V1 + seq_along(V1), by = rep(seq_len(nrow(row.references)), row.references[["lengths"]])], ananda = data.table(unlist(mapply(":", row.references$start.number, row.references$end.number), use.names = FALSE)), times = 100, unit = "relative") 

which gives:

Unit: relative expr min lq mean median uq max neval cld jaap 5.517431 5.358356 5.168787 5.183828 5.164292 2.157907 100 c roland1 19.023350 18.029892 16.897831 17.475423 17.111857 5.682705 100 d roland2 2.051662 2.015041 1.912261 1.964078 1.957416 1.277075 100 b ananda 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a 

1 Comment

It's important to note that this is still a loop with nrow(row.ref) calls to an R function (with all the overhead of an interpreted language).
1

For one way to avoid explicit iteration, create a sequence spanning the entire range (use a numeric vector to avoid integer overflow), then correct elements by the offset required to make the elements corresponding to the start of each sequence, equal to the start of the sequence.

f <- function(start, step) { res <- seq(1, sum(step + 1), by=1) offset <- start - c(0, cumsum(step + 1)[-length(step)]) - 1L res + rep(offset, step + 1) } 

2 Comments

On a similar note (exchanging a lapply for less code): steps = abs(start - stop) + 1; rep(start, steps) + sequence(steps) - 1
@alexis_laz I didn't know about sequence(); it seems like a nice, reasonably performant solution, even if it involves iteration (lapply) under the hood.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.