
What is the idiomatic way to collect results in a loop in R if the number of final results is not known beforehand? Here's a toy example:

results = vector('integer')
i = 1L
while (i < bigBigBIGNumber) {
    if (someCondition(i))
        results = c(results, i)
    i = i + 1
}
results

The problem with this example is that (I assume) it will have quadratic complexity as the vector needs to be re-allocated at every append. (Is this correct?) I'm looking for a solution that avoids this.
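For what it's worth, here is a rough timing sketch of that quadratic growth (append_c is just an illustrative helper, not part of the question); doubling n should roughly quadruple the elapsed time if each c() call copies the whole vector:

# Illustrative only: repeatedly append with c() and time it.
append_c <- function(n) {
    results <- vector("integer")
    for (i in seq_len(n))
        results <- c(results, i)   # copies the whole vector each time
    results
}
system.time(append_c(10000))
system.time(append_c(20000))   # expect roughly 4x the time of the previous call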

I found Filter, but it requires pre-generating 1:bigBigBIGNumber which I want to avoid to conserve memory. (Question: does for (i in 1:N) also pre-generate 1:N and keep it in memory?)
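For reference, the Filter version I have in mind would be something like the sketch below (it reuses the question's placeholder names someCondition and bigBigBIGNumber), and it materialises the whole sequence up front:

# Sketch only: Filter keeps the elements for which the predicate is TRUE,
# but 1:bigBigBIGNumber has to exist in memory first.
results <- Filter(someCondition, 1:bigBigBIGNumber)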

I could make something like a linked list like this:

results = list()
i = 1L
while (i < bigBigBIGNumber) {
    if (someCondition(i))
        results = list(results, i)
    i = i + 1
}
unlist(results)

(Note that this is not concatenation. It's building a structure like list(list(list(1),2),3), then flattening with unlist.)
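To make the shape concrete, here is a tiny illustrative example of what three such "appends" build and how unlist flattens it:

nested <- list(list(list(1), 2), 3)   # what three appends produce
str(nested)                           # one extra level of nesting per append
unlist(nested)                        # flattens back to c(1, 2, 3)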

Is there a better way than this? What is the idiomatic way that's typically used? (I am very new to R.) I'm looking for suggestions on how to tackle this type of problem. Suggestions about both compact (easy to write) and fast code are most welcome! (But I'd like to focus on fast and memory-efficient code.)

  • The c function is used to extend either vectors or lists. If you can estimate the size then allocating with vector("integer", size) will help reduce the cost of extending. Commented May 12, 2013 at 22:25
  • @DWin Are there existing tools which extend the array in a smart way, on demand? (E.g. double the size of the preallocated array once its capacity has been reached, and avoid quadratic complexity) Commented May 12, 2013 at 22:28
  • @Szabolcs, why do you think replacing c with list would help here? Unless you preallocate a list, the same problem persists, doesn't it? Commented May 12, 2013 at 22:36
  • @Arun Notice that c concatenates (which triggers a re-allocation of the vector) while with list I am building a structure like list(list(list(1),2),3), a linked list. When put in a loop, the latter has linear complexity, the former has a quadratic one. You can easily verify this by a small benchmark: doubling the number of elements to be appended doubles the time with list, and nearly quadruples it with c. This means that for a "large enough" result, the list approach will always be faster. What I did not realize when I wrote the question is that in this case "large enough" ... Commented May 12, 2013 at 22:41
  • For now, I hope this answer might partially help, in that R lists are like hash map data structures. Commented May 12, 2013 at 23:01

4 Answers


Here is an algorithm that doubles the size of the output list as it fills up, achieving roughly linear computation times, as the benchmark tests below show:

test <- function(bigBigBIGNumber = 1000) {
    n <- 10L
    results <- vector("list", n)
    m <- 0L
    i <- 1L
    while (i < bigBigBIGNumber) {
        if (runif(1) > 0.5) {
            m <- m + 1L
            results[[m]] <- i
            if (m == n) {
                results <- c(results, vector("list", n))
                n <- n * 2L
            }
        }
        i = i + 1L
    }
    unlist(results)
}

system.time(test(1000))
#    user  system elapsed
#   0.008   0.000   0.008
system.time(test(10000))
#    user  system elapsed
#   0.090   0.002   0.093
system.time(test(100000))
#    user  system elapsed
#   0.885   0.051   0.936
system.time(test(1000000))
#    user  system elapsed
#   9.428   0.339   9.776

2 Comments

Thanks, this is very practical, so I'll accept it, but the other answers/comments helped too in understanding what people consider idiomatic in R.
I guess the linearity is really overhead in the loop (generating random numbers, assigning results, etc); the time for growing is equal to just (e.g., for 2^20 elements) system.time({ x = integer(1); for (i in 1:19) x <- c(x, integer(2^i)) }) (a fraction of a second).

Presumably there's a maximum size you're willing to tolerate; pre-allocate and fill up to that level, then trim if necessary. This avoids the risk of not being able to satisfy the request to double in size, even when only a small additional amount of memory might be required; it fails early, and involves only one rather than log(n) re-allocations. Here's a function that takes a maximum size, a generating function, and a token that the generating function returns when there is nothing left to generate. We get up to n results before returning.

filln <- function(n, FUN, ..., RESULT_TYPE = "numeric", DONE_TOKEN = NA_real_) {
    results <- vector(RESULT_TYPE, n)
    i <- 0L
    while (i < n) {
        ans <- FUN(..., DONE_TOKEN = DONE_TOKEN)
        if (identical(ans, DONE_TOKEN))
            break
        i <- i + 1L
        results[[i]] <- ans
    }
    if (i == n)
        warning("intolerably large result")
    else
        length(results) <- i
    results
}

Here's a generator

fun <- function(thresh, DONE_TOKEN) {
    x <- rnorm(1)
    if (x > thresh) DONE_TOKEN else x
}

and in action

> set.seed(123L); length(filln(10000, fun, 3))
[1] 163
> set.seed(123L); length(filln(10000, fun, 4))
[1] 10000
Warning message:
In filln(10000, fun, 4) : intolerably large result
> set.seed(123L); length(filln(100000, fun, 4))
[1] 23101

We can benchmark the overhead, approximately, by comparing to something that knows in advance how much space is required

f1 <- function(n, FUN, ...) {
    i <- 0L
    result <- numeric(n)
    while (i < n) {
        i <- i + 1L
        result[i] <- FUN(...)
    }
    result
}

Here we check the timing and value of a single result

> set.seed(123L); system.time(res0 <- filln(100000, fun, 4))
   user  system elapsed
  0.944   0.000   0.948
> set.seed(123L); system.time(res1 <- f1(23101, fun, 4))
   user  system elapsed
  0.688   0.000   0.689
> identical(res0, res1)
[1] TRUE

which for this example is of course eclipsed by the simple vector solution(s)

set.seed(123L); system.time(res2 <- rnorm(23101))
identical(res0, res2)

Comments


If you can't compute 1:bigBigBIGNumber, count the entries, create the vector, then populate it.

num <- 0L
i <- 0L
while (i < bigBigBIGNumber) {
    if (someCondition(i))
        num <- num + 1L
    i <- i + 1L
}

result <- integer(num)
num <- 0L
i <- 0L                        # reset the index for the second pass
while (i < bigBigBIGNumber) {
    if (someCondition(i)) {
        num <- num + 1L        # increment first, so indexing starts at 1
        result[num] <- i
    }
    i <- i + 1L
}

(This code is not tested.)

If you can compute 1:bigBigBIGNumber, this will also work:

I assume that you want to call a function, and not simply tack on the indices themselves. Something like this may be closer to what you want:

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)

2 Comments

+1 for pointing out the potential value of vectorization in values[someCondition(values)]
The problem being, of course, that someCondition(i) is calculated twice. If this is a complex calculation, then doing it twice may eat up any performance gains. I have a question along these lines, any thoughts would be appreciated.

Closer to the second one you listed:

results <- list()
for (i in ...) {
    ...
    results[[i]] <- ...
}

Note that i does not need to be an integer; it could be a character, etc.

Also, you can append with results[[length(results) + 1]] <- ... if needed, but if you already have an iterator, you probably won't need to.
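A minimal sketch of that appending pattern with a concrete stand-in condition (the %% 7 test and the result name are illustrative only):

# Illustrative sketch: grow a list one element at a time.
results <- list()
for (i in 1:100) {
    if (i %% 7 == 0)                          # stand-in for someCondition(i)
        results[[length(results) + 1]] <- i   # append at the end
}
unlist(results)   # 7, 14, 21, ...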

10 Comments

Does this solve the two problems I asked about, i.e. 1. does it pre-generate all the values to iterate through (I don't want to keep all of them in memory) and 2. does appending make it of quadratic complexity, i.e. does appending as results[[i]] <- ... cause the whole list to be re-allocated?
Some benchmarking shows that it fails on both point 1. and point 2. from my comment. However, it also shows that this way of appending to a list is faster than the linked list approach I tried for up to a pretty large length (more than 100000), and also that looping in R this way is so slow that I'll usually "run out of time" before I'd run out of memory when using for.
If efficiency is of extreme concern, you probably want to look outside of base R. rcpp comes to mind. cran.r-project.org/web/packages/Rcpp/index.html
Any assignment will have "quadratic complexity" in R. There is always at least one temporary copy, sometimes more, made before the assignment is completed. The advice given is to have at least 3 times the RAM needed to hold the largest object in your workspace. (Consumed at a rate of roughly 10 bytes per 'double'.)
@DWin Do you mean that even x=1:10; x[[3]] <- 10 re-allocates the complete array? Surely it won't do this, as you suggested pre-allocation yourself in your other comment. The quadratic complexity is not of the assignment, but of repeated appending n times. (That will take O(n^2) time.)
