0

I am using the R programming language. I am trying to learn how to add "progress bars" to estimate how much time is remaining while a function is running (https://www.rdocumentation.org/packages/progress/versions/1.2.2/topics/progress_bar).

For example:

library(progress) pb <- progress_bar$new(total = 100) for (i in 1:100) { pb$tick() Sys.sleep(1 / 100) } 

enter image description here

Suppose I have a function called "grid_function" and a dataset called "DF_1". I am taking each individual row from "DF_1" and feeding this row into "grid_function". "grid_function" performs some calculations using this row, and stores it into a "list" called "resultdf1". Finally, "resultdf1" is converted into a data frame called "final_output". The "feeding process" can be seen below:

resultdf1 <- apply(DF_1,1, # 1 means rows FUN=function(x){ do.call( # Call Function grid_function with the arguments in # a list grid_function, # force list type for the arguments c(list(train_data_new), as.list( # make the row to a named vector unlist(x) ) )) } ) l = resultdf1 final_output = rbindlist(l, fill = TRUE) 

Question: I would like to add a "progress bar" to the above code.

What I tried: I tried to do this as follows:

library(doParallel) library(future) cl <- makePSOCKcluster(6) # 6 cpu cores out of 8 registerDoParallel(cl) pb <- progress_bar$new(total = 100) for (i in 1:100) { resultdf1 <- apply(DF_1,1, # 1 means rows FUN=function(x){ do.call( # Call Function grid_function2 with the arguments in # a list grid_function, # force list type for the arguments c(list(train_data_new), as.list( # make the row to a named vector unlist(x) ) )) } ) l = resultdf1 final_output = rbindlist(l, fill = TRUE) pb$tick() Sys.sleep(1 / 100) } stopCluster(cl) 

This appears to be working, but I am not sure if I did everything correctly. Can someone please tell me if I have done this correctly? Is there any chance that adding this "progress bar" will actually result in the function taking more time to run?

enter image description here

Thanks

UPDATE: @Serkan : Here is the full code:

library(dplyr) library(data.table) results_table <- data.frame() grid_function <- function(train_data, random_1, random_2, random_3, random_4, split_1, split_2, split_3) { #bin data according to random criteria train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c"))) train_data$cat = as.factor(train_data$cat) #new splits a_table = train_data %>% filter(cat == "a") %>% select(a1, b1, c1, cat) b_table = train_data %>% filter(cat == "b") %>% select(a1, b1, c1, cat) c_table = train_data %>% filter(cat == "c") %>% select(a1, b1, c1, cat) #calculate random quantile ("quant") for each bin table_a = data.frame(a_table%>% group_by(cat) %>% mutate(quant = quantile(c1, prob = split_1))) table_b = data.frame(b_table%>% group_by(cat) %>% mutate(quant = quantile(c1, prob = split_2))) table_c = data.frame(c_table%>% group_by(cat) %>% mutate(quant = quantile(c1, prob = split_3))) #create a new variable ("diff") that measures if the quantile is bigger tha the value of "c1" table_a$diff = ifelse(table_a$quant > table_a$c1,1,0) table_b$diff = ifelse(table_b$quant > table_b$c1,1,0) table_c$diff = ifelse(table_c$quant > table_c$c1,1,0) #group all tables final_table = rbind(table_a, table_b, table_c) #create a table: for each bin, calculate the average of "diff" final_table_2 = data.frame(final_table %>% group_by(cat) %>% summarize( mean = mean(diff) )) #add "total mean" to this table final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff))) #format this table: add the random criteria to this table for reference final_table_2$random_1 = random_1 final_table_2$random_2 = random_2 final_table_2$random_3 = random_3 final_table_2$random_4 = random_4 final_table_2$split_1 = split_1 final_table_2$split_2 = split_2 final_table_2$split_3 = split_3 results_table <- rbind(results_table, final_table_2) final_results = dcast(setDT(results_table), random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean') } # create some data for this example a1 = rnorm(1000,100,10) b1 = rnorm(1000,100,5) c1 = sample.int(1000, 1000, replace = TRUE) train_data = data.frame(a1,b1,c1) #grid random_1 <- seq(80,100,5) random_2 <- seq(85,120,5) random_3 <- seq(85,120,5) random_4 <- seq(90,120,5) split_1 = seq(0,1,0.1) split_2 = seq(0,1,0.1) split_3 = seq(0,1,0.1) DF_1 <- expand.grid(random_1 , random_2, random_3, random_4, split_1, split_2, split_3) #reduce the size of the grid for this example DF_1 = DF_1[1:100,] colnames(DF_1) <- c("random_1" , "random_2", "random_3", "random_4", "split_1", "split_2", "split_3") train_data_new <- copy(train_data) resultdf1 <- apply(DF_1,1, # 1 means rows FUN=function(x){ do.call( # Call Function grid_function2 with the arguments in # a list grid_function, # force list type for the arguments c(list(train_data_new), as.list( # make the row to a named vector unlist(x) ) )) } ) l = resultdf1 final_output = rbindlist(l, fill = TRUE) 

And here is how the final output should look like:

head(final_output) random_1 random_2 random_3 random_4 split_1 split_2 split_3 b c total a 1: 80 85 85 90 0 0 0 0 0 0 NA 2: 85 85 85 90 0 0 0 0 0 0 NA 3: 90 85 85 90 0 0 0 0 0 0 NA 4: 95 85 85 90 0 0 0 0 0 0 NA 5: 100 85 85 90 0 0 0 0 0 0 NA 6: 80 90 85 90 0 0 0 0 0 0 NA 
2
  • 2
    Is there any reason to believe you've done this incorrectly? "Does this look correct" isn't really a specific programming question that's a good fit for Stack Overflow. Either your code does what you want or it does not. If you are concerned with performance, start by profiling your own code: adv-r.hadley.nz/perf-measure.html. There are lots of ways to time code to see how long it's taking. You should be able to answer your own question about speed that way. Commented Jul 30, 2021 at 23:29
  • @ MrFlick : thank you for your reply! I am still learning about programming - many times my code will run without errors, but I am still very skeptical that it might not be running as intended. thanks again! Commented Jul 31, 2021 at 3:18

1 Answer 1

3

I would opt for .progress = TRUE in furrr:future_map, an example R Script below;

library(furrr) library(magrittr) plan(multisession, workers = 2) 1:10 %>% future_map( ~ Sys.sleep(2), .progress = TRUE ) 

It adds a progress-bar automatically,

Progress: ─────────────────────────────────────── 100% 

Without your data.frame and specific function, I cant replicate your problem to show you how it is implemented, but the apply-family are easily translated to the map-family of functions.

Sign up to request clarification or add additional context in comments.

5 Comments

Thank you so much for your answer! I updated my question and included the full code start-to-finish. can you please show me how to implement this answer on my code? I am not sure where I would insert my code in your answer! thank you so much!
I dont quite see it? Remember to include your desired output, and only a subset of you data. No need to run your full R Script
thank you for your reply! please check again. thank you so much!
@ Serkan: did you get a chance to look at it?
Yes I did, but your code is too long - so I kinda lost track. See this stackoverflow.com/questions/68479928/…. Here I show how for-loop can be replaced by map. If it doesnt answer your question, then reduce your problem and mimick it, then I can show you as well :-)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.