
I have a table cluster (with more than one column):

    head(cluster[,c('cuil_direccion')])
    [1] "PJE INDEA 98 5 "
    [2] "PJE INDE 98 5 "
    [3] "B 34 VIV RECRE 57 00 "
    [4] "S CASA DE GO 600 "
    [5] "RCCA 958 00 o "
    [6] "JUAN B 1900 "

I need to run a function that, for each row, extracts the numbers and pastes them into a list. I'm using str_extract_all. Since the table is huge, I'd like to split the data and use a different core for each split. I tried:

    library(foreach)
    library(doParallel)
    registerDoParallel(cores=detectCores(all.tests=TRUE))

    crea_tabla <- function(x){
      xlst <- split(x, 1:nrow(x))
      pred <- foreach(i = xlst, .combine = rbind) %dopar% {
        library(stringr)
        d<-data.frame(dir='a', E_numdir=1)
        j=1
        DIR<-i$cuil_direccion[j]
        E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
        d<-rbind(d, data.frame( dir=DIR , E_numdir=toString(E_NUMDIR)))
        j=1+j
      }
    }

Then I ran:

    crea_tabla(cluster)

And I get an empty result.

I'm not sure how doParallel handles data. For example, this part:

    library(stringr)
    d<-data.frame(dir='a', E_numdir=1)
    j=1

Should I write it before or after %dopar%?

EDIT

    num_cores<-detectCores(all.tests=TRUE)
    registerDoParallel(cores=detectCores(all.tests=TRUE))

    crea_tabla <- function(x, num_cores){
      xlst <- split(x, 1:nrow(x))
      j=1
      d<-data.frame(dir='a', E_numdir=1)
      pred <- foreach(i = seq_along(xlst), .combine = rbind) %dopar% {
        print(i*num_cores/nrow(x))
        library(stringr)
        DIR<-xlst[[i]]$cuil_direccion
        E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
        data.frame(dir=DIR , E_numdir=toString(E_NUMDIR))
      }
      d <- rbind(d, pred)
      return(d)
    }

    a<-crea_tabla(cluster, num_cores)

1 Answer


There are several things to note. First, you are right to be suspicious about where you initialize variables. Declare them before the loop (there is no point in reloading the library on every iteration). Second, you don't need the j variable. Just use seq_along on your list and index into it.

Next, regarding foreach: you have specified .combine = rbind, so there is no need to call rbind inside the loop. If you want that first row, just rbind the results of the foreach loop to the initial data.frame. The code below accomplishes what it appears you are trying to do.

Lastly, I assume you realize this, but make sure you set up your backend. I don't know which OS you are using but you would need to use another package like doParallel, doMC or doSNOW.

    # recreate your data
    cluster <- read.table(header=F, text='
    "PJE INDEA 98 5 "
    "PJE INDE 98 5 "
    "B 34 VIV RECRE 57 00 "
    "S CASA DE GO 600 "
    "RCCA 958 00 o "
    "JUAN B 1900 "
    ')
    colnames(cluster) <- 'cuil_direccion'

    library(stringr)
    library(foreach)

    crea_tabla <- function(x){
      # one single-row data.frame per list element
      xlst <- split(x, 1:nrow(x))
      # initial row, created once outside the loop
      d<-data.frame(dir='a', E_numdir=1)
      pred <- foreach(i = seq_along(xlst), .combine = rbind) %dopar% {
        DIR<-xlst[[i]]$cuil_direccion
        E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
        # the last expression is what .combine = rbind stacks
        data.frame(dir=DIR , E_numdir=toString(E_NUMDIR))
      }
      d <- rbind(d, pred)
      return(d)
    }

    crea_tabla(cluster)

                        dir   E_numdir
    1                     a          1
    2       PJE INDEA 98 5       98, 5
    3        PJE INDE 98 5       98, 5
    4 B 34 VIV RECRE 57 00  34, 57, 00
    5     S CASA DE GO 600         600
    6        RCCA 958 00 o     958, 00
    7          JUAN B 1900        1900
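The function above uses %dopar% without registering a backend, in which case foreach runs the loop sequentially and issues a warning. Below is a minimal sketch of one way to register doParallel explicitly, assuming the foreach, doParallel and stringr packages are installed. The helper name `extract_numbers`, the `n_workers` argument and the two-row sample data are illustrative assumptions, not part of the original answer. Passing `.packages = "stringr"` asks each worker to load stringr, which matters on socket clusters, where workers start as fresh R sessions.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: extract the digit runs from each address in parallel.
extract_numbers <- function(x, n_workers = 2) {
  cl <- parallel::makeCluster(n_workers)   # socket cluster of fresh R sessions
  doParallel::registerDoParallel(cl)
  # Stop the workers and fall back to the sequential backend on exit.
  on.exit({
    parallel::stopCluster(cl)
    foreach::registerDoSEQ()
  }, add = TRUE)

  addresses <- as.character(x$cuil_direccion)
  foreach(i = seq_along(addresses), .combine = rbind,
          .packages = "stringr") %dopar% {
    nums <- str_extract_all(addresses[i], "\\(?[0-9]+\\)?")[[1]]
    # The last expression is the value that .combine = rbind stacks.
    data.frame(dir = addresses[i], E_numdir = toString(nums),
               stringsAsFactors = FALSE)
  }
}

# Small sample taken from the question's data.
cluster <- data.frame(
  cuil_direccion = c("PJE INDEA 98 5 ", "B 34 VIV RECRE 57 00 "),
  stringsAsFactors = FALSE
)
res <- extract_numbers(cluster)
print(res)
```

For a table with many rows, one task per row can be dominated by scheduling overhead, so it may be worth splitting the data into a few larger chunks instead.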

4 Comments

Thanks! Can I also add a print to know what % was done at each moment? Something like print(i * #cores / nrow(x))?
You can add another argument to pass the number of cores and use the print statement. Just make sure you don't put it at the end of the foreach body, or it will try to rbind that % together instead of your data.frames!
Thanks, but please see the edit. I still don't get the printed output.
Ah yes, I forgot about the parallelization issue. This has been discussed several times on here. See this and this question. I am unaware of any general use solution that has been developed. I think you will be stuck with creating a log file if this is expected to take a long time.
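The log-file idea from the last comment could look roughly like this. It is only a sketch under stated assumptions: the temporary log path, the two-worker cluster and the squared-number workload are placeholders, and each task appends its own line to the file with `cat(..., append = TRUE)`. Lines may arrive out of order because the workers run concurrently.

```r
library(foreach)
library(doParallel)

log_path <- tempfile(fileext = ".log")   # hypothetical log location
cl <- parallel::makeCluster(2)           # illustrative worker count
registerDoParallel(cl)

n <- 6
res <- foreach(i = seq_len(n), .combine = c) %dopar% {
  # Append one progress line per task. Keep this before the last expression,
  # because the last expression is the value that gets combined.
  cat(sprintf("task %d of %d done\n", i, n), file = log_path, append = TRUE)
  i^2
}

parallel::stopCluster(cl)
registerDoSEQ()

# Inspect the progress log once the loop has finished.
progress <- readLines(log_path)
print(progress)
```

While a long job is running, the same file can be watched from another terminal, for example with `tail -f` on the log path.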
