I am trying to repeatedly add columns to a dataframe using random sampling from another dataframe.

My first dataframe, which holds the actual data to be sampled from, looks like this:

    df <- data.frame(cat = c("a", "b", "c", "a", "b", "c"),
                     x = c(6, 23, 675, 1, 78, 543))

I have another dataframe like this:

    df2 <- data.frame(obs = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                      cat = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

I want to add 1000 new columns to df2, each randomly sampled from df and grouped by cat. I figured out a (probably very amateurish) way of doing this once: use slice_sample() to make a new dataframe, sample1, holding a random sample of df, and then merge sample1 with df2.

    library(dplyr)

    df <- df %>% group_by(cat)
    df2 <- df2 %>% group_by(cat)

    # Draw 3 rows per cat, with replacement
    sample1 <- slice_sample(df, n = 3, replace = TRUE)

    # Number the sampled rows so they can be matched to df2 by obs
    sample1 <- sample1 %>%
      ungroup() %>%
      mutate(obs = 1:9) %>%
      select(-cat)

    df3 <- merge(df2, sample1, by = "obs")

Now I want to find a way to repeat this 1000 times, so that df3 ends up with 1000 sampled columns (x1, x2, x3, etc.).

I have looked into repeat loops, but haven't been able to figure out how to make the above code work inside the loop.

  • I think you can wrap this in a function and use replicate(1000, call_your_fn()). (Commented Dec 22, 2020 at 23:59)
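The comment's suggestion could be sketched roughly as follows, in base R only. The helper name `sample_by_cat` is hypothetical and not part of the question. This variant also assumes that every row of df2 (including the tenth) should receive a value drawn from the df rows with the same cat, which differs slightly from the question's own code, where only nine rows are matched.

```r
# Example data from the question
df <- data.frame(cat = c("a", "b", "c", "a", "b", "c"),
                 x = c(6, 23, 675, 1, 78, 543))
df2 <- data.frame(obs = 1:10,
                  cat = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

# Hypothetical helper: returns one value per row of `target`, drawn with
# replacement from the x values in `source` that share the row's cat.
sample_by_cat <- function(target, source) {
  pools <- split(source$x, source$cat)  # list of x values, one entry per cat
  vapply(as.character(target$cat), function(g) {
    pool <- pools[[g]]
    pool[sample.int(length(pool), 1)]   # pick one element of the pool by index
  }, numeric(1), USE.NAMES = FALSE)
}

set.seed(42)  # only to make the illustration reproducible
n <- 1000

# replicate() evaluates the call n times and binds the results column-wise,
# giving a matrix with nrow(df2) rows and n columns
samples <- replicate(n, sample_by_cat(df2, df))
colnames(samples) <- paste0("x", seq_len(n))

df3 <- cbind(df2, samples)
```

Here df3 keeps the obs and cat columns of df2 and gains the columns x1 to x1000. Because each call samples independently, no merge on obs is needed.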

3 Answers

Here is a data.table option that might help:

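    library(data.table)

    dt <- as.data.table(df)
    dt2 <- as.data.table(df2)
    n <- 1000

    res <- cbind(
      dt2[, .(obs)],
      dt2[
        ,
        # For each cat in dt2, draw .N values (one per row of the group)
        # from the matching x values in dt, repeated n times
        replicate(n, sample(dt[.BY, x, on = "cat"], .N, replace = TRUE), simplify = FALSE),
        cat
      ]
    )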
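One caveat when sampling values with sample(), as the code above does: if a cat happens to have a single numeric x value, sample() treats that number as an upper bound and draws from 1:x instead. This does not occur with the example data, where every cat has two values. The sketch below illustrates the behaviour and one possible workaround; the helper name `safe_sample` is hypothetical.

```r
# A pool containing a single value
pool <- 543

set.seed(1)  # only to make the illustration reproducible

# sample() interprets a single number >= 1 as "sample from 1:pool"
unsafe <- sample(pool, 5, replace = TRUE)

# Hypothetical helper: sample positions rather than values, so a pool of
# length 1 is still treated as a pool
safe_sample <- function(pool, size) {
  pool[sample.int(length(pool), size, replace = TRUE)]
}

safe <- safe_sample(pool, 5)  # always 543 for this pool
```

If single-value groups are possible in the real data, the call to sample() could be swapped for a helper along these lines.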

One option is to create a function and then use either replicate or rerun (from purrr) before doing the join:

    library(dplyr)
    library(purrr)
    library(stringr)

    # Draw 3 rows per cat with replacement and number them for the join
    f1 <- function(dat1) {
      dat1 %>%
        group_by(cat) %>%
        slice_sample(n = 3, replace = TRUE) %>%
        ungroup() %>%
        mutate(obs = row_number()) %>%
        select(-cat)
    }

    n <- 10

    # rerun() is deprecated as of purrr 1.0.0; map(seq_len(n), ~ f1(df)) is the equivalent
    out <- rerun(n, f1(df)) %>%
      c(list(df2), .) %>%
      reduce(inner_join, by = 'obs') %>%
      rename_at(vars(starts_with('x')), ~ str_c('x', seq_along(.)))

Comments

You can keep only the first 3 × (number of unique cat values) rows of df2, so that it has as many rows as each sample. Then use replicate to repeat the sampling process n times and add n new columns.

    library(dplyr)

    n <- 10
    df2 <- df2 %>% slice(1:(3 * n_distinct(cat)))

    df2[paste0('x', 1:n)] <- replicate(n, df %>%
                                            group_by(cat) %>%
                                            slice_sample(n = 3, replace = TRUE) %>%
                                            pull(x))

    #  obs cat  x1  x2  x3  x4  x5  x6  x7  x8  x9 x10
    #1   1   a   6   1   1   6   6   1   1   1   6   6
    #2   2   a   6   1   1   1   1   6   1   1   1   1
    #3   3   a   1   6   1   6   1   6   6   1   6   6
    #4   4   b  78  78  78  23  78  78  78  78  23  23
    #5   5   b  78  78  78  23  23  23  78  78  78  23
    #6   6   b  78  78  23  78  78  78  23  23  78  23
    #7   7   c 675 543 543 543 543 543 675 543 543 675
    #8   8   c 543 543 675 675 675 675 675 543 675 543
    #9   9   c 543 543 675 543 675 543 675 675 543 675
