How to create a dataframe with repeated columns created from randomly sampling another dataframe?

Question

I am trying to repeatedly add columns to a dataframe using random sampling from another dataframe.

My first dataframe, with the actual data to be sampled from, looks like this:

df <- data.frame(cat = c("a", "b", "c","a", "b", "c"), x = c(6,23,675,1,78,543))

I have another dataframe like this:

df2 <- data.frame(obs =c(1,2,3,4,5,6,7,8,9,10), cat=c("a", "a", "a", "b", "b", "b", "c","c","c", "c"))

I want to add 1000 new columns to df2 that randomly sample from df, grouped by cat. I figured out a (probably very amateurish) way of doing this once, by using slice_sample() to make a new dataframe sample1 with a random sample of df, and then merging sample1 with df2.

df <- df %>% group_by(cat)
df2 <- df2 %>% group_by(cat)
sample1 <- slice_sample(df, preserve = T, n = 3, replace = T)
sample1 <- sample1 %>% ungroup() %>% mutate(obs = c(1:9)) %>% select(-cat)
df3 <- merge(df2, sample1, by = "obs")

Now, I want to find a way to repeat this 1000 times, to end up with df3 with 1000 columns (x1,x2,x3 etc.)

I have looked into repeat loops, but haven't been able to figure out how to make the above code work inside the loop.

I think you can wrap this in a function and use replicate(1000, call_your_fn)
– akrun, Commented Dec 22, 2020 at 23:59

ThomasIsCoding · Accepted Answer · 2020-12-23 00:48:40Z

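Here is a data.table option that might help:

library(data.table)

dt <- as.data.table(df)
dt2 <- as.data.table(df2)
n <- 1000

# For each cat group in dt2, draw .N values (with replacement) from the
# matching x values in dt, and repeat that n times to get n new columns
res <- cbind(
  dt2[, .(obs)],
  dt2[
    ,
    replicate(n, sample(dt[.BY, x, on = "cat"], .N, replace = TRUE), simplify = FALSE),
    cat
  ]
)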
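The question asks for columns named x1, x2, x3 and so on, while an unnamed list returned from j gets data.table's default names V1, V2, ... A minimal sketch of the renaming step, assuming the same df and df2 as in the question and that dt2 is already sorted by cat (so the obs column lines up with the grouped result):

```r
library(data.table)

df <- data.frame(cat = c("a", "b", "c", "a", "b", "c"), x = c(6, 23, 675, 1, 78, 543))
df2 <- data.frame(obs = 1:10, cat = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

dt <- as.data.table(df)
dt2 <- as.data.table(df2)
n <- 1000

# Same sampling step as above: n draws of size .N per cat group
res <- cbind(
  dt2[, .(obs)],
  dt2[
    ,
    replicate(n, sample(dt[.BY, x, on = "cat"], .N, replace = TRUE), simplify = FALSE),
    cat
  ]
)

# Rename the default V1..Vn columns to x1..xn, by reference
setnames(res, old = paste0("V", seq_len(n)), new = paste0("x", seq_len(n)))
```

Note that sample() treats a single number as the range 1:x, so this would need adjusting if any cat in df had only one value; that is not the case with the example data.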

akrun · Accepted Answer · 2020-12-23 00:11:30Z

An option is to create a function and then use either replicate or rerun (from purrr) before doing the join.

library(dplyr)
library(purrr)
library(stringr)

# One sampling pass: 3 draws per cat (with replacement), keyed by row position
f1 <- function(dat1) {
  dat1 %>%
    group_by(cat) %>%
    slice_sample(n = 3, replace = TRUE) %>%
    ungroup() %>%
    mutate(obs = row_number()) %>%
    select(-cat)
}

n <- 10
out <- rerun(n, f1(df)) %>%
  c(list(df2), .) %>%
  reduce(inner_join, by = 'obs') %>%
  rename_at(vars(starts_with('x')), ~ str_c('x', seq_along(.)))
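For readers on a newer purrr: rerun() has been deprecated since purrr 1.0.0 in favour of map() over an index. The following is a rough sketch of the same idea in that style, not the answer's own code. The helper name sample_once is made up for illustration, and the sampled column is named x1, x2, ... up front so no renaming is needed after the joins.

```r
library(dplyr)
library(purrr)

df <- data.frame(cat = c("a", "b", "c", "a", "b", "c"), x = c(6, 23, 675, 1, 78, 543))
df2 <- data.frame(obs = 1:10, cat = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

# Hypothetical helper: one sampling pass whose value column is named x<i>
sample_once <- function(dat, i) {
  out <- dat %>%
    group_by(cat) %>%
    slice_sample(n = 3, replace = TRUE) %>%
    ungroup() %>%
    mutate(obs = row_number()) %>%
    select(obs, x)
  names(out)[names(out) == "x"] <- paste0("x", i)
  out
}

n <- 10
samples <- map(seq_len(n), function(i) sample_once(df, i))

# Join each sample onto df2 in turn, starting from df2 itself
out <- reduce(samples, function(acc, s) inner_join(acc, s, by = "obs"), .init = df2)
```

As in the original, each pass produces 9 rows (3 per cat), so the inner join drops obs 10 from df2.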

Ronak Shah · Accepted Answer · 2020-12-23 02:21:55Z

You can keep only the first 3 × (number of unique cat values) rows of df2, so that it has as many rows as each sample. Then use replicate to repeat the sampling process n times and add n new columns.

library(dplyr)

n <- 10
# Keep 3 rows per distinct cat so df2 matches the 9 sampled values
df2 <- df2 %>% slice(1:(3*n_distinct(cat)))
# Each replicate() call returns 9 sampled x values; together they form a 9 x n matrix
df2[paste0('x', 1:n)] <- replicate(n, df %>%
                                        group_by(cat) %>%
                                        slice_sample(n = 3, replace = TRUE) %>%
                                        pull(x))

#  obs cat  x1  x2  x3  x4  x5  x6  x7  x8  x9 x10
#1   1   a   6   1   1   6   6   1   1   1   6   6
#2   2   a   6   1   1   1   1   6   1   1   1   1
#3   3   a   1   6   1   6   1   6   6   1   6   6
#4   4   b  78  78  78  23  78  78  78  78  23  23
#5   5   b  78  78  78  23  23  23  78  78  78  23
#6   6   b  78  78  23  78  78  78  23  23  78  23
#7   7   c 675 543 543 543 543 543 675 543 543 675
#8   8   c 543 543 675 675 675 675 675 543 675 543
#9   9   c 543 543 675 543 675 543 675 675 543 675
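If dropping the tenth row of df2 is not acceptable, one possible variation is to draw as many values per cat as df2 actually has rows for that cat. This is a base R sketch under that assumption, not part of the answer above; draw_once and pools are names invented for the example, and it assumes every cat in df2 also appears in df.

```r
df <- data.frame(cat = c("a", "b", "c", "a", "b", "c"), x = c(6, 23, 675, 1, 78, 543))
df2 <- data.frame(obs = 1:10, cat = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))

n <- 1000

# Candidate x values for each cat, e.g. pools$a is c(6, 1)
pools <- split(df$x, df$cat)

# Hypothetical helper: one column of samples, aligned with the rows of df2
draw_once <- function() {
  out <- numeric(nrow(df2))
  for (g in names(pools)) {
    idx <- which(df2$cat == g)
    pool <- pools[[g]]
    # Index with sample.int() so a pool of length 1 is not read as the range 1:x
    out[idx] <- pool[sample.int(length(pool), length(idx), replace = TRUE)]
  }
  out
}

# replicate() stacks the n draws into a 10 x n matrix
samples <- replicate(n, draw_once())
colnames(samples) <- paste0("x", seq_len(n))
df3 <- cbind(df2, samples)
```

Here df3 keeps all 10 rows of df2 and gains columns x1 to x1000, with the four c rows each sampled from the two c values in df.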
