I am randomly sampling participants from an original data frame. I would then like to create new data frames, each excluding one sample and keeping the rest. (Note that the real data frame is much larger, with more variables and more observations per id.)

Sample df:

    id var1 var2
     1   10   15
     1   10   15
     2   11    4
     2   11    4
     3   12    4
     3   12    4
     4    9   10
     4    9   10

    #randomly sample two sets of id
    id <- as.numeric(as.character(df$id))
    fold1 <- as.data.frame(sample(id, 2, replace=TRUE))
    colnames(fold1) <- "id"
    fold2 <- as.data.frame(sample(id, 2, replace=TRUE))
    colnames(fold2) <- "id"

Desired output

df.new1:

    id var1 var2
     2   11    4
     2   11    4
     3   12    4
     3   12    4

df.new2:

    id var1 var2
     1   10   15
     1   10   15
     4    9   10
     4    9   10

I tried something along these lines, but there seems to be some issue with my syntax I can't quite figure out. If there's a dplyr implementation I would be really happy to see it.

    list = c(fold1, fold2)
    for(i in length(list)) {
      df.new <- as.data.frame(df[!(df$id %in% list[i]$id), ])
      assign(paste("df.new", i, sep="."), df.new)
    }

**Edit:** I slightly modified the example to reflect that each draw should sample a proportion of the ids, and that across all draws the number of ids sampled should equal the total number of ids in the df. So if there are 4 ids, each draw should contain 2 ids.

  • dplyr has the methods sample_n (sample n rows) and sample_frac (sample a proportion of rows). Do they help? Commented Jun 13, 2017 at 3:00
  • I tried group_by(id) with sample_n, but it didn't seem like it was sampling based on a random draw of id. But maybe there's a way to specify how it draws? Commented Jun 13, 2017 at 3:04
  • How many such data frames do you need? All possible combinations? Also, just to confirm, you need to ignore one id at a time? Commented Jun 13, 2017 at 3:04
  • I need to draw a proportion of the id's (sorry I made the example too short). So if I had 60 id's and wanted 5 draws, I would have 12 id's in each fold and 5 folds. Commented Jun 13, 2017 at 3:11

1 Answer

As an example, suppose you have sample data with 60 ids, each with one value:

    df <- data.frame(id=1:60, val=sample(rep(letters, 3), 60))

To get the ids for 5 subsets, each with 12 ids:

    set.seed(1)
    draw <- sample(1:60, 60, replace=FALSE)
    id <- split(draw, rep(1:5, each=12))
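The question's data has several rows per id, so there the draw would need to run over the distinct ids rather than over `1:60`. A minimal sketch of that variation, assuming a toy data frame shaped like the one in the question (`df_q`, `n_folds`, `shuffled`, and `folds` are illustrative names, not from the original post):

```r
# Toy data shaped like the question's: 4 ids, 2 rows each (assumed for illustration)
df_q <- data.frame(
  id   = rep(1:4, each = 2),
  var1 = rep(c(10, 11, 12, 9), each = 2),
  var2 = rep(c(15, 4, 4, 10), each = 2)
)

set.seed(1)
n_folds  <- 2                 # hypothetical: number of draws wanted
ids      <- unique(df_q$id)   # sample ids, not rows
shuffled <- sample(ids)       # random permutation, i.e. sampling without replacement

# Deal the shuffled ids into n_folds groups of (near-)equal size
folds <- split(shuffled, rep(seq_len(n_folds), length.out = length(shuffled)))
```

Because the ids are permuted once and then split, every id lands in exactly one fold, which is what the edit to the question asks for. Sampling with `replace=TRUE`, as in the question, would not guarantee that.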

Using lapply to subset based on the id:

    output <- lapply(id, function(x) df[df$id %in% x, ])
    # e.g. output[[1]]
    #    id val
    # 4   4   y
    # 9   9   f
    # 11 11   x
    # 12 12   e
    # 16 16   o
    # 22 22   o
    # 33 33   d
    # 34 34   n
    # 36 36   r
    # 50 50   s
    # 52 52   p
    # 57 57   p

4 Comments

This is great in that it gets the 5 samples, but at the very top of the post I mention that I actually want to exclude each subset draw from the original df (so the current output contains exactly the 5 dfs that I want to exclude from the original df). Does that make sense?
So you wanted only 2 data.frames, one with 5 id and another with 55 id?
5 data frames with 48 ids each (so taking the 12 randomly selected out)
lapply(id, function(x) df[!df$id %in% x, ])
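Putting the negated filter from the last comment together with data like the question's might look like the sketch below. It assumes a toy data frame with repeated ids; `df_q`, `folds`, and `df_new` are illustrative names, not from the original post.

```r
# Toy data with repeated ids (assumed for illustration)
df_q <- data.frame(
  id   = rep(1:4, each = 2),
  var1 = rep(c(10, 11, 12, 9), each = 2),
  var2 = rep(c(15, 4, 4, 10), each = 2)
)

set.seed(1)
# Permute the distinct ids, then split them into 2 folds of 2 ids each
folds <- split(sample(unique(df_q$id)), rep(1:2, each = 2))

# One data frame per fold, each with that fold's ids removed
df_new <- lapply(folds, function(x) df_q[!df_q$id %in% x, ])
names(df_new) <- paste0("df.new", seq_along(df_new))
```

Keeping the results in a named list (`df_new[["df.new1"]]`) avoids `assign()`. With dplyr loaded, the function body could be written as `dplyr::filter(df_q, !id %in% x)` instead. As for the loop in the question, `for (i in length(list))` visits only the last index, whereas `seq_along(list)` would visit each one, which is probably why only one data frame was being created.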
