Create new data frame from random samples of original data

Question

I am randomly sampling participants from an original data frame. For each sample, I would like to create a new data frame that excludes the sampled participants and keeps the rest. (Note that the real data frame is much larger, with more variables and more observations for each id.)

Sample df:

    id var1 var2
     1   10   15
     1   10   15
     2   11    4
     2   11    4
     3   12    4
     3   12    4
     4    9   10
     4    9   10

    # randomly sample two sets of id
    id <- as.numeric(as.character(df$id))
    fold1 <- as.data.frame(sample(id, 2, replace=TRUE))
    colnames(fold1) <- "id"
    fold2 <- as.data.frame(sample(id, 2, replace=TRUE))
    colnames(fold2) <- "id"

Desired output

df.new1:

    id var1 var2
     2   11    4
     2   11    4
     3   12    4
     3   12    4

df.new2:

    id var1 var2
     1   10   15
     1   10   15
     4    9   10
     4    9   10

I tried something along these lines, but there seems to be some issue with my syntax I can't quite figure out. If there's a dplyr implementation I would be really happy to see it.

    list = c(fold1, fold2)
    for(i in length(list)) {
      df.new <- as.data.frame(df[!(df$id %in% list[i]$id), ])
      assign(paste("df.new", i, sep="."), df.new)
    }

**Edit:** I slightly modified the example to reflect that each draw should sample a proportion of the ids, and that across all draws the number of ids sampled should equal the total number of ids in the df. So if there are 4 ids, each draw should contain 2 ids.

dplyr has the methods sample_n (sample n rows) and sample_frac (sample a proportion of rows). Do they help?
– neilfws, Commented Jun 13, 2017 at 3:00
I tried group_by(id) with sample_n, but it didn't seem like it was sampling based on a random draw of id. But maybe there's a way to specify how it draws?
– Mik, Commented Jun 13, 2017 at 3:04
How many such dataframes do you need? All possible combinations? Also, just to confirm, you need to ignore one id at a time?
– Ronak Shah, Commented Jun 13, 2017 at 3:04
I need to draw a proportion of the ids (sorry, I made the example too short). So if I had 60 ids and wanted 5 draws, I would have 12 ids in each fold and 5 folds.
– Mik, Commented Jun 13, 2017 at 3:11

Adam Quek · Accepted Answer · 2017-06-13 03:39:35Z

For example, suppose you have sample data with 60 ids, each with one value:

    df <- data.frame(id=1:60, val=sample(rep(letters, 3), 60))

To get the ids for 5 subsets of the data, each with 12 ids:

    set.seed(1)
    draw <- sample(1:60, 60, replace=FALSE)
    id <- split(draw, rep(1:5, each=12))
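If the real data has several rows per id, as in the question, the folds would presumably be drawn from the unique ids rather than from 1:60. A minimal sketch of that step, assuming a hypothetical helper name (make_folds) and toy data (toy) that are not part of the original answer:

```r
# Hypothetical helper (not from the original answer): randomly partition
# the unique ids into k folds of near-equal size.
make_folds <- function(ids, k) {
  ids <- unique(ids)
  # Shuffling by index avoids sample()'s special case for a single number.
  shuffled <- ids[sample.int(length(ids))]
  # rep(..., length.out = ) copes with id counts that k does not divide.
  split(shuffled, rep(seq_len(k), length.out = length(shuffled)))
}

# Toy data shaped like the question's: two rows per id.
toy <- data.frame(id   = rep(1:4, each = 2),
                  var1 = rep(c(10, 11, 12, 9), each = 2),
                  var2 = rep(c(15, 4, 4, 10), each = 2))

set.seed(1)
folds <- make_folds(toy$id, k = 2)
lengths(folds)  # number of ids in each fold
```

Every id lands in exactly one fold, so the folds together cover all ids without overlap.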

Using lapply to subset based on the id:

    output <- lapply(id, function(x) df[df$id %in% x, ])

    # e.g. output[[1]]
    #    id val
    # 4   4   y
    # 9   9   f
    # 11 11   x
    # 12 12   e
    # 16 16   o
    # 22 22   o
    # 33 33   d
    # 34 34   n
    # 36 36   r
    # 50 50   s
    # 52 52   p
    # 57 57   p

edited Jun 13, 2017 at 3:39

answered Jun 13, 2017 at 3:25


4 Comments

Mik Over a year ago

This is great in that it gets the 5 samples, but at the very top of the post I mention that I actually want to exclude each subset draw from the original df (so the current output contains exactly the 5 dfs that I want to exclude from the original df). Does that make sense?

Adam Quek Over a year ago

So you wanted only 2 data.frames, one with 5 id and another with 55 id?

Mik Over a year ago

5 data frames with 48 ids each (so taking the 12 randomly selected out)

Adam Quek Over a year ago

lapply(id, function(x) df[!df$id %in% x, ])
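Applied to data shaped like the question's, where each id has several rows, the negated %in% keeps every row whose id is not in the held-out fold. A sketch on toy data, with the object names (toy, folds, kept) chosen here for illustration only:

```r
# Toy data shaped like the question's: two rows per id.
toy <- data.frame(id   = rep(1:4, each = 2),
                  var1 = rep(c(10, 11, 12, 9), each = 2),
                  var2 = rep(c(15, 4, 4, 10), each = 2))

set.seed(1)
# Shuffle the unique ids once, then cut them into 2 folds of 2 ids each.
ids   <- unique(toy$id)
folds <- split(ids[sample.int(length(ids))], rep(1:2, each = 2))

# For each fold, keep all rows whose id is NOT in that fold.
kept <- lapply(folds, function(x) toy[!toy$id %in% x, ])
names(kept) <- paste0("df.new", seq_along(kept))

# Row counts per result: 2 remaining ids x 2 rows each.
sapply(kept, nrow)
```

Keeping the results in a named list (kept$df.new1, kept$df.new2) avoids creating loose objects with assign(). With dplyr loaded, filter(toy, !id %in% x) inside the same lapply should select the same rows.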
