
With the following sample data frame, I would like to draw a stratified random sample (e.g., 40%) of the IDs in the column "ID" from each level of the factor "Cohort":

data <- structure(list(
  Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
             2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  ID = structure(1:20,
    .Label = c("a1 ", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9",
               "b10", "b11", "b12", "b13", "b14", "b15", "b16", "b17",
               "b18", "b19", "b20"),
    class = "factor")),
  .Names = c("Cohort", "ID"),
  class = "data.frame",
  row.names = c(NA, -20L))

I only know how to draw a fixed number of random rows from each group, using the following:

library(dplyr)

data %>%
  group_by(Cohort) %>%
  sample_n(size = 10)

But my actual data are longitudinal, so the same ID appears in multiple rows within each cohort, and there are several cohorts of different sizes. That is why I need to select a proportion of the unique IDs rather than a number of rows. Any assistance would be appreciated.

  • You should provide data that reproduce the problem you have; otherwise we cannot understand it. So if your IDs are repeated across rows, please produce sample data with this feature ;) Commented Nov 21, 2015 at 0:48

2 Answers


Here's one way:

data %>% group_by(Cohort) %>% filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID))))) 

This will return all rows containing the randomly sampled IDs. In other words, I'm assuming you have measurements that go with each row and that you want all the measurements for each sampled ID. (If you just want one row returned for each sampled ID then @bramtayl's answer will do that.)

For example:

data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

data %>% group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

   Cohort     ID       value
    (int) (fctr)       (dbl)
1       1     a1 -0.92370760
2       1     a2 -0.37230655
3       1     a3 -1.27037502
4       1     a7 -0.34545295
5       2    b14 -2.08205561
6       2    b17  0.31393998
7       2    b18 -0.02250819
8       2    b19  0.53065857
9       2    b20  0.03924414
10      1     a1 -0.08275011
11      1     a2 -0.10036822
12      1     a3  1.42397042
13      1     a7 -0.35203237
14      2    b14  0.30422865
15      2    b17 -1.82008014
16      2    b18  1.67548568
17      2    b19  0.74324596
18      2    b20  0.27725794
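
To check that this approach samples IDs rather than rows, here is a minimal sketch on a made-up longitudinal data frame. The names `toy` and `sampled` and the seed are assumptions for illustration, not part of the original question. Each ID has two rows, and the two cohorts contain 9 and 11 unique IDs.

```r
library(dplyr)

# Hypothetical toy data: two rows per ID, cohorts of different sizes.
toy <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(18, 22)),
  ID     = rep(c(paste0("a", 1:9), paste0("b", 10:20)), each = 2),
  value  = seq_len(40)
)

set.seed(42)  # arbitrary seed so the draw is repeatable

# Within each cohort, pick ceiling(40%) of the unique IDs
# and keep every row that belongs to a picked ID.
sampled <- toy %>%
  group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4 * length(unique(ID))))) %>%
  ungroup()

# Count the distinct IDs and rows kept per cohort.
sampled %>%
  group_by(Cohort) %>%
  summarise(n_ids = n_distinct(ID), n_rows = n())
```

Whatever the seed, cohort 1 should keep ceiling(0.4 * 9) = 4 IDs (8 rows) and cohort 2 should keep ceiling(0.4 * 11) = 5 IDs (10 rows), because `ceiling()` rounds the per-cohort count up.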


Why not

library(dplyr)

data %>%
  select(ID, Cohort) %>%
  distinct %>%
  group_by(Cohort) %>%
  sample_frac(0.4) %>%
  left_join(data)
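
The pipeline above can be read as three steps: reduce the data to one row per ID and cohort, sample a fraction of those rows within each cohort, then join back to the full data to recover every row for the sampled IDs. A minimal sketch of those steps follows. The names `toy`, `ids`, `picked`, and `result` are made up for illustration, and the explicit `by` argument is an addition that only silences the "Joining, by = ..." message.

```r
library(dplyr)

# Hypothetical toy data: two measurements per ID, two cohorts.
toy <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(18, 22)),
  ID     = rep(c(paste0("a", 1:9), paste0("b", 10:20)), each = 2),
  value  = seq_len(40)
)

set.seed(1)  # arbitrary seed so the draw is repeatable

# Step 1: one row per (ID, Cohort) pair.
ids <- toy %>% select(ID, Cohort) %>% distinct()

# Step 2: sample 40% of those rows within each cohort.
picked <- ids %>% group_by(Cohort) %>% sample_frac(0.4) %>% ungroup()

# Step 3: join back so every measurement of a sampled ID is returned.
result <- picked %>% left_join(toy, by = c("ID", "Cohort"))
```

Because the sampling happens on the de-duplicated `ids` table, an ID with many measurements is no more likely to be picked than one with few.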
