R: Stratified random sample proportion of unique ID's by grouping variable

Question

With the following sample data frame, I would like to draw a stratified random sample (e.g., 40%) of the IDs in the "ID" column from each level of the factor "Cohort":

    data <- structure(list(
      Cohort = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
      ID = structure(1:20, .Label = c("a1 ", "a2", "a3", "a4", "a5", "a6", "a7",
                                      "a8", "a9", "b10", "b11", "b12", "b13", "b14",
                                      "b15", "b16", "b17", "b18", "b19", "b20"),
                     class = "factor")),
      .Names = c("Cohort", "ID"), class = "data.frame", row.names = c(NA, -20L))

I only know how to draw a fixed number of random rows per group, using the following:

    library(dplyr)
    data %>% group_by(Cohort) %>% sample_n(size = 10)

But my actual data are longitudinal, so I have multiple rows for the same ID within each cohort, and there are several cohorts of different sizes; hence the need to select a proportion of unique IDs. Any assistance would be appreciated.

You should provide data that reproduce the problem you have, otherwise we cannot understand it. So if you have repeated IDs, please produce data with this feature ;)
– Arthur, Commented Nov 21, 2015 at 0:48

eipi10 · Accepted Answer · 2015-11-21 00:51:38Z

Here's one way:

    data %>% group_by(Cohort) %>%
      filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

This will return all rows containing the randomly sampled IDs. In other words, I'm assuming you have measurements that go with each row and that you want all the measurements for each sampled ID. (If you just want one row returned for each sampled ID then @bramtayl's answer will do that.)

For example:

    data = data.frame(rbind(data, data), value=rnorm(2*nrow(data)))

    data %>% group_by(Cohort) %>%
      filter(ID %in% sample(unique(ID), ceiling(0.4*length(unique(ID)))))

       Cohort     ID       value
        (int) (fctr)       (dbl)
    1       1     a1 -0.92370760
    2       1     a2 -0.37230655
    3       1     a3 -1.27037502
    4       1     a7 -0.34545295
    5       2    b14 -2.08205561
    6       2    b17  0.31393998
    7       2    b18 -0.02250819
    8       2    b19  0.53065857
    9       2    b20  0.03924414
    10      1     a1 -0.08275011
    11      1     a2 -0.10036822
    12      1     a3  1.42397042
    13      1     a7 -0.35203237
    14      2    b14  0.30422865
    15      2    b17 -1.82008014
    16      2    b18  1.67548568
    17      2    b19  0.74324596
    18      2    b20  0.27725794
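To check that the filter really keeps 40% of the unique IDs per cohort (rounded up) together with all of their rows, here is a minimal sketch. The data frame `long`, its `visit` column, and the seed are made up for illustration and are not from the question; only the `filter()` call is taken from the answer above.

```r
library(dplyr)

set.seed(42)  # assumption: any fixed seed, used only to make the draw repeatable

# Hypothetical longitudinal data: 9 IDs in cohort 1, 11 in cohort 2, 3 visits each
long <- data.frame(
  Cohort = rep(c(1L, 2L), times = c(9L, 11L)),
  ID     = c(paste0("a", 1:9), paste0("b", 10:20)),
  stringsAsFactors = FALSE
)
long <- long[rep(seq_len(nrow(long)), each = 3), ]  # repeat every ID three times
long$visit <- rep(1:3, times = 20)
long$value <- rnorm(nrow(long))

# Same filter as in the answer: sample IDs within each cohort, keep all their rows
sampled <- long %>%
  group_by(Cohort) %>%
  filter(ID %in% sample(unique(ID), ceiling(0.4 * length(unique(ID))))) %>%
  ungroup()

# Count distinct sampled IDs and retained rows per cohort
check <- sampled %>%
  group_by(Cohort) %>%
  summarise(n_ids = n_distinct(ID), n_rows = n())
print(check)
```

With 9 and 11 unique IDs, `ceiling(0.4 * n)` gives 4 and 5 sampled IDs, so `check` should report 12 and 15 rows. Which IDs are drawn depends on the seed.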

bramtayl · Answer · 2015-11-23 00:24:15Z

Why not

    library(dplyr)
    data %>%
      select(ID, Cohort) %>%
      distinct %>%
      group_by(Cohort) %>%
      sample_frac(0.4) %>%
      left_join(data)
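A minimal sketch of the same idea on data with repeated IDs, with the join keys spelled out so that `left_join()` does not have to guess them. The data frame `long` and the seed are hypothetical. How many IDs `sample_frac(0.4)` returns for a group depends on how it rounds, so the sketch prints the counts rather than assuming them.

```r
library(dplyr)

set.seed(1)  # assumption: any fixed seed, used only to make the draw repeatable

# Hypothetical data: 9 IDs in cohort 1, 11 in cohort 2, each ID measured twice
long <- data.frame(
  Cohort = rep(rep(c(1L, 2L), times = c(9L, 11L)), 2),
  ID     = rep(c(paste0("a", 1:9), paste0("b", 10:20)), 2),
  stringsAsFactors = FALSE
)
long$value <- rnorm(nrow(long))

# Step 1: reduce to one row per ID, then sample a fraction of IDs within each cohort
ids <- long %>%
  distinct(ID, Cohort) %>%
  group_by(Cohort) %>%
  sample_frac(0.4) %>%
  ungroup()

# Step 2: join back to recover every measurement for the sampled IDs
result <- ids %>% left_join(long, by = c("ID", "Cohort"))

# Inspect how many IDs and rows were kept per cohort
result %>%
  group_by(Cohort) %>%
  summarise(n_ids = n_distinct(ID), n_rows = n()) %>%
  print()
```

Because the sampling happens on the de-duplicated ID table, each ID has the same chance of selection regardless of how many rows it has in `long`.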
