R (Stratified) Random Sampling for Defined Cases

Question

I have a data frame:

DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"), ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1))

My question: I would like to create a new column containing a binary random number (0 or 1) for the cases where ID == 1, with a fixed proportion (a pre-defined prevalence) of each value, e.g., two 0s and four 1s.

EDIT I: For non-case-specific purposes, the solution might be:

DF$RANDOM[sample(1:nrow(DF), nrow(DF), FALSE)] <- rep(RANDOM, c(nrow(DF)-4,4))

But I still need the case-specific assignment, and the solution above does not explicitly refer to '0' or '1'.

(Note: The variable 'Value' is not relevant to the question; it is only an identifier.)

I found related posts on stratified sampling and random row selection, but this question is not covered by those (or other) posts.

Thank you VERY much in advance.

YOLO · Accepted Answer · 2018-03-04 19:04:08Z

1

You can subset the data first by case ID == 1. To guarantee the exact number of 1s and 0s, we build the values with the rep function and set replace = FALSE in the sample function.
Here's a solution.

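library(data.table)
setDT(DF)  # convert the data.frame to a data.table by reference so that := works
set.seed(121)
# shuffle two 0s and four 1s across the six rows with ID == 1; all other rows stay NA
DF[ID == 1, new_column := sample(rep(c(0, 1), c(2, 4)), .N, replace = FALSE)]
print(DF)

    Value ID new_column
 1:    AB  1          1
 2:    BC  0         NA
 3:    CD  0         NA
 4:    DE  1          1
 5:    EF  0         NA
 6:    FG  1          1
 7:    GH  1          1
 8:    HI  0         NA
 9:    IJ  0         NA
10:    JK  1          0
11:    KL  0         NA
12:    LM  1          0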
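For readers who would rather not load data.table, the same idea can be sketched in base R. This is only an illustration, not part of the original answer: the names cases and pool are made up here, and the sketch assumes the number of rows with ID == 1 equals the number of values to hand out (six).

```r
# Base R sketch: give the rows with ID == 1 exactly two 0s and four 1s.
DF <- data.frame(
  Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
  ID    = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1)
)

set.seed(121)                           # any seed; it only makes the shuffle reproducible
cases <- DF$ID == 1                     # logical index of the defined cases (hypothetical name)
pool  <- rep(c(0, 1), c(2, 4))          # fixed counts: 0 x 2 and 1 x 4 (hypothetical name)
stopifnot(sum(cases) == length(pool))   # assumption: one value per case

DF$new_column <- NA_real_               # non-cases stay NA
DF$new_column[cases] <- sample(pool)    # sample() on a length-6 vector returns a random permutation

table(DF$new_column[cases])             # counts are 2 zeros and 4 ones, whatever the seed
```

No subset of DF is created; the logical index writes the shuffled values straight into the full data frame.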

edited Mar 4, 2018 at 19:04

answered Mar 4, 2018 at 18:25


6 Comments

Dan Over a year ago

Yes, I know. BUT: I require a solution without a prior subset AND you worked with probabilities (I need fixed proportions - no probabilities).

YOLO Over a year ago

I've made a few changes. With a fixed seed (set.seed), the prob parameter will reproduce the same number of 1s and 0s on every run. Since you want to generate 1 and 0 randomly, so that 1 occurs 4 times and 0 occurs 2 times, that's how it's going to work.

Dan Over a year ago

Thanks. But what about the 'prob' part? You computed the proportions from the provided numbers, but I would like to have the fixed numbers. Finally, when you create an additional column (e.g., DF_1[ID == 1, new_column_additional := sample(c(0,1), .N, replace = T, prob = c(0.33,0.67))] ), the result is wrong.

YOLO Over a year ago

I've added the fix by using rep instead of probabilities. Please check.

Dan Over a year ago

GREAT. This works as anticipated. Thanks for your fast replies and your help. Good solution.
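The distinction discussed in these comments can be checked with a small simulation. The sketch below is illustrative only: the seed, the 1000 repetitions, and the names ones_prob and ones_fixed are arbitrary choices, not taken from the thread. It compares drawing with prob against shuffling a vector built with rep.

```r
set.seed(1)     # arbitrary seed, only for reproducibility
n_cases <- 6    # number of rows with ID == 1 in the example data

# Probability-based draw: every value is drawn independently,
# so the number of 1s differs from one draw to the next.
ones_prob <- replicate(
  1000,
  sum(sample(c(0, 1), n_cases, replace = TRUE, prob = c(1/3, 2/3)))
)

# Fixed-count draw: shuffle a vector that already holds two 0s and four 1s,
# so only the positions are random.
ones_fixed <- replicate(1000, sum(sample(rep(c(0, 1), c(2, 4)))))

table(ones_prob)     # several different counts of 1s occur
unique(ones_fixed)   # the count of 1s is 4 in every repetition
```

Fixing the seed makes a prob-based draw repeatable, but it does not make the counts equal to the target; only the rep-based shuffle guarantees the two 0s and four 1s.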


Yiran Wang · Accepted Answer · 2018-03-04 19:22:42Z

0

library(dplyr)

DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"), ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1), stringsAsFactors = FALSE)

DF %>%
  group_by(ID) %>%
  sample_n(4, replace = FALSE)

edited Mar 4, 2018 at 19:22

answered Mar 4, 2018 at 18:42


3 Comments

Dan Over a year ago

Your example does not work since you do not refer to the defined cases.

Yiran Wang Over a year ago

I edited my answer. Is this what you were looking for?

Dan Over a year ago

Thanks. No, since your example is not case-specific.
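For comparison, a dplyr pipeline can be made case-specific by filling a new column instead of sampling rows. The following is a hedged sketch rather than the answerer's method: it assumes dplyr is installed, the column name new_column is borrowed from the other answer, and it relies on there being exactly six rows with ID == 1.

```r
library(dplyr)

DF <- data.frame(
  Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
  ID    = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

set.seed(121)  # any seed; only makes the shuffle reproducible

DF <- DF %>%
  mutate(
    # start from an all-NA column, then overwrite the ID == 1 positions
    # with a random permutation of two 0s and four 1s
    new_column = replace(rep(NA_real_, n()), ID == 1, sample(rep(c(0, 1), c(2, 4))))
  )

count(DF, ID, new_column)  # ID == 1: two 0s and four 1s; ID == 0: all NA
```

All twelve rows are kept, which differs from group_by(ID) %>% sample_n(4), where rows are dropped from both groups.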
