How to create a subset of data with equal random distribution in R

Question

I have the following data frame in R:

    ID    Unique_Id  Date                 Status
    I-1   UR-112     2020-01-01 14:15:16  Approved
    I-2   UR-112     2020-02-12 14:15:16  In Process
    I-3   UR-112     2020-03-23 14:15:16  In Process
    I-4   UR-113     2020-01-01 14:15:16  Hold
    I-5   UR-113     2020-04-11 14:15:16  Hold
    I-6   UR-114     2020-04-07 14:15:16  Approved
    I-7   UR-114     2020-05-08 14:15:16  Approved
    I-8   UR-114     2020-05-09 14:15:16  In Process
    I-9   UR-115     2020-01-18 14:15:16  Approved
    I-10  UR-115     2020-03-23 14:15:16  Approved
    I-11  UR-116     2020-02-11 14:15:16  Approved

I need to create a subset of 3 random Unique_Id values that is spread across all Date values, and between them these three Unique_Id values must cover every available Status.

Required output:

    ID    Unique_Id  Date                 Status
    I-1   UR-112     2020-01-01 14:15:16  Approved
    I-2   UR-112     2020-02-12 14:15:16  In Process
    I-3   UR-112     2020-03-23 14:15:16  In Process
    I-4   UR-113     2020-01-01 14:15:16  Hold
    I-5   UR-113     2020-04-11 14:15:16  Hold
    I-11  UR-116     2020-02-11 14:15:16  Approved

Maybe: x[x$Unique_Id %in% sample(unique(x$Unique_Id), 3),]
– GKi, Commented Apr 14, 2021 at 10:33
@GKi Thanks, I have tried this, but it doesn't cover the Status part.
– Sophia Wilson, Commented Apr 14, 2021 at 10:35
If there are three possible values of Status and you are limited to three random Unique_Ids and need each possible value of Status to be represented at least once, then the only possible option is to select one Unique_Id for each value of Status. If there are more than three possible values of Status, then there is no solution. Or am I missing something?
– Limey, Commented Apr 14, 2021 at 10:57
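
To see how the constraint discussed in this comment plays out on the sample data, one can tabulate which Status values each Unique_Id covers. The following is only a sketch in base R; the names `covers`, `ids_per_status` and `status_per_id` are made up for illustration, and the toy data frame repeats just the two columns it needs.

```r
# Toy data mirroring the question (only the columns this sketch needs).
x <- data.frame(
  Unique_Id = c("UR-112", "UR-112", "UR-112", "UR-113", "UR-113", "UR-114",
                "UR-114", "UR-114", "UR-115", "UR-115", "UR-116"),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved")
)

# Logical matrix: TRUE where a Unique_Id has at least one row with that Status.
covers <- unclass(table(x$Unique_Id, x$Status)) > 0

# Number of distinct Status values that have to be represented.
n_status <- ncol(covers)

# How many distinct Unique_Ids are available for each Status.
ids_per_status <- colSums(covers)

# How many distinct Status values each Unique_Id carries.
status_per_id <- rowSums(covers)
```

On the sample data there are three Status values, "Hold" is available from a single Unique_Id (UR-113), and UR-112 and UR-114 each carry two different Status values.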

GKi · Accepted Answer · 2021-04-14 11:12:59Z

Maybe using a loop like:

    id <- character(0)
    # Repeat until exactly three Unique_Ids have been collected.
    while(length(id) != 3) {
      id <- character(0)
      # For each Status, draw one Unique_Id that has not been picked yet.
      for(i in unique(x$Status)) {
        id <- c(id, sample(setdiff(x$Unique_Id[x$Status == i], id), 1))
      }
    }
    # Keep all rows belonging to the selected Unique_Ids.
    x[x$Unique_Id %in% id,]
    #     ID Unique_Id                Date     Status
    #4   I-4    UR-113 2020-01-01 14:15:16       Hold
    #5   I-5    UR-113 2020-04-11 14:15:16       Hold
    #6   I-6    UR-114 2020-04-07 14:15:16   Approved
    #7   I-7    UR-114 2020-05-08 14:15:16   Approved
    #8   I-8    UR-114 2020-05-09 14:15:16 In Process
    #9   I-9    UR-115 2020-01-18 14:15:16   Approved
    #10 I-10    UR-115 2020-03-23 14:15:16   Approved

Data:

    x <- structure(list(
      ID = c("I-1", "I-2", "I-3", "I-4", "I-5", "I-6", "I-7", "I-8", "I-9",
             "I-10", "I-11"),
      Unique_Id = c("UR-112", "UR-112", "UR-112", "UR-113", "UR-113", "UR-114",
                    "UR-114", "UR-114", "UR-115", "UR-115", "UR-116"),
      Date = c("2020-01-01 14:15:16", "2020-02-12 14:15:16",
               "2020-03-23 14:15:16", "2020-01-01 14:15:16",
               "2020-04-11 14:15:16", "2020-04-07 14:15:16",
               "2020-05-08 14:15:16", "2020-05-09 14:15:16",
               "2020-01-18 14:15:16", "2020-03-23 14:15:16",
               "2020-02-11 14:15:16"),
      Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
                 "Approved", "Approved", "In Process", "Approved", "Approved",
                 "Approved")),
      class = "data.frame", row.names = c(NA, -11L))

It takes too much time when I run it on a 1-million-row dataset.
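
One possible cause, assuming the larger dataset does not have exactly three distinct Status values: each pass of the inner `for` loop collects one Unique_Id per Status, so the condition `while(length(id) != 3)` can then never become false. Below is a minimal single-pass sketch in base R that drops the retry loop. `pick_ids` is a hypothetical helper name, and visiting the Status values with the fewest candidates first is a heuristic of this sketch, not a full matching algorithm, so it can still stop with an error on data where a valid selection exists.

```r
# Toy data mirroring the question (only the columns this sketch needs).
x <- data.frame(
  Unique_Id = c("UR-112", "UR-112", "UR-112", "UR-113", "UR-113", "UR-114",
                "UR-114", "UR-114", "UR-115", "UR-115", "UR-116"),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved")
)

# Hypothetical helper: pick one distinct Unique_Id per Status in a single pass.
pick_ids <- function(x) {
  # Unique candidate ids per Status, computed once up front.
  cand <- lapply(split(x$Unique_Id, x$Status), unique)
  # Visit the Status values with the fewest candidates first (heuristic).
  cand <- cand[order(lengths(cand))]
  id <- character(0)
  for (s in names(cand)) {
    # Candidates for this Status that have not been picked already.
    pool <- setdiff(cand[[s]], id)
    if (length(pool) == 0L) stop("no unused Unique_Id left for Status: ", s)
    # Draw one element by position so a one-element pool is returned as-is.
    id <- c(id, pool[sample.int(length(pool), 1L)])
  }
  id
}

set.seed(1)  # only to make the sketch reproducible
id <- pick_ids(x)
res <- x[x$Unique_Id %in% id, ]
```

The number of selected Unique_Ids equals the number of distinct Status values in the data, so the hard-coded 3 is no longer needed.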

chinsoon12 · Accepted Answer · 2021-04-14 23:33:47Z

Using GKi's data, here is another option:

    library(data.table)
    setDT(x)
    x[Unique_Id %chin%
        x[sample(.N)][.(unique(Status)), on=.(Status), mult="first", Unique_Id]
    ]
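
As I read it, the expression above shuffles the rows, takes the first shuffled row for each Status, and keeps every row whose Unique_Id was drawn that way. A rough base R analogue of that idea, offered as a sketch rather than an exact equivalent (the names `shuffled` and `first_per_status` are made up here), could look like this:

```r
# Toy data mirroring the question (only the columns this sketch needs).
x <- data.frame(
  Unique_Id = c("UR-112", "UR-112", "UR-112", "UR-113", "UR-113", "UR-114",
                "UR-114", "UR-114", "UR-115", "UR-115", "UR-116"),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved")
)

set.seed(42)  # only to make the sketch reproducible

# Put the rows in random order.
shuffled <- x[sample(nrow(x)), ]

# Keep the first shuffled row seen for each Status.
first_per_status <- shuffled[!duplicated(shuffled$Status), ]

# Keep every original row whose Unique_Id was drawn above.
res <- x[x$Unique_Id %in% first_per_status$Unique_Id, ]
```

In this base R version the same Unique_Id can be drawn for two different Status values (UR-112 and UR-114 each carry two), so the result may contain fewer than three distinct Unique_Ids while still covering every Status.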
