I have the following data frame in R:

    ID    Unique_Id  Date                 Status
    I-1   UR-112     2020-01-01 14:15:16  Approved
    I-2   UR-112     2020-02-12 14:15:16  In Process
    I-3   UR-112     2020-03-23 14:15:16  In Process
    I-4   UR-113     2020-01-01 14:15:16  Hold
    I-5   UR-113     2020-04-11 14:15:16  Hold
    I-6   UR-114     2020-04-07 14:15:16  Approved
    I-7   UR-114     2020-05-08 14:15:16  Approved
    I-8   UR-114     2020-05-09 14:15:16  In Process
    I-9   UR-115     2020-01-18 14:15:16  Approved
    I-10  UR-115     2020-03-23 14:15:16  Approved
    I-11  UR-116     2020-02-11 14:15:16  Approved

I need to create a subset containing 3 randomly chosen Unique_Id values. The subset should keep all rows (all Date values) for the chosen Unique_Id values, and the three Unique_Id values together must cover every available Status.

Required output:

    ID    Unique_Id  Date                 Status
    I-1   UR-112     2020-01-01 14:15:16  Approved
    I-2   UR-112     2020-02-12 14:15:16  In Process
    I-3   UR-112     2020-03-23 14:15:16  In Process
    I-4   UR-113     2020-01-01 14:15:16  Hold
    I-5   UR-113     2020-04-11 14:15:16  Hold
    I-11  UR-116     2020-02-11 14:15:16  Approved
  • Maybe: x[x$Unique_Id %in% sample(unique(x$Unique_Id), 3), ] Commented Apr 14, 2021 at 10:33
  • @GKi Thanks, I tried this, but it doesn't cover the Status part. Commented Apr 14, 2021 at 10:35
  • What conditions should be considered with Status? Commented Apr 14, 2021 at 10:40
  • @GKi All available unique values. Commented Apr 14, 2021 at 10:42
  • If there are three possible values of Status and you are limited to three random Unique_Ids and need each possible value of status to be represented at least once, then the only possible option is to select one Unique_Id for each value of Status. If there are more than three possible values of Status, then there is no solution. Or am I missing something? Commented Apr 14, 2021 at 10:57
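The coverage question raised in the comments can be checked directly on the example data. This is only a minimal sketch: `covers_all_status` is a hypothetical helper name, and the data frame is a cut-down copy of the question's data with just the two columns that matter here. It shows which Status values each Unique_Id carries and whether a candidate set of IDs covers every Status.

```r
# Cut-down copy of the question's data (only the columns needed here)
x <- data.frame(
  Unique_Id = rep(c("UR-112", "UR-113", "UR-114", "UR-115", "UR-116"),
                  times = c(3, 2, 3, 2, 1)),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved"),
  stringsAsFactors = FALSE
)

# TRUE where a Unique_Id has at least one row with that Status
status_by_id <- table(x$Unique_Id, x$Status) > 0

# Hypothetical helper: does a candidate set of IDs cover every Status in x?
covers_all_status <- function(ids, data = x) {
  setequal(unique(data$Status[data$Unique_Id %in% ids]), unique(data$Status))
}

ok_required <- covers_all_status(c("UR-112", "UR-113", "UR-116"))  # IDs from the required output
ok_no_hold  <- covers_all_status(c("UR-114", "UR-115", "UR-116"))  # none of these has "Hold"
```

In this data a single Unique_Id can carry more than one Status (UR-112 has both Approved and In Process), so the number of distinct Status values alone does not settle whether a valid set of three IDs exists.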

2 Answers

Maybe using a loop like:

    id <- character(0)
    while(length(id) != 3) {
      id <- character(0)  # start over until exactly 3 IDs are collected
      for(i in unique(x$Status)) {
        # draw one Unique_Id with this Status that has not been picked yet
        id <- c(id, sample(setdiff(x$Unique_Id[x$Status == i], id), 1))
      }
    }
    x[x$Unique_Id %in% id,]
    #     ID Unique_Id                Date     Status
    #4   I-4    UR-113 2020-01-01 14:15:16       Hold
    #5   I-5    UR-113 2020-04-11 14:15:16       Hold
    #6   I-6    UR-114 2020-04-07 14:15:16   Approved
    #7   I-7    UR-114 2020-05-08 14:15:16   Approved
    #8   I-8    UR-114 2020-05-09 14:15:16 In Process
    #9   I-9    UR-115 2020-01-18 14:15:16   Approved
    #10 I-10    UR-115 2020-03-23 14:15:16   Approved

Data:

    x <- structure(list(
      ID = c("I-1", "I-2", "I-3", "I-4", "I-5", "I-6", "I-7", "I-8", "I-9",
             "I-10", "I-11"),
      Unique_Id = c("UR-112", "UR-112", "UR-112", "UR-113", "UR-113", "UR-114",
                    "UR-114", "UR-114", "UR-115", "UR-115", "UR-116"),
      Date = c("2020-01-01 14:15:16", "2020-02-12 14:15:16",
               "2020-03-23 14:15:16", "2020-01-01 14:15:16",
               "2020-04-11 14:15:16", "2020-04-07 14:15:16",
               "2020-05-08 14:15:16", "2020-05-09 14:15:16",
               "2020-01-18 14:15:16", "2020-03-23 14:15:16",
               "2020-02-11 14:15:16"),
      Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
                 "Approved", "Approved", "In Process", "Approved", "Approved",
                 "Approved")),
      class = "data.frame", row.names = c(NA, -11L))

2 Comments

It takes too much time when I run it on a dataset with 1 million rows.
Is there a way to make the process fast?
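One thing that might help with the speed concern is to run the loop on the distinct (Unique_Id, Status) pairs instead of on every row, and to filter the full data only once at the end. The following is a minimal sketch under that assumption, not a benchmarked solution; `pairs` and `candidates` are illustrative names, the data is a cut-down copy of the example, and `set.seed()` is there only to make the draw repeatable.

```r
# Cut-down copy of the example data (only the columns needed here)
x <- data.frame(
  Unique_Id = rep(c("UR-112", "UR-113", "UR-114", "UR-115", "UR-116"),
                  times = c(3, 2, 3, 2, 1)),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved"),
  stringsAsFactors = FALSE
)

# Reduce to one row per (Unique_Id, Status) pair before sampling
pairs <- unique(x[c("Unique_Id", "Status")])

set.seed(1)  # only for a repeatable draw
id <- character(0)
for (s in unique(pairs$Status)) {
  # IDs with this Status that have not been picked yet
  candidates <- setdiff(pairs$Unique_Id[pairs$Status == s], id)
  if (length(candidates) == 0) stop("no unused Unique_Id left for status ", s)
  # Index-based draw, so a single candidate is returned as-is
  id <- c(id, candidates[sample.int(length(candidates), 1)])
}

# Filter the full data once, using the selected IDs
result <- x[x$Unique_Id %in% id, ]
```

Like the loop in the answer, this draws one unused Unique_Id per Status, so it yields three IDs only when there are exactly three distinct Status values.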
Using GKi's data, here is another option:

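    library(data.table)

    setDT(x)  # convert x to a data.table by reference
    # shuffle the rows, take the first row per Status, then keep all rows of those Unique_Id
    x[Unique_Id %chin% x[sample(.N)][.(unique(Status)), on = .(Status), mult = "first", Unique_Id]]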
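As far as I can tell, the one-liner shuffles the rows, takes the first row it finds for each Status, and keeps every row whose Unique_Id was picked that way. A rough base R sketch of the same idea, on a cut-down copy of the example data, might look like this (`shuffled` and `first_per_status` are illustrative names, and `set.seed()` is only there for a repeatable draw):

```r
# Cut-down copy of the example data (only the columns needed here)
x <- data.frame(
  Unique_Id = rep(c("UR-112", "UR-113", "UR-114", "UR-115", "UR-116"),
                  times = c(3, 2, 3, 2, 1)),
  Status = c("Approved", "In Process", "In Process", "Hold", "Hold",
             "Approved", "Approved", "In Process", "Approved", "Approved",
             "Approved"),
  stringsAsFactors = FALSE
)

set.seed(42)  # only for a repeatable draw

# Put the rows in random order
shuffled <- x[sample(nrow(x)), ]

# Keep the first row seen for each Status in the shuffled order
first_per_status <- shuffled[!duplicated(shuffled$Status), ]

# The same Unique_Id may be the first hit for more than one Status
id <- unique(first_per_status$Unique_Id)

# All rows belonging to the selected IDs
result <- x[x$Unique_Id %in% id, ]
```

Every Status ends up covered, but the same Unique_Id can be selected for two Status values (UR-112 and UR-114 each have both Approved and In Process rows), so this approach can return fewer than three IDs.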
