Random subsets as a function of one column in R

Question

I want to randomly extract n rows from a data frame for each value of one column. Take this example:

# Reproducible example
df <- as.data.frame(matrix(0, 2e+6, 2))
df$V1 <- runif(nrow(df), 0, 1)
df$V2 <- sample(c(1:10), nrow(df), replace = TRUE)
df$V3 <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)

I want to extract, for example, n = 10 rows for each value of V2.

# Example of what I need with one value of V2
df1 <- df[which(df$V2 == 1), ]
str(df1)
df1[sample(1:nrow(df1), 10), ]

I do not want to use a for loop, so I tried this line with tapply:

df_objective <- tapply(df$V1, df$V2, function(x) df[sample(1:nrow(df),10),"V2"])

which is close to what I want but I lost the third column of the data frame.

I tried this to get complete subsets:

df_objective <- by(cbind(df$V1,df$V3), df$V2, function(x) df[sample(1:nrow(df),10),"V2"])

but it does not help.

How can I keep all the columns in the subsets?

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-05-12 15:48:57Z

It sounds like you're just looking for something like sample_n from "dplyr":

library(dplyr)
df %>% group_by(V2) %>% sample_n(10)
# Source: local data frame [100 x 3]
# Groups: V2
#
#            V1 V2 V3
# 1  0.51099392  1  B
# 2  0.87098866  1  A
# 3  0.13647752  1  B
# 4  0.15348834  1  B
# 5  0.94096127  1  B
# 6  0.05673849  1  A
# 7  0.69960842  1  C
# 8  0.02246671  1  C
# 9  0.88903430  1  B
# 10 0.52128253  1  A
# ..        ... .. ..
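For readers on dplyr 1.0.0 or later, sample_n is superseded by slice_sample. A minimal sketch of the same grouped sampling, assuming a small made-up data frame (`small`, with exactly 30 rows per group) in place of the 2-million-row `df` from the question:

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical small stand-in for df: 10 groups of exactly 30 rows each
small <- data.frame(
  V1 = runif(300),
  V2 = rep(1:10, each = 30),
  V3 = sample(c("A", "B", "C"), 300, replace = TRUE)
)

# Draw 10 rows at random within each value of V2, keeping every column
picked <- small %>%
  group_by(V2) %>%
  slice_sample(n = 10) %>%
  ungroup()

table(picked$V2)  # number of sampled rows per group
```

Note that slice_sample returns all rows of a group that has fewer than n rows, so it is worth checking the group sizes first if some values of V2 are rare.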

Alternatively, there's stratified from my "splitstackshape" package.

library(splitstackshape)
stratified(df, "V2", 10)

That's pretty concise, there, though sampling isn't the first thing that comes to mind when I see the word "stratified"
@Frank, it's as in "stratified sampling". What do you think of when you see "stratified"?

akrun · Accepted Answer · 2015-05-12 15:52:13Z


You can try

library(data.table)
setDT(df)[, .SD[sample(.N, 10)], V2]

Or a faster option as suggested by @Frank

setDT(df)[df[,sample(.I,10),V2]$V1]


4 Comments

Frank Over a year ago

Obligatory alternative: setDT(df); df[df[,sample(.I,10),V2]$V1]

user3443183 Over a year ago

Ok thank you, the second line works, but the V2 seems to be randomly ordered.

Frank Over a year ago

@user3443183 If you write keyby=V2 in place of V2 it should be ordered. They currently show up in order of first appearance.

user3443183 Over a year ago

Ok, it is ordered now.
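The difference between by and keyby discussed in these comments can be seen on a small table. This is only a sketch: the table `dt`, the descending group order, and the column name `row` (used instead of the auto-generated V1, to avoid confusion with the data column V1) are assumptions made for illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical small table whose groups first appear in descending order
dt <- data.table(
  V1 = runif(300),
  V2 = rep(10:1, each = 30),
  V3 = sample(c("A", "B", "C"), 300, replace = TRUE)
)

# by = V2: groups are returned in order of first appearance (10, 9, ..., 1)
idx_by <- dt[, .(row = sample(.I, 10)), by = V2]

# keyby = V2: same sampling, but the result is sorted by V2 (1, 2, ..., 10)
idx_keyby <- dt[, .(row = sample(.I, 10)), keyby = V2]

# Use the sampled row numbers to pull complete rows from the table
res_by <- dt[idx_by$row]
res_keyby <- dt[idx_keyby$row]

unique(res_by$V2)     # group order with by
unique(res_keyby$V2)  # group order with keyby
```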

Frank · Accepted Answer · 2015-05-12 15:49:16Z

You want to sample from the rows, so that should be the first arg to tapply, not V1:

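# Sample 10 row indices of df within each value of V2
myrows <- unlist(tapply(1:nrow(df), df$V2, sample, size = 10))
# The indices refer to rows of df, so subset df itself to keep all columns
df[myrows, ]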
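The row-index idea can be checked on a small frame using base R only. This is a sketch under assumptions: `small` is a made-up stand-in for df with exactly 30 rows per group, and `rows` and `subset10` are illustrative names.

```r
set.seed(123)  # assumption: fixed seed only so the sketch is repeatable

# Hypothetical small stand-in for df: 10 groups of exactly 30 rows each
small <- data.frame(
  V1 = runif(300),
  V2 = rep(1:10, each = 30),
  V3 = sample(c("A", "B", "C"), 300, replace = TRUE)
)

# tapply splits the row numbers by V2 and samples 10 of them per group;
# unlist flattens the per-group results into one vector of row numbers
rows <- unlist(tapply(seq_len(nrow(small)), small$V2, sample, size = 10))

# Indexing the original frame by row number keeps every column
subset10 <- small[rows, ]

table(subset10$V2)  # number of sampled rows per group
names(subset10)     # all three columns are still present
```

One caveat: sample(x, size) treats a single number x as the range 1:x, so a group containing only one row would need separate handling.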
