
I want to randomly extract n rows from a data frame for each value of one column. Take this example:

# Reproducible example
df <- as.data.frame(matrix(0,2e+6,2))
df$V1 <- runif(nrow(df),0,1)
df$V2 <- sample(c(1:10),nrow(df), replace=TRUE)
df$V3 <- sample(c("A","B","C"),nrow(df), replace=TRUE)

I want to extract, for example, n = 10 rows for each value of V2.

# Example of what I need with one value of V2
df1 <- df[which(df$V2==1),]
str(df1)
df1[sample(1:nrow(df1),10),]

I do not want to use a for loop, so I tried this line with tapply:

df_objective <- tapply(df$V1, df$V2, function(x) df[sample(1:nrow(df),10),"V2"]) 

which is close to what I want, but I lose the third column of the data frame.

I tried this to get complete subsets:

df_objective <- by(cbind(df$V1,df$V3), df$V2, function(x) df[sample(1:nrow(df),10),"V2"]) 

but it does not help.

How can I keep all the columns in the subsets?


3 Answers


It sounds like you're just looking for something like sample_n from "dplyr":

library(dplyr)
df %>% group_by(V2) %>% sample_n(10)
# Source: local data frame [100 x 3]
# Groups: V2
#
#            V1 V2 V3
# 1  0.51099392  1  B
# 2  0.87098866  1  A
# 3  0.13647752  1  B
# 4  0.15348834  1  B
# 5  0.94096127  1  B
# 6  0.05673849  1  A
# 7  0.69960842  1  C
# 8  0.02246671  1  C
# 9  0.88903430  1  B
# 10 0.52128253  1  A
# ..        ... .. ..

Alternatively, there's stratified from my "splitstackshape" package.

library(splitstackshape)
stratified(df, "V2", 10)
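As a side note on the "dplyr" approach: in dplyr 1.0.0 and later, sample_n is marked as superseded by slice_sample. A minimal sketch of the same grouped sampling, assuming a reasonably recent dplyr is installed; the smaller data frame `small` and the seed are assumptions made only to keep the example quick and repeatable:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Smaller stand-in for the question's df (hypothetical name `small`)
small <- data.frame(
  V1 = runif(1000),
  V2 = sample(1:10, 1000, replace = TRUE),
  V3 = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# Draw 10 rows within each V2 group; every column is kept
out <- small %>%
  group_by(V2) %>%
  slice_sample(n = 10) %>%
  ungroup()
```

Note that slice_sample returns all rows of a group that has fewer than n rows instead of raising an error, so it may be worth checking group sizes first if that matters.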

3 Comments

That's pretty concise, there, though sampling isn't the first thing that comes to mind when I see the word "stratified"
@Frank, it's as in "stratified sampling". What do you think of when you see "stratified"?
Dunno, hardly ever see it. I guess "hierarchical models"

You can try

library(data.table)
setDT(df)[, .SD[sample(.N, 10)] , V2]

Or a faster option as suggested by @Frank

setDT(df)[df[,sample(.I,10),V2]$V1] 

4 Comments

Obligatory alternative: setDT(df); df[df[,sample(.I,10),V2]$V1]
Ok thank you, the second line works, but the V2 seems to be randomly ordered.
@user3443183 If you write keyby=V2 in place of V2 it should be ordered. They currently show up in order of first appearance.
Ok, it is ordered now.
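The keyby suggestion from the comments can be sketched as follows. This is only an illustration on a smaller table; the names `dt`, `idx`, `row` and `res` and the seed are assumptions, not part of the original answer:

```r
library(data.table)

set.seed(42)  # assumption: fixed seed so the sketch is repeatable

# Smaller stand-in for the question's data (hypothetical name `dt`)
dt <- data.table(
  V1 = runif(1000),
  V2 = sample(1:10, 1000, replace = TRUE),
  V3 = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# .I holds the row numbers of the current group; keyby sorts the groups by V2
idx <- dt[, .(row = sample(.I, 10)), keyby = V2]$row

# Subset the full table by the sampled row numbers, so all columns are kept
res <- dt[idx]
```

Naming the sampled column explicitly (`row`) avoids the auto-generated name V1, which is easy to confuse with the data column V1.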

You want to sample from the rows, so that should be the first arg to tapply, not V1:

myrows <- unlist(tapply(1:nrow(df),df$V2,sample,size=10))
df[myrows,]
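The same index-based idea can be wrapped in a small base R helper. This is a sketch under stated assumptions: `sample_rows_by_group` is a hypothetical name, and the guard for small groups is an addition, not part of the answer above.

```r
set.seed(123)  # assumption: fixed seed so the sketch is repeatable

# Smaller stand-in for the question's df (hypothetical name `small`)
small <- data.frame(
  V1 = runif(1000),
  V2 = sample(1:10, 1000, replace = TRUE),
  V3 = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# Hypothetical helper: sample up to n row indices per group, then subset once
sample_rows_by_group <- function(data, group, n) {
  idx_by_group <- split(seq_len(nrow(data)), data[[group]])
  picked <- lapply(idx_by_group, function(idx) {
    # Sample positions within idx, so a one-row group is not
    # expanded to 1:idx by sample()
    idx[sample.int(length(idx), min(n, length(idx)))]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

res <- sample_rows_by_group(small, "V2", 10)
table(res$V2)
```

Sampling row indices and subsetting the whole data frame once is what keeps every column, which is the part the tapply attempt in the question was missing.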
