456

I am struggling to find a function that returns a specified number of rows picked at random, without replacement, from a data frame in R. Can anyone help me out?


14 Answers

569
Answer recommended by R Language Collective

First make some data:

    > df = data.frame(matrix(rnorm(20), nrow=10))
    > df
               X1         X2
    1   0.7091409 -1.4061361
    2  -1.1334614 -0.1973846
    3   2.3343391 -0.4385071
    4  -0.9040278 -0.6593677
    5   0.4180331 -1.2592415
    6   0.7572246 -0.5463655
    7  -0.8996483  0.4231117
    8  -1.0356774 -0.1640883
    9  -0.3983045  0.7157506
    10 -0.9060305  2.3234110

Then select some rows at random:

    > df[sample(nrow(df), 3), ]
               X1         X2
    9  -0.3983045  0.7157506
    2  -1.1334614 -0.1973846
    10 -0.9060305  2.3234110

12 Comments

@nikhil See here and here for starters. You can also type ?sample in the R console to read about that function.
Can someone explain why sample(df,3) does not work? Why do you need df[sample(nrow(df), 3), ]?
@stackoverflowuser2010, you can type ?sample and see that the first argument in the sample function must be a vector or a positive integer. I don't think a data.frame works as a vector in this case.
Remember to set your seed (e.g. set.seed(42) ) every time you want to reproduce that specific sample.
sample.int would be slightly faster I believe: library(microbenchmark);microbenchmark( sample( 10000, 100 ), sample.int( 10000, 100 ), times = 10000 )
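
The seed and sample.int suggestions in the comments above can be combined into a small base R sketch. The data frame `df` is illustrative, and 42 is simply the seed value mentioned in the comment; resetting the seed before each draw should reproduce the same rows.

```r
# Illustrative data; any data frame works the same way.
df <- data.frame(matrix(rnorm(20), nrow = 10))

# Fix the seed, then draw 3 row indices without replacement.
set.seed(42)
first_draw <- df[sample.int(nrow(df), 3), ]

# Resetting the same seed reproduces the same row indices.
set.seed(42)
second_draw <- df[sample.int(nrow(df), 3), ]

# Compare the two draws.
identical(first_draw, second_draw)
```

Note that the seed has to be set immediately before the call to be reproduced; any other random number generation in between changes the result.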
308

John Colby's answer is the right one. However, if you are a dplyr user, there is also sample_n:

sample_n(df, 10) 

This randomly samples 10 rows from the data frame. It calls sample.int, so it is really the same answer with less typing (and it is simpler to use with magrittr pipes, since the data frame is the first argument).

2 Comments

As of dplyr 1.0.0, sample_n (and sample_frac) have been superseded by slice_sample, though they remain for now.
This appears to sample without replacement, and hence also outputs a sample of size min(nrow(df), 10), so this might not be what is needed.
50

With the data.table package, the expression DT[sample(.N, M)] samples M random rows from the data table DT.

    library(data.table)
    set.seed(10)

    mtcars <- data.table(mtcars)
    mtcars[sample(.N, 6)]

        mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2


36

Write one! Wrapping JC's answer gives me:

    randomRows = function(df,n){
       return(df[sample(nrow(df),n),])
    }

Now make it better by checking first if n<=nrow(df) and stopping with an error.
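
The suggested check could look roughly like the sketch below. The name `randomRowsChecked` and the error message are hypothetical, chosen for illustration. Base R's sample() already stops when asked for more items than exist without replacement, so the explicit check mainly gives a clearer message.

```r
# Hypothetical stricter version of randomRows(); the name and the
# error message are illustrative, not taken from any package.
randomRowsChecked <- function(df, n) {
  stopifnot(is.data.frame(df), is.numeric(n), length(n) == 1, n >= 0)
  if (n > nrow(df)) {
    stop("cannot sample ", n, " rows from a data frame with ",
         nrow(df), " rows without replacement")
  }
  # Draw n distinct row indices; drop = FALSE keeps a data frame
  # even when df has a single column.
  df[sample.int(nrow(df), n), , drop = FALSE]
}

df <- data.frame(x = 1:5, y = letters[1:5])
randomRowsChecked(df, 3)          # three distinct rows
try(randomRowsChecked(df, 10))    # reports an error instead of sampling
```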


30

Just for completeness' sake:

dplyr can also draw a proportion (fraction) of the rows with

df %>% sample_frac(0.33) 

This is very convenient, e.g. in machine learning, when you need a particular split ratio such as 80%:20%.
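
For readers without dplyr, a roughly equivalent fraction draw can be sketched in base R. The helper name `sample_fraction` and the choice to round to the nearest row count are assumptions made for illustration; dplyr's own rounding rule may differ.

```r
# Hypothetical helper: draw a given fraction of rows without replacement.
sample_fraction <- function(df, frac) {
  stopifnot(is.data.frame(df), frac >= 0, frac <= 1)
  n <- round(frac * nrow(df))   # assumption: round to the nearest row count
  df[sample.int(nrow(df), n), , drop = FALSE]
}

df <- data.frame(id = 1:100, value = rnorm(100))
third <- sample_fraction(df, 0.33)
nrow(third)
```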


13

As @matt_b indicates, sample_n() & sample_frac() have been soft deprecated in favour of slice_sample(). See the dplyr docs.

Example from docstring:

    # slice_sample() allows you to random select with or without replacement
    mtcars %>% slice_sample(n = 5)
    mtcars %>% slice_sample(n = 5, replace = TRUE)


10

Outdated answer. Please use dplyr::slice_sample() instead.

In my R package there is a function sample.rows just for this purpose:

    install.packages('kimisc')

    library(kimisc)
    example(sample.rows)

    smpl..> set.seed(42)

    smpl..> sample.rows(data.frame(a=c(1,2,3), b=c(4,5,6),
                               row.names=c('a', 'b', 'c')), 10, replace=TRUE)
        a b
    c   3 6
    c.1 3 6
    a   1 4
    c.2 3 6
    b   2 5
    b.1 2 5
    c.3 3 6
    a.1 1 4
    b.2 2 5
    c.4 3 6

Enhancing sample by making it a generic S3 function was a bad idea, according to comments by Joris Meys to a previous answer.

1 Comment

A note from ?sample_frac: "[Superseded] ‘sample_n()’ and ‘sample_frac()’ have been superseded in favour of ‘slice_sample()’"
9

EDIT: This answer is now outdated, see the updated version.

In my R package I have enhanced sample so that it now behaves as expected also for data frames:

    library(devtools); install_github('kimisc', 'krlmlr')

    library(kimisc)
    example(sample.data.frame)

    smpl..> set.seed(42)

    smpl..> sample(data.frame(a=c(1,2,3), b=c(4,5,6),
                           row.names=c('a', 'b', 'c')), 10, replace=TRUE)
        a b
    c   3 6
    c.1 3 6
    a   1 4
    c.2 3 6
    b   2 5
    b.1 2 5
    c.3 3 6
    a.1 1 4
    b.2 2 5
    c.4 3 6

This is achieved by making sample an S3 generic method and providing the necessary (trivial) functionality in a function. A call to setMethod fixes everything. The original implementation still can be accessed through base::sample.

10 Comments

What is unexpected about its treatment of data frames?
@adifferentben: When I call sample.default(df, ...) for a data frame df, it samples from the columns of the data frame, as a data frame is implemented as a list of vectors of the same length.
Is your package still available? I ran install_github('kimisc', 'krlmlr') and got Error: Does not appear to be an R package (no DESCRIPTION). Any way around that?
@JorisMeys: Agreed, except for the "as expected" part. Just because a data frame is implemented as a list internally, it doesn't mean it should behave as one. The [ operator for data frames is a counterexample. Also, please tell me: Have you ever, just one single time, used sample to sample columns from a data frame?
@krlmlr The [ operator is not a counterexample: iris[2] works like a list, as does iris[[2]]. Or iris$Species, lapply(iris, mean), ... Data frames are lists. So I expect them to behave like them. And yes, I have actually used sample(myDataframe). On a dataset where every variable contains expression data of a single gene. Your specific method helps novice users, but it also effectively changes the way sample() behaves. Note I use "as expected" from a programmer's view. Which is different from the general intuition. There's a lot in R that's not compatible with general intuition... ;)
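
The behaviour discussed in these comments can be checked with a small base R sketch. The data frame below is illustrative; because a data frame is a list of columns, base sample() permutes the columns, and sampling rows needs explicit row indices.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# A data frame is a list of columns, so length(df) is the column count
# and base sample() permutes columns, not rows.
length(df)
shuffled_cols <- sample(df)
names(shuffled_cols)   # some ordering of "a", "b", "c"
nrow(shuffled_cols)    # the number of rows is unchanged

# Sampling (here: shuffling) rows needs explicit row indices.
shuffled_rows <- df[sample(nrow(df)), ]
```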
8

You could do this:

    library(dplyr)

    cols <- paste0("a", 1:10)
    tab <- matrix(1:1000, nrow = 100) %>% as.tibble() %>% set_names(cols)
    tab

    # A tibble: 100 x 10
          a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
       <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
     1     1   101   201   301   401   501   601   701   801   901
     2     2   102   202   302   402   502   602   702   802   902
     3     3   103   203   303   403   503   603   703   803   903
     4     4   104   204   304   404   504   604   704   804   904
     5     5   105   205   305   405   505   605   705   805   905
     6     6   106   206   306   406   506   606   706   806   906
     7     7   107   207   307   407   507   607   707   807   907
     8     8   108   208   308   408   508   608   708   808   908
     9     9   109   209   309   409   509   609   709   809   909
    10    10   110   210   310   410   510   610   710   810   910
    # ... with 90 more rows

Above I just made a dataframe with 10 columns and 100 rows, ok?

Now you can sample it with sample_n:

    sample_n(tab, size = 800, replace = T)

    # A tibble: 800 x 10
          a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
       <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
     1    53   153   253   353   453   553   653   753   853   953
     2    14   114   214   314   414   514   614   714   814   914
     3    10   110   210   310   410   510   610   710   810   910
     4    70   170   270   370   470   570   670   770   870   970
     5    36   136   236   336   436   536   636   736   836   936
     6    77   177   277   377   477   577   677   777   877   977
     7    13   113   213   313   413   513   613   713   813   913
     8    58   158   258   358   458   558   658   758   858   958
     9    29   129   229   329   429   529   629   729   829   929
    10     3   103   203   303   403   503   603   703   803   903
    # ... with 790 more rows
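
For comparison, sampling with replacement can also be sketched in base R, which is what allows a sample larger than the original data. The object names below are illustrative. Base R data frames keep row names unique, so repeated rows get suffixed names.

```r
# Illustrative data: 100 rows, 10 columns (X1 ... X10).
tab <- data.frame(matrix(1:1000, nrow = 100))

# replace = TRUE lets the same row index be drawn more than once,
# so the requested size may exceed nrow(tab).
idx <- sample(nrow(tab), size = 800, replace = TRUE)
big_sample <- tab[idx, ]

nrow(big_sample)
head(rownames(big_sample))   # repeated rows get suffixes such as ".1"
```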


6

You could do this:

sample_data = data[sample(nrow(data), sample_size, replace = FALSE), ] 


6

The 2021 way of doing this in the tidyverse is:

    library(tidyverse)

    df = data.frame(
      A = letters[1:10],
      B = 1:10
    )

    df
    #>    A  B
    #> 1  a  1
    #> 2  b  2
    #> 3  c  3
    #> 4  d  4
    #> 5  e  5
    #> 6  f  6
    #> 7  g  7
    #> 8  h  8
    #> 9  i  9
    #> 10 j 10

    df %>% sample_n(5)
    #>   A  B
    #> 1 e  5
    #> 2 g  7
    #> 3 h  8
    #> 4 b  2
    #> 5 j 10

    df %>% sample_frac(0.5)
    #>   A  B
    #> 1 i  9
    #> 2 g  7
    #> 3 j 10
    #> 4 c  3
    #> 5 b  2

Created on 2021-10-05 by the reprex package (v2.0.0.9000)


5

Select a random sample from a tibble in R:

    library("tibble")
    a <- your_tibble[sample(1:nrow(your_tibble), 150),]

nrow takes a tibble and returns its number of rows. The first argument passed to sample is the sequence of row indices from 1 to nrow(your_tibble). The second argument, 150, is the number of rows to draw. The square brackets then select the rows at the sampled indices, and the result is assigned to the variable a.


3

I'm new to R, but I have been using this easy method that works for me:

sample_of_diamonds <- diamonds[sample(nrow(diamonds),100),] 

PS: Feel free to note if it has some drawback I'm not thinking about.

6 Comments

Suppose I have 1000 rows in my df. After applying your code, 100 rows are selected at random. How can I store the remaining 900 rows (the ones that were not selected)?
@Akib62 try (rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)])
Not working. When I am using your code (given in the comment) getting the same output as the diamonds or main dataset.
@Akib62 since that selects the elements not in sample_of_diamonds, can you confirm sample_of_diamonds is not empty? That could explain your problem.
Say, I have 20 rows in my dataset. So when I am applying sample_of_diamonds <- diamonds[sample(nrow(diamonds),10),] I am getting 10 rows randomly and rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)] I am getting 20 rows (main dataset)
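
One way to get the remaining rows, sketched here in base R, is to keep the sampled row indices and drop them with negative indexing. The `%in%` attempt in the comments appears to compare columns rather than rows, because a data frame is a list of columns, which would explain why the full data set comes back. The data frame below is a stand-in for diamonds.

```r
# Stand-in data; replace with your own data frame.
df <- data.frame(id = 1:20, value = rnorm(20))

# Keep the sampled indices so the complement can be recovered.
picked <- sample(nrow(df), 10)
sample_rows <- df[picked, ]
rest_rows   <- df[-picked, ]   # every row whose index was not picked

nrow(sample_rows)
nrow(rest_rows)
```

If the sample size can be zero, `df[setdiff(seq_len(nrow(df)), picked), ]` is safer, since negative indexing with an empty index vector selects no rows at all.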
-1
    df <- data.frame(
      ID = 1:10,
      Name = LETTERS[1:10],
      Score = sample(50:100, 10, replace = TRUE)
    )

    n <- 10
    random_rows <- df[sample(nrow(df), n), ]
    print(random_rows)

1 Comment

Thank you for your interest in contributing to the Stack Overflow community. This question already has quite a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?
