0

I have this final dataset of roughly 150 000 rows per 40 columns that covers all my potential samples from 1932 to 2016, and I need to make a random selection of 53 samples per year for a total number of ~5000.

The selection in itself is really straight forward using the sample() function to get a subset, however I need to display the selection in the original dataframe to be able to check various things. My issue is the following:

If I edit one of the fields in my random subset and merge it back with the main one, it creates duplicates that I can't remove because one field changed and thus R considers the two rows aren't duplicates. If I don't edit anything, I can't find which rows were selected.

My solution for now was to merge everything in Excel instead of R, apply color codes to highlight the selected rows and delete manually the duplicates. However it's time consuming, prone to mistakes and not practicable as the dataset seems to be too big and my PC quickly runs out of memory when I try...

UPDATE:

Here's a reproducible example:

dat <- data.frame( X = sample(2000:2016, 50, replace=TRUE), Y = sample(c("yes", "no"), 50, replace = TRUE), Z = sample(c("french","german","english"), 50, replace=TRUE) ) dat2 <- subset(dat, dat$X==2000) #samples of year 2000 sc <- dat2[sample(nrow(dat2), 1), ] #Random selection of 1 

What I would like to do is select directly in the dataset (dat1), for example by randomly assigning the value "1" in a column called "selection". Or, if not possible, how can I merge the sampled rows (here called "sc") back to the main dataset but with something indicating they have been sampled

Note:

I've been using R sporadically for the last 2 years and I'm a fairly inexperienced user, so I apologize if this is a silly question. I've been roaming Google and SO for the last 3 days and couldn't find any relevant answer yet.

I recently got in a PhD program in biology that requires me to handle a lot of data from an archive.

4
  • 1
    Welcome to StackOverflow! Please read the info about how to ask a good question and how to give a reproducible example. This will make it much easier for others to help you. Commented Jan 19, 2018 at 9:58
  • 1
    Sorry, it's really not clear what exactly you're asking. Please provide a reproducible example on a minimal data set Commented Jan 19, 2018 at 9:58
  • Please provide a reproducible example. To me your latest update dose not clarify your question. Commented Jan 19, 2018 at 10:20
  • Updated with a reproducible example Commented Jan 19, 2018 at 10:47

1 Answer 1

1

EDIT: updated based on comments.

You could add a column that indicates if a row is part of your sample. So maybe try the following:

df = data.frame(year= c(1,1,1,1,1,1,2,2,2,2,2,2), id=c(1,2,3,4,5,6,7,8,9,10,11,12),age=c(7,7,7,12,12,12,7,7,7,12,12,12)) library(dplyr) n_per_year_low_age = 2 n_per_year_high_age = 1 df <- df %>% group_by(year) %>% mutate(in_sample1 = as.numeric(id %in% sample(id[age<8],n_per_year_low_age))) %>% mutate(in_sample2 = as.numeric(id %in% sample(id[age>8],n_per_year_high_age))) %>% mutate(in_sample = in_sample1+in_sample2) %>% select(-in_sample1,-in_sample2) 

Output:

# A tibble: 12 x 4 # Groups: year [2] year id age in_sample <dbl> <dbl> <dbl> <dbl> 1 1.00 1.00 7.00 1.00 2 1.00 2.00 7.00 1.00 3 1.00 3.00 7.00 0 4 1.00 4.00 12.0 1.00 5 1.00 5.00 12.0 0 6 1.00 6.00 12.0 0 7 2.00 7.00 7.00 1.00 8 2.00 8.00 7.00 0 9 2.00 9.00 7.00 1.00 10 2.00 10.0 12.0 0 11 2.00 11.0 12.0 0 12 2.00 12.0 12.0 1.00 

Futher operations are then trivial:

# extracting your sample df %>% filter(in_sample==1) # comparing statistics of your sample against the rest of the population df %>% group_by(year,in_sample) %>% summarize(mean(id)) 
Sign up to request clarification or add additional context in comments.

5 Comments

Thanks for the example. However when trying your code I run into an error for the last two lines, telling me the column "in_sample" doesn't exist. I'm not familiar with dplyr, but it seems it allows for direct operations on the dataframe without actually modifying it?
Hi C.Den! In my answer I did not store the modified object, you have to assign it back to the variable again. So df <- df %>% group_by(year) %>% mutate( in_sample = as.numeric(id %in% sample(id,n_per_year))) instead of just df %>% group_by(year) %>% mutate( in_sample = as.numeric(id %in% sample(id,n_per_year))). I modifed my answer accordingly, hope that helps!
Damn, I feel stupid now. Your solution seems nice, now I need to adapt it to my dataset. In my post I oversimplified but in reality my selection is composed as follows: for each year I have to select 50 samples of age 8 (for chronology) + 3 samples of age >10 (for isotopes). Because the archives have a high proportion of non usable samples (~40%) I always do a selection of 100+6 samples to have some margin. How could I implement these selection criteria using dplyr's syntax?
No problem, I updated my answer, I hope that explains itself. You can easily adjust the conditions age<8, e.g. age==8. In the future however, please consider posing any additional adjustments to your question as a new question ;) Hope this helps!
Alright, will keep that in mind! Thanks for all these answers.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.