1

I have the following dataset (32,000 entries) of annual means of chemical compounds in water, organized by monitoring site and sampling year:

data <- data.frame(
  Site_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year       = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

Site_ID Year AnnualMean
      1 1976        1.1
      1 1977        1.2
      1 1978        1.1
      2 2004        2.1
      2 2005        2.6
      2 2006        3.1
      3 2003        2.7
      3 2004        2.6
      3 2005        1.9

I would like to keep only the data from monitoring sites that have at least one measurement in 2005 within their time range. With the above dataset, the expected output would be:

Site_ID Year AnnualMean
      2 2004        2.1
      2 2005        2.6
      2 2006        3.1
      3 2003        2.7
      3 2004        2.6
      3 2005        1.9

I am completely new to R and have been struggling with data manipulation, so thank you in advance!

3 Answers

4

With dplyr:

library(dplyr)

data %>%
  group_by(Site_ID) %>%
  filter(2005 %in% Year)
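To try this on the sample data from the question, here is a self-contained sketch. The trailing `ungroup()` is my addition rather than part of the answer; it only matters if later steps should not stay grouped by `Site_ID`.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  Site_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year       = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Within each site, `2005 %in% Year` is a single TRUE/FALSE,
# so filter() keeps or drops all rows of that site at once
result <- data %>%
  group_by(Site_ID) %>%
  filter(2005 %in% Year) %>%
  ungroup()  # assumption: downstream code expects an ungrouped tibble

unique(result$Site_ID)  # sites 2 and 3 remain
```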

3 Comments

And this is why dplyr is so elegant. This would be a base R way of doing it:
data[data$Site_ID %in% data[data$Year %in% 2005, ]$Site_ID, ]
As a small revision to my question, how should I write the filter to select only the data from monitoring sites that have at least 10 measurements between 1990 and 2005? I tried filter(n()>=10 %in% between(phenomenonTimeReferenceYear, 1990, 2012)) without success. Thank you!
That's quite a bit different - I'd recommend a new question.
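For readers with the same follow-up, one possible approach is to count, per site, how many years fall inside the window and compare that count to a threshold. This is only a sketch under assumptions: `min_n`, `from` and `to` are illustrative names, the year column is assumed to be `Year` as in the question, and the values are shrunk to fit the nine-row sample data. As far as I can tell, the attempt in the comment keeps every row because `%in%` binds more tightly than `>=`, so it ends up comparing `n()` against a logical value rather than counting anything.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  Site_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year       = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

min_n <- 2     # hypothetical threshold; the comment asks for 10
from  <- 2005  # hypothetical window; the comment asks for 1990 to 2005
to    <- 2006

# between() is inclusive on both ends; sum() counts the TRUE values per site
result <- data %>%
  group_by(Site_ID) %>%
  filter(sum(between(Year, from, to)) >= min_n) %>%
  ungroup()

unique(result$Site_ID)  # only site 2 has two measurements in 2005-2006
```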
1

Here is a base R solution, using subset + ave

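# `data` is the data frame from the question; ave() flags every row of a site that has a 2005 measurement
dfout <- subset(data, !!ave(Year, Site_ID, FUN = function(x) 2005 %in% x))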

such that

> dfout
  Site_ID Year AnnualMean
4       2 2004        2.1
5       2 2005        2.6
6       2 2006        3.1
7       3 2003        2.7
8       3 2004        2.6
9       3 2005        1.9
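To see what the `ave` call contributes, the sketch below splits the one-liner into steps on the question's sample data (named `data` here, as in the question). `has_2005` is an illustrative name, not something from the answer.

```r
# Sample data from the question
data <- data.frame(
  Site_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year       = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# ave() applies FUN to the Year values of each site and recycles the
# result across that site's rows; because Year is numeric, the logical
# result is stored as 0/1
has_2005 <- ave(data$Year, data$Site_ID, FUN = function(x) 2005 %in% x)
has_2005  # 0 0 0 1 1 1 1 1 1

# as.logical() turns the 0/1 flags back into a row selector
dfout <- data[as.logical(has_2005), ]
```

Row names 4 to 9 are preserved in `dfout`, which matches the printed output above.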

0

An option with data.table

library(data.table)

setDT(data)[, .SD[2005 %in% Year], Site_ID]
