Difference between subset and filter from dplyr

Question

It seems to me that subset and filter (from dplyr) give the same result. My question is: is there any potential difference between them, e.g. in speed or in the data sizes they can handle? Are there occasions where it is better to use one or the other?

Example:

    library(dplyr)

    df1 <- subset(airquality, Temp > 80 & Month > 5)
    df2 <- filter(airquality, Temp > 80 & Month > 5)

    summary(df1$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14

    summary(df2$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14

This post compares subset, filter, with and [: how-to-use-or-and-in-dplyr-to-subset-a-data-frame
– Silence Dogood, Commented Oct 5, 2016 at 20:11
The main difference is that subset comes with a warning in ?subset: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." filter is designed to work robustly with the rest of dplyr and the tidyverse, both interactively and programmatically, and has a separate standard-eval version filter_ for when necessary. Also, it treats commas as &.
– alistaire, Commented Oct 5, 2016 at 20:23
@alistaire just an update that filter_() and the _ versions of dplyr functions in general are now deprecated in favor of tidy evaluation semantics. For details on current best practices, see programming with dplyr.
– Bryan Shalloway, Commented Oct 1, 2020 at 15:11

Benjamin · Accepted Answer · 2016-10-05 20:05:50Z

They are, indeed, producing the same result, and they are very similar in concept.

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (roughly five times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by about 200 microseconds (median). And at 153,000 records, filter is roughly 2.5 times faster by the median (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage of filter (and this is a bit of a niche advantage) is that it can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

    library(dplyr)
    library(microbenchmark)

    # Original example
    microbenchmark(
      df1 <- subset(airquality, Temp > 80 & Month > 5),
      df2 <- filter(airquality, Temp > 80 & Month > 5)
    )
    # Unit: microseconds
    #    expr     min       lq     mean   median      uq      max neval cld
    #  subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a
    #  filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

    # 15,300 rows
    air <- lapply(1:100, function(x) airquality) %>% bind_rows

    microbenchmark(
      df1 <- subset(air, Temp > 80 & Month > 5),
      df2 <- filter(air, Temp > 80 & Month > 5)
    )
    # Unit: microseconds
    #    expr      min        lq     mean   median       uq      max neval cld
    #  subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
    #  filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a

    # 153,000 rows
    air <- lapply(1:1000, function(x) airquality) %>% bind_rows

    microbenchmark(
      df1 <- subset(air, Temp > 80 & Month > 5),
      df2 <- filter(air, Temp > 80 & Month > 5)
    )
    # Unit: milliseconds
    #    expr       min        lq     mean    median        uq      max neval cld
    #  subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
    #  filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a

Sir, for me the results are just the opposite! In both cases subset performs better than filter on my machine.
There could be half a dozen reasons for that. Is the difference in execution time large enough to care about?
subset 1.164632 1.220479 1.717666 1.266967 1.421527, filter 5.314198 5.440985 5.669854 5.595846 5.793876

rsmith54 · Accepted Answer · 2017-03-31 15:57:49Z

One additional difference not yet mentioned is that filter discards rownames, while subset doesn't:

    filter(mtcars, gear == 5)
    #    mpg cyl  disp  hp drat    wt qsec vs am gear carb
    # 1 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
    # 2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
    # 3 15.8   4 351.0 264 4.22 3.170 14.5  0  1    5    4
    # 4 19.7   4 145.0 175 3.62 2.770 15.5  0  1    5    6
    # 5 15.0   4 301.0 335 3.54 3.570 14.6  0  1    5    8

    subset(mtcars, gear == 5)
    #                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
    # Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
    # Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
    # Ford Pantera L 15.8   4 351.0 264 4.22 3.170 14.5  0  1    5    4
    # Ferrari Dino   19.7   4 145.0 175 3.62 2.770 15.5  0  1    5    6
    # Maserati Bora  15.0   4 301.0 335 3.54 3.570 14.6  0  1    5    8

This can be critical in some use cases where row names are essential and it is advantageous to keep them out of the main data, such as when computing a distance matrix for clustering.
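If the row labels matter downstream, one option that does not depend on how a given dplyr version treats row names is to turn them into an ordinary column before filtering and restore them afterwards. This is a minimal sketch, assuming the tibble package is installed; the column name model is chosen here purely for illustration.

```r
library(dplyr)
library(tibble)

# Move the row names into a regular column ("model" is an illustrative name)
cars_labelled <- rownames_to_column(mtcars, var = "model")

# The labels now travel with the data through filter()
five_gears <- filter(cars_labelled, gear == 5)
five_gears$model

# Restore them as row names when a downstream function expects them,
# for example dist(), which labels its result with the row names
five_gears_rn <- column_to_rownames(five_gears, var = "model")
d <- dist(five_gears_rn[, c("mpg", "hp", "wt")])
```

Keeping the labels in a column also means they survive joins and grouped operations, at the cost of having one non-numeric column to exclude before numeric computations.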

moodymudskipper · Accepted Answer · 2021-01-07 12:33:20Z

In the main use cases they behave the same:

    library(dplyr)
    identical(
      filter(starwars, species == "Wookiee"),
      subset(starwars, species == "Wookiee"))
    # [1] TRUE

But they have quite a few differences, including (I was as exhaustive as possible but might have missed some):

subset can be used on matrices
filter can be used on databases
filter drops row names
subset drops attributes other than class, names and row names
subset has a select argument
subset recycles its condition argument
filter supports conditions as separate arguments
filter preserves the class of the column
filter supports the .data pronoun
filter supports some rlang features
filter supports grouping
filter supports n() and row_number()
filter is stricter
filter is a bit faster when it counts
subset has methods in other packages

`subset` can be used on matrices

    subset(state.x77, state.x77[, "Population"] < 400)
    #         Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
    # Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
    # Wyoming        376   4566        0.6    70.29    6.9    62.9   173  97203

Though columns can't be used directly as variables in the subset argument:

    subset(state.x77, Population < 400)
    # Error in subset.matrix(state.x77, Population < 400) : object 'Population' not found

Neither works with filter:

    filter(state.x77, state.x77[, "Population"] < 400)
    # Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"

    filter(state.x77, Population < 400)
    # Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('matrix', 'double', 'numeric')"

`filter` can be used on databases

    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "mtcars", mtcars)
    tbl(con, "mtcars") %>% filter(hp < 65)
    # # Source:   lazy query [?? x 11]
    # # Database: sqlite 3.19.3 [:memory:]
    #     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
    # 2  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2

subset can't:

    tbl(con, "mtcars") %>% subset(hp < 65)
    # Error in subset.default(., hp < 65) : object 'hp' not found
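To see what filter hands over to the database, the lazy query can be inspected before any rows are fetched. The following is a sketch that assumes the DBI, RSQLite and dbplyr packages are installed; it rebuilds the in-memory table from the example above so that it runs on its own, and the object names are illustrative.

```r
library(dplyr)
library(DBI)

# Rebuild the in-memory SQLite table so this block is self-contained
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Nothing is pulled into R yet: this only describes the query
lazy_query <- tbl(con, "mtcars") %>% filter(hp < 65)

# Print the SQL that the filter() call is translated to
show_query(lazy_query)

# collect() runs the query and returns the matching rows as a local tibble
small_engines <- collect(lazy_query)

dbDisconnect(con)
```

The filtering happens inside the database, so only the matching rows cross into R memory when collect() is called.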

`filter` drops row names

    filter(mtcars, hp < 65)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2

subset doesn't:

    subset(mtcars, hp < 65)
    #              mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # Merc 240D   24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # Honda Civic 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2

`subset` drops attributes other than class, names and row names

    cars_head <- head(cars)
    attr(cars_head, "info") <- "head of cars dataset"

    attributes(subset(cars_head, speed > 0))
    #> $names
    #> [1] "speed" "dist"
    #>
    #> $row.names
    #> [1] 1 2 3 4 5 6
    #>
    #> $class
    #> [1] "data.frame"

    attributes(filter(cars_head, speed > 0))
    #> $names
    #> [1] "speed" "dist"
    #>
    #> $row.names
    #> [1] 1 2 3 4 5 6
    #>
    #> $class
    #> [1] "data.frame"
    #>
    #> $info
    #> [1] "head of cars dataset"

`subset` has a `select` argument

dplyr follows tidyverse principles, which aim at having each function do one thing, so there select is a separate function.

    identical(
      subset(starwars, species == "Wookiee", select = c("name", "height")),
      filter(starwars, species == "Wookiee") %>% select(name, height)
    )
    # [1] TRUE

It also has a drop argument, which mostly makes sense in combination with the select argument.
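As a small illustration of how drop interacts with select: when a single column is selected, drop = TRUE collapses the result to a plain vector. The dplyr pipeline shown next to it, using pull(), is one possible counterpart rather than an exact equivalent of the argument.

```r
library(dplyr)

# One selected column plus drop = TRUE gives a vector instead of a data frame
mpg_subset <- subset(mtcars, hp < 65, select = mpg, drop = TRUE)

# With dplyr the same values can be obtained by filtering, then pulling the column
mpg_filter <- mtcars %>% filter(hp < 65) %>% pull(mpg)

is.data.frame(mpg_subset)  # FALSE
mpg_subset
mpg_filter
```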

`subset` recycles its condition argument

    half_iris <- subset(iris, c(TRUE, FALSE))
    dim(iris)
    # [1] 150   5
    dim(half_iris)
    # [1] 75  5

filter doesn't:

    half_iris <- filter(iris, c(TRUE, FALSE))
    # Error in filter_impl(.data, quo) : Result must have length 150, not 2
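Since filter refuses to recycle a short logical vector, the "every other row" selection has to be spelled out as a full-length condition. One way to sketch this is with row_number(); the object names below are illustrative.

```r
library(dplyr)

# Base R: the length-2 condition is recycled over all 150 rows
odd_rows_subset <- subset(iris, c(TRUE, FALSE))

# dplyr: build a condition with one value per row instead
odd_rows_filter <- filter(iris, row_number() %% 2 == 1)

nrow(odd_rows_filter)  # 75
identical(odd_rows_subset$Sepal.Length, odd_rows_filter$Sepal.Length)
```

The explicit version is longer, but it states which rows are kept instead of relying on the length of the condition.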

`filter` supports conditions as separate arguments

Conditions are passed to ..., so several conditions can be given as separate arguments. This is equivalent to combining them with &, but it is sometimes more readable because it sidesteps logical operator precedence and benefits from automatic indentation.

    identical(
      subset(starwars, (species == "Wookiee" | eye_color == "blue") & mass > 120),
      filter(starwars, species == "Wookiee" | eye_color == "blue", mass > 120)
    )

`filter` preserves the class of the column

    df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
    class(df$a) <- "foo"
    class(df$b) <- "Date"

    # subset preserves the Date, but strips the "foo" class
    str(subset(df, TRUE))
    #> 'data.frame': 2 obs. of 3 variables:
    #>  $ a: int 1 2
    #>  $ b: Date, format: "1970-01-04" "1970-01-05"
    #>  $ c: int 5 6

    # filter keeps both
    str(dplyr::filter(df, TRUE))
    #> 'data.frame': 2 obs. of 3 variables:
    #>  $ a: 'foo' int 1 2
    #>  $ b: Date, format: "1970-01-04" "1970-01-05"
    #>  $ c: int 5 6

`filter` supports the use of the `.data` pronoun

    mtcars %>% filter(.data[["hp"]] < 65)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2

`filter` supports some `rlang` features

    x <- "hp"
    library(rlang)
    mtcars %>% filter(!!sym(x) < 65)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2

    filter65 <- function(data, var) {
      data %>% filter(!!enquo(var) < 65)
    }
    mtcars %>% filter65(hp)
    #    mpg cyl  disp hp drat    wt  qsec vs am gear carb
    # 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
    # 2 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
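For wrapper functions like the one above, more recent tidy evaluation style offers two shorter spellings. This is a sketch assuming dplyr 1.0 or later (with rlang 0.4 or later); the function names filter_below and filter_below_chr and the threshold argument are hypothetical and only serve the illustration.

```r
library(dplyr)

# {{ var }} forwards a bare column name supplied by the caller;
# it stands in for the !!enquo(var) pattern
filter_below <- function(data, var, threshold) {
  data %>% filter({{ var }} < threshold)
}

# .data[[name]] looks up a column whose name is stored in a string;
# it stands in for the !!sym(name) pattern
filter_below_chr <- function(data, name, threshold) {
  data %>% filter(.data[[name]] < threshold)
}

by_symbol <- filter_below(mtcars, hp, 65)
by_string <- filter_below_chr(mtcars, "hp", 65)
```

Both calls select the same rows as filter(mtcars, hp < 65). subset offers no comparable mechanism, which is part of why its help page recommends [ for programming.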

`filter` supports grouping

    iris %>% group_by(Species) %>% filter(Petal.Length < quantile(Petal.Length, 0.01))
    # # A tibble: 3 x 5
    # # Groups:   Species [3]
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    #          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
    # 1          4.6         3.6          1.0         0.2     setosa
    # 2          5.1         2.5          3.0         1.1 versicolor
    # 3          4.9         2.5          4.5         1.7  virginica

    iris %>% group_by(Species) %>% subset(Petal.Length < quantile(Petal.Length, 0.01))
    # # A tibble: 2 x 5
    # # Groups:   Species [1]
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
    # 1          4.3         3.0          1.1         0.1  setosa
    # 2          4.6         3.6          1.0         0.2  setosa

`filter` supports `n()` and `row_number()`

    filter(iris, row_number() < n() / 30)
    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
    # 3          4.7         3.2          1.3         0.2  setosa
    # 4          4.6         3.1          1.5         0.2  setosa

`filter` is stricter

It triggers errors if the input is suspicious.

    filter(iris, Species = "setosa")
    # Error: `Species` (`Species = "setosa"`) must not be named, do you need `==`?

    identical(subset(iris, Species = "setosa"), iris)
    # [1] TRUE

    df1 <- setNames(data.frame(a = 1:3, b = 5:7), c("a", "a"))
    # df1
    #   a a
    # 1 1 5
    # 2 2 6
    # 3 3 7

    filter(df1, a > 2)
    # Error: Column `a` must have a unique name

    subset(df1, a > 2)
    #   a a.1
    # 3 3   7

`filter` is a bit faster when it counts

Borrowing the dataset that Benjamin built in his answer (153 k rows), filter is about twice as fast, though this should rarely be a bottleneck.

    air <- lapply(1:1000, function(x) airquality) %>% bind_rows

    microbenchmark::microbenchmark(
      subset = subset(air, Temp > 80 & Month > 5),
      filter = filter(air, Temp > 80 & Month > 5)
    )
    # Unit: milliseconds
    #    expr      min        lq      mean    median        uq      max neval cld
    #  subset 8.771962 11.551255 19.942501 12.576245 13.933290 108.0552   100   b
    #  filter 4.144336  4.686189  8.024461  6.424492  7.499894 101.7827   100  a

`subset` has methods in other packages

subset is an S3 generic, just as dplyr::filter is, but as a base function subset is more likely to have methods defined in other packages. One prominent example is zoo:::subset.zoo.

Maria Wollestonecraft · Accepted Answer · 2017-08-05 13:37:06Z

Interesting. I was trying to see the difference in terms of the resulting dataset, and I couldn't find an explanation for why the "[" operator behaved differently (i.e., why it also returned NAs):

    # Subset for year=2013
    sub <- brfss2013 %>% filter(iyear == "2013")
    dim(sub)
    # [1] 486088    330
    length(which(is.na(sub$iyear)) == T)
    # [1] 0

    sub2 <- filter(brfss2013, iyear == "2013")
    dim(sub2)
    # [1] 486088    330
    length(which(is.na(sub2$iyear)) == T)
    # [1] 0

    sub3 <- brfss2013[brfss2013$iyear == "2013", ]
    dim(sub3)
    # [1] 486093    330
    length(which(is.na(sub3$iyear)) == T)
    # [1] 5

    sub4 <- subset(brfss2013, iyear == "2013")
    dim(sub4)
    # [1] 486088    330
    length(which(is.na(sub4$iyear)) == T)
    # [1] 0
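The extra rows come from how each tool treats an NA in the condition: a logical index passed to [ turns every NA into a row of NAs, while subset and filter both treat NA as FALSE and drop the row. A toy data frame (made up here for illustration, not the brfss2013 data) reproduces the effect:

```r
library(dplyr)

# Toy data with one missing year
df <- data.frame(id = 1:4, year = c("2013", NA, "2014", "2013"))

# The condition evaluates to TRUE, NA, FALSE, TRUE
by_bracket <- df[df$year == "2013", ]         # the NA yields an all-NA row
by_subset  <- subset(df, year == "2013")      # NA is treated as FALSE
by_filter  <- filter(df, year == "2013")      # NA is treated as FALSE
by_which   <- df[which(df$year == "2013"), ]  # which() keeps only the TRUE positions

nrow(by_bracket)  # 3
nrow(by_subset)   # 2
nrow(by_filter)   # 2
nrow(by_which)    # 2
```

Wrapping the condition in which(), or adding !is.na(...) to it, makes [ agree with the other two.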

R. Prost · Accepted Answer · 2018-06-20 07:57:03Z

Another difference is that subset does more than filter: it can also select and drop columns, whereas dplyr has two separate functions for this.

    subset(df, select = c("varA", "varD"))
    dplyr::select(df, varA, varD)

Albert · Accepted Answer · 2018-09-12 09:55:32Z

An additional advantage of filter is that it plays nice with grouped data. subset ignores groupings.

So when the data is grouped, subset will still make reference to the whole data, but filter will only reference the group.

    # setup
    library(tidyverse)

    data.frame(a = 1:2) %>% group_by(a) %>% subset(length(a) == 1)
    # returns empty table

    data.frame(a = 1:2) %>% group_by(a) %>% filter(length(a) == 1)
    # returns all rows
