subsetting in data.table

Question

I am trying to subset a data.table (from the package data.table) in R (not a data.frame). I have a 4-digit year as the key. I would like to subset by a series of years; for example, I want to pull all the records from 1999, 2000, and 2001.

I have tried passing the following to the DT[J(year)] binary search syntax:

1999,2000,2001
c(1999,2000,2001)
1999, 2000, 2001

but none of these seem to work. Anyone know how to do a subset where the years you want to select are not just 1 but multiple years?

Sorry for not being a good citizen on Stack Overflow. Will attend to this now. Will also be more mindful about getting references included to save time for those who are trying to help me.
– exl, Commented Mar 31, 2011 at 13:50
@Andrie: the question has been edited to include it (@exl did that, I just made it a bit clearer), so your downvote can be reversed if you wish. For the rest, the question is at least valid.
– Joris Meys, Commented Mar 31, 2011 at 17:13
@Joris, that is already an improvement, so I have reversed my downvote. However, for this to be a good question, it needs a library(data.table) statement plus some real example code.
– Andrie, Commented Mar 31, 2011 at 20:29
+1 to reverse some excessive markdowns. -1 for no data.table ref perhaps, but -5? And why the need to list error messages for such a simple matter of syntax?
– geotheory, Commented Oct 29, 2013 at 13:41

Richie Cotton · Accepted Answer · 2011-03-30 16:41:48Z

20

What works for data.frames works for data.tables.

subset(DT, year %in% 1999:2001)
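As a minimal sketch of the call above: the table DT and its columns below are made up for illustration, since the question did not include any data.

```r
library(data.table)

# Hypothetical example data: two records per year, 1997 to 2003.
DT <- data.table(year = rep(1997:2003, each = 2), value = 1:14)

# Vector scan: keep every row whose year is one of 1999, 2000, 2001.
res <- subset(DT, year %in% 1999:2001)
res
```

This needs no key on DT; the `%in%` test is evaluated against the whole year column.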

answered Mar 30, 2011 at 16:41

Richie Cotton


2 Comments

naught101 Over a year ago

Not everything: dt <- data.table(a=c(1,2,3), b=c(4,5,6), c=c(7,8,9)); dt[, c('b', 'c')] returns a vector of column names - with data.frame, it returns the columns.

dracodoc Over a year ago

data.table has its own subset now: subset.data.table {data.table}

Community · Accepted Answer · 2017-05-23 12:02:17Z

The question is not clear and does not provide sufficient data to work with, BUT it is useful, so if someone can edit it with the data I provide below, they are welcome to. The title of the post could also be completed: Matthew Dowle often answers the subsetting-over-two-vectors question, but less frequently the subsetting-by-an-in-statement-on-one-vector one. I had been looking for an answer for a while, until I found one for character vectors here.

Let's consider this data :

library(data.table)
n <- 100
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)

The data.table-style query corresponding to X[X$a %in% c(10,20),] is somewhat surprising:

setkey(X,a)
X[.(c(10,20))]
X[.(10,20)] # works for characters but not for integers
            # instead, treats 10 as the filter
            # and 20 as a new variable
# for comparison :
X[X$a %in% c(10,20),]
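To make the difference concrete, here is a small sketch on a fixed, made-up table (not the random X above) that checks the keyed lookup with c() inside .() against the plain %in% scan. The names keyed and scanned are just for illustration.

```r
library(data.table)

# Small deterministic stand-in for X so the result is easy to inspect.
X <- data.table(a = c(10, 20, 25, 30, 40, 10, 20), b = 1:7)
setkey(X, a)

# One key column, several values: combine the values with c() inside .()
keyed <- X[.(c(10, 20))]

# The equivalent vector scan, for comparison.
scanned <- X[a %in% c(10, 20)]

# Both should pick out the same rows (compared here by the b column).
setequal(keyed$b, scanned$b)
```

Each argument to .() is matched against one key column in turn, which is why several values for the same column have to be combined with c().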

Now, which is best? If your key is already set, data.table, obviously. Otherwise it might not be, as the following timings show (on my computer with 1.75 GB of RAM):

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(X[X$a %in% c(10,20),])
# utilisateur     système      écoulé   (yes, I'm French: user, system, elapsed)
#        1.92        0.06        1.99
system.time(setkey(X,a))
# utilisateur     système      écoulé
#       34.91        0.05       35.23
system.time(X[J(c(10,20))])
# utilisateur     système      écoulé
#        0.15        0.08        0.23

But maybe Matthew has better solutions...

[Matthew] You've discovered that sorting type numeric (a.k.a. double) is much slower than integer. For many years we didn't allow double in keys for fear of users falling into this trap and reporting terrible timings like this. We allowed double in keys with some trepidation because fast sorting isn't implemented for double yet. Fast sorting on integer and character is pretty good because those are done using a counting sort. ~~Hopefully we'll get to fast sorting numeric one day!~~ (Now implemented - see below).

Timings on data.table pre-1.9.0

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#    user  system elapsed
#  13.898   0.138  14.216

X <- data.table(a=sample(as.integer(c(10,20,25,30,40)),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#    user  system elapsed
#   0.381   0.019   0.408

Remember that 2 is type numeric in R by default; 2L is integer. Although data.table accepts numeric, it still much prefers integer.
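A quick way to see the distinction, and one way to build the key column as integer from the start. This is only a sketch: the table size and the name res are arbitrary choices for illustration.

```r
library(data.table)

typeof(2)    # "double"
typeof(2L)   # "integer"

# Build the key column from integer literals (note the L suffix).
n <- 1e5  # arbitrary size, much smaller than the timings in this answer
X <- data.table(a = sample(c(10L, 20L, 25L, 30L, 40L), n, replace = TRUE),
                b = seq_len(n))
setkey(X, a)

# Look up with integer values as well, so both sides have the same type.
res <- X[.(c(10L, 20L))]
```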

Fast radix sort for numerics is implemented since v1.9.0.

From v1.9.0 on

n <- 1e7
X <- data.table(a=sample(c(10,20,25,30,40),n,replace=TRUE),b=1:n)
system.time(setkey(X,a))
#    user  system elapsed
#   0.832   0.026   0.871

On the "less frequently" comment can you provide some links? I'm not aware I've avoided answering any questions at all. Also I don't follow the "works for characters but not integers" bit, do you have an example of it working for character? Needing to make a vector using c() inside a call to list() or data.frame|table is a common R idiom.
But now you've answered this question, I see what the asker was asking now. But at the time (2 years ago) I honestly did not understand. I usually guess if I can (it would have helped if the 1st and 3rd attempts in the question had been valid syntax). I agree it's actually a good question, deep down.
I did not mean you actually avoided answering :) It's just that the question is asked less often. OK, for the integer case, my mistake. For an example with characters, see the link provided at the beginning. With X <- data.table(a=sample(as.character(c(10,20,25,30,40)),n,replace=TRUE),b=1:n), you can subset X according to a subset of a with X[.('10','20')].

Yike Lu · Accepted Answer · 2012-05-26 16:50:37Z

8

Like the above, but more data.table-esque:

DT[year %in% c(1999, 2000, 2001)]

answered May 26, 2012 at 16:50

Yike Lu


2 Comments

Matt Dowle Over a year ago

True but we don't want to encourage vector scans. And the question is pretty bad. If they had included basic example and the error message, it could have been solved using binary search.

Frank Over a year ago

These days, this is the right answer thanks to auto indexing, I guess.
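For anyone who wants to try the unkeyed form from this answer, here is a small sketch on made-up data. The indices() call at the end is only there to inspect whether data.table stored a secondary index; whether one shows up is an assumption that depends on your data.table version and options, so no output is claimed for it.

```r
library(data.table)

# Hypothetical example data, deliberately left without a key.
DT <- data.table(year = rep(1995:2004, times = 3), value = 1:30)

# Subset by several years directly in i, no setkey() needed.
res <- DT[year %in% c(1999, 2000, 2001)]
res

# Lists any secondary indices currently stored on DT.
indices(DT)
```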

Paul Lemmens · Accepted Answer · 2014-10-23 14:50:36Z

This will work:

sample_DT = data.table(year = rep(1990:2010, length.out = 1000),
                       random_number = rnorm(1000),
                       key = "year")
year_subset = sample_DT[J(c(1990, 1995, 1997))]

Similarly, you can key an already existing data.table with setkey(existing_DT, year) and then use the J() syntax as shown above.

I think the problem may be that you didn't key the data first.
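A sketch of that second route, starting from a table that has no key yet. existing_DT here is made-up data in the same shape as sample_DT, and the lookup values are written as integers to match the integer year column.

```r
library(data.table)

# Hypothetical existing table, created without a key.
existing_DT <- data.table(year = rep(1990:2010, length.out = 1000),
                          random_number = rnorm(1000))

haskey(existing_DT)   # FALSE: no key has been set yet

# Key the table by year, then use the J() lookup.
setkey(existing_DT, year)
key(existing_DT)      # "year"

year_subset <- existing_DT[J(c(1990L, 1995L, 1997L))]
```

Checking haskey() or key() first is a cheap way to find out whether a missing key is the cause when a J() lookup does not behave as expected.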
