I used to do my data wrangling with dplyr, but some of the computations are "slow", in particular subsetting by groups. I read that dplyr is slow when there are a lot of groups, and based on this benchmark data.table could be faster, so I started to learn data.table.
Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and keep, for each group, the rows with the maximum datetime.
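To make the goal concrete, here is a tiny made-up example of the operation (illustrative values only, not part of the data below):

    # Toy illustration (made-up values): for each (id1, id2) group,
    # keep the row(s) carrying the latest datetime.
    toy <- data.frame(id1 = c("a", "a", "b"),
                      id2 = c("x", "x", "y"),
                      datetime = as.POSIXct(c("2015-01-01 10:00:00",
                                              "2015-01-02 10:00:00",
                                              "2015-01-01 10:00:00")))
    # Desired result: rows 2 and 3, the most recent timestamp of each group.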
Data
    library(dplyr)
    library(data.table)

    # random datetime generation function by Dirk Eddelbuettel
    # https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
    rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/05") {
      st <- as.POSIXct(as.Date(st))
      et <- as.POSIXct(as.Date(et))
      dt <- as.numeric(difftime(et, st, unit = "sec"))
      ev <- sort(runif(N, 0, dt))
      rt <- st + ev
    }

    set.seed(42)

    # Creating 230000 id couples
    ids <- data.frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
                      id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))

    # Repeating ids[1:2000, ] randomly to create groups
    ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])

    # Adding a random datetime variable and dummy variables to mimic the real data
    datas <- transform(ids,
                       datetime = rand.datetime(25e4),
                       var1 = sample(LETTERS[1:6], 25e4, rep = TRUE),
                       var2 = sample(c(1:10, NA), 25e4, rep = TRUE),
                       var3 = sample(c(1:10, NA), 25e4, rep = TRUE),
                       var4 = rand.datetime(25e4),
                       var5 = rand.datetime(25e4))

    datas.tbl <- tbl_df(datas)
    datas.dt <- data.table(datas, key = c("id1", "id2"))

I couldn't find a straightforward way to subset by groups with data.table, so I asked this question: Filter rows by groups with data.table
It was suggested that I use .SD:
    datas.dt[, .SD[datetime == max(datetime)], by = c("id1", "id2")]

But I have two problems: it works with Date but not with POSIXct ("Error in UseMethod("as.data.table") : no applicable method for 'as.data.table' applied to an object of class "c('POSIXct', 'POSIXt')""), and it is very slow. For example, with Dates:
    > system.time({
    +   datas.dt[, .SD[as.Date(datetime) == max(as.Date(datetime))], by = c("id1", "id2")]
    + })
    utilisateur     système      écoulé
         207.03        0.00      207.48

So I found other, much faster ways to achieve this (and keep the datetimes) with data.table.
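As an aside (a sketch rather than something I have timed): the as.Date() call in the .SD version above is evaluated twice inside each of the ~230k groups, which adds avoidable overhead on top of the per-group .SD subsetting. Converting once, up front, into a helper column (d is just a name picked for illustration) avoids that:

    # Sketch: convert to Date once before grouping instead of twice per group.
    datas.dt[, d := as.Date(datetime)]
    datas.dt[, .SD[d == max(d)], by = c("id1", "id2")]
    datas.dt[, d := NULL]   # drop the helper column again afterwards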
Functions
    f.dplyr <- function(x) x %>% group_by(id1, id2) %>% filter(datetime == max(datetime))

    f.dt.i <- function(x) x[x[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]

    f.dt <- function(x) x[x[, datetime == max(datetime), by = c("id1", "id2")]$V1]

But then, although I expected data.table to be much faster, the time difference with dplyr isn't significant.
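For reference, here is a which.max-based variant of the same idea (a sketch with names I made up; it is not part of the benchmark below and, unlike the filters above, it keeps only the first row of a group when datetime is tied):

    # Sketch: pick the single row index of the maximum datetime in each group.
    f.dt.which    <- function(x) x[x[, .I[which.max(datetime)], by = c("id1", "id2")]$V1]
    f.dplyr.which <- function(x) x %>% group_by(id1, id2) %>% slice(which.max(datetime))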
Microbenchmark
    mbm <- microbenchmark(
      dplyr        = res1 <- f.dplyr(datas.tbl),
      data.table.I = res2 <- f.dt.i(datas.dt),
      data.table   = res3 <- f.dt(datas.dt),
      times = 50L)

    Unit: seconds
             expr      min       lq     mean   median       uq      max neval
            dplyr 31.84249 32.24055 32.59046 32.61311 32.88703 33.54226    50
     data.table.I 30.02831 30.94621 31.19660 31.17820 31.42888 32.16521    50
       data.table 30.28923 30.84212 31.09749 31.04851 31.40432 31.96351    50

Am I missing/misusing something with data.table? Do you have ideas to speed up this computation?
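For completeness, a quick way to confirm that the three approaches return the same rows before comparing their timings (a sketch; check is a helper name I made up):

    # Sketch: coerce each result to a plain data.frame, sort it by the grouping
    # columns and datetime, and compare contents while ignoring attributes.
    check <- function(res) {
      out <- as.data.frame(res)
      out <- out[order(out$id1, out$id2, out$datetime), ]
      rownames(out) <- NULL
      out
    }
    all.equal(check(res1), check(res2), check.attributes = FALSE)
    all.equal(check(res2), check(res3), check.attributes = FALSE)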
Any help would be highly appreciated! Thanks
Edit: some details about the system and package versions used for the microbenchmark. (The computer isn't a powerhouse: an i5 with 12 GB of RAM.)
System
    sessionInfo()
    R version 3.1.3 (2015-03-09)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252
    [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
    [5] LC_TIME=French_France.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] readr_0.1.0          ggplot2_1.0.1        microbenchmark_1.4-2
    [4] data.table_1.9.4     dplyr_0.4.1          plyr_1.8.2

    loaded via a namespace (and not attached):
     [1] assertthat_0.1   chron_2.3-45     colorspace_1.2-6 DBI_0.3.1
     [5] digest_0.6.8     grid_3.1.3       gtable_0.1.2     lazyeval_0.1.10
     [9] magrittr_1.5     MASS_7.3-39      munsell_0.4.2    parallel_3.1.3
    [13] proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1   scales_0.2.4
    [17] stringi_0.4-1    stringr_0.6.2    tools_3.1.3

    > packageVersion("data.table")
    [1] '1.9.4'
    > packageVersion("dplyr")
    [1] '0.4.1'
Comments (from the discussion under the question; some are truncated):

- "… which.max returns? Also, datas.dt[, .SD[as.Date(datetime) == max(as.Date(datetime))], by = c("id1", "id2")] is bad practice. You should convert datetime to the IDate class before subsetting."

- "… x %>% group_by(id1, id2) %>% slice(which(datetime == max(datetime))) to your comparison?"

- "datas.dt[, datetime := as.IDate(datetime)]; system.time(datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]) runs in only 5 seconds, compared to 200 when using .SD, so I find your benchmarks hard to believe."

- "… data.table solution because he assumes it will be faster than dplyr - this is why I compare your proposal against data.table, in case his assumption is wrong."