Here's my take on it. I'll assume df and dt to be the names of the data.frame and data.table objects, for easy/quick typing.

Even though there are a total of 250,000 rows, your data size is only around 38MB. At this size, it's unlikely to see a noticeable difference in grouping speed.
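
If you want to double-check that size on your end, base R can report it directly (a quick sketch, assuming df is the data.frame built from the question's data):

print(object.size(df), units = "MB")
# roughly 38 MB for the 250,000 rows here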

system.time(group_by(df, id1, id2))
#    user  system elapsed
#   0.303   0.007   0.311
system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
#    user  system elapsed
#   0.002   0.000   0.002

Even though data.table's grouping is >100x faster here, it's clearly not the reason for such slowness...

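For reference, data.table also optimises simple grouped aggregations internally; a grouped max would look like the following (a sketch reusing dt from above; a bare max() in j is the form the optimiser recognises):

# computes each group's max datetime; j is internally optimised to gmax()
dt[, .(datetime = max(datetime)), by = .(id1, id2)]
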
Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

dt[, .(datetime = base::max(datetime)), by = .(id1, id2)] 
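
The base:: prefix stops data.table from recognising max() and substituting gmax(), so max() is instead evaluated once per group. To see which path a given query takes, you can ask for the grouping internals (a sketch; the exact verbose wording differs across data.table versions):

# verbose = TRUE prints optimisation details; look for the line mentioning GForce
dt[, .(datetime = max(datetime)), by = .(id1, id2), verbose = TRUE]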

So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise them internally, and we have not gotten to it yet.

I agree that optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.
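
Until then, a common way to sidestep the per-group eval() for this particular pattern is to compute the grouped max via the optimised path and join it back (a sketch, assuming you want every row that attains its group's max):

# the grouped max goes through the fast gmax() path ...
mx <- dt[, .(datetime = max(datetime)), by = .(id1, id2)]
# ... and a single join then retrieves all matching rows (ties included)
dt[mx, on = .(id1, id2, datetime)]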

Hope this helps.
