Here's my take on it. I'll assume df and dt to be the names of the data.frame and data.table objects, for easy/quick typing.

Even though there are a total of 250,000 rows, your data size is only around 38MB. At this size, it's unlikely to see a noticeable difference in grouping speed.
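
If you want to double-check that size on your end, base R can report it directly (a quick sketch, assuming df is the data.frame built from the question's data):

print(object.size(df), units = "MB")
# roughly 38 MB for the 250,000 rows here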

system.time(group_by(df, id1, id2))
#    user  system elapsed
#   0.303   0.007   0.311
system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
#    user  system elapsed
#   0.002   0.000   0.002

Even though data.table's grouping is >100x faster here, it's clearly not the reason for such slowness...

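For reference, data.table also optimises simple grouped aggregations internally; a grouped max would look like the following (a sketch reusing dt from above; a bare max() in j is the form the optimiser recognises):

# computes each group's max datetime; j is internally optimised to gmax()
dt[, .(datetime = max(datetime)), by = .(id1, id2)]
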
Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

dt[, .(datetime = base::max(datetime)), by = .(id1, id2)] 
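
The base:: prefix stops data.table from recognising max() and substituting gmax(), so max() is instead evaluated once per group. To see which path a given query takes, you can ask for the grouping internals (a sketch; the exact verbose wording differs across data.table versions):

# verbose = TRUE prints optimisation details; look for the line mentioning GForce
dt[, .(datetime = max(datetime)), by = .(id1, id2), verbose = TRUE]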

So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise them internally, and we have not gotten to it yet.

I agree that optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.
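
Until then, a common way to sidestep the per-group eval() for this particular pattern is to compute the grouped max via the optimised path and join it back (a sketch, assuming you want every row that attains its group's max):

# the grouped max goes through the fast gmax() path ...
mx <- dt[, .(datetime = max(datetime)), by = .(id1, id2)]
# ... and a single join then retrieves all matching rows (ties included)
dt[mx, on = .(id1, id2, datetime)]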

Hope this helps.
