k_th smallest element per group in R

Question

I have a data table, dt, that looks like

 location year value NYC 2026 1 NYC 2026 2 NYC 2026 3 NYC 2026 4 NYC 2026 5 LA 2026 6 LA 2026 7 LA 2026 8 LA 2026 9 LA 2026 10

I would like to group them by city and year and find the 2nd smallest element in the column value, per each group, where the desired result looks like:

 location year value NYC 2026 2 LA 2026 7

dt %>% grou_by(location, year) %>% nth(value, 2)

Would not work. Any help is appreciated.

The data table above can be created by:

dt <- structure(list(location = c("NYC", "NYC", "NYC","NYC", "NYC", "LA", "LA", "LA", "LA", "LA"), year = c(2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026), value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), class = "data.table", row.names = c(NA, -10L))

tmfmnk · Accepted Answer · 2019-05-10 21:26:06Z

One dplyr possibility could be:

df %>% group_by(location) %>% arrange(value) %>% slice(2)

Here it groups by "location" column, arranges the values according "value" column and then keeps the second element.

 location year value <chr> <int> <int> 1 LA 2026 7 2 NYC 2026 2

Or if the values in "value" column could be duplicate, then you can do:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% arrange(value) %>% slice(2)

Or using filter() instead of slice():

df %>% group_by(location) %>% arrange(value) %>% filter(row_number() == 2)

The same considering also possible duplicates:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% arrange(value) %>% filter(row_number() == 2)

Or using filter() and dense_rank():

df %>% group_by(location) %>% filter(dense_rank(value) == 2)

The same considering also possible duplicates:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% filter(dense_rank(value) == 2)

liuminzhao · Accepted Answer · 2019-05-10 21:28:36Z

Using summarize to work with group_by:

> dt %>% group_by(location, year) %>% arrange(value) %>% summarize(value = nth(value, 2)) # A tibble: 2 x 3 # Groups: location [2] location year value <chr> <dbl> <dbl> 1 LA 2026 7 2 NYC 2026 2

Rorschach · Accepted Answer · 2019-05-10 21:50:07Z

Since you only need the 2nd element, a partial sort shouldn't be out of the question. I don't know if it is supported by dplyr or data.table, but it is with base R sorting (with restricted options), eg. something along the lines of

with(dt, lapply(split(dt, interaction(location, year)), function(x) sort.int(x$value, partial=2)))

I doubt it would be faster than dplyr or data.table even if they are sorting fully given their optimizations, but maybe it would be worth looking into if efficiency is a concern.

Oh, you could also sort first, then group and simply select the nth value from each grouping saving the multiple sorting routines for each group.

Collectives™ on Stack Overflow

k_th smallest element per group in R

3 Answers 3

Comments

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Related