4

I have a data table, dt, that looks like

 location year value NYC 2026 1 NYC 2026 2 NYC 2026 3 NYC 2026 4 NYC 2026 5 LA 2026 6 LA 2026 7 LA 2026 8 LA 2026 9 LA 2026 10 

I would like to group them by city and year and find the 2nd smallest element in the column value, per each group, where the desired result looks like:

 location year value NYC 2026 2 LA 2026 7 

dt %>% grou_by(location, year) %>% nth(value, 2)

Would not work. Any help is appreciated.

The data table above can be created by:

dt <- structure(list(location = c("NYC", "NYC", "NYC","NYC", "NYC", "LA", "LA", "LA", "LA", "LA"), year = c(2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026, 2026), value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), class = "data.table", row.names = c(NA, -10L)) 
0

3 Answers 3

7

One dplyr possibility could be:

df %>% group_by(location) %>% arrange(value) %>% slice(2) 

Here it groups by "location" column, arranges the values according "value" column and then keeps the second element.

 location year value <chr> <int> <int> 1 LA 2026 7 2 NYC 2026 2 

Or if the values in "value" column could be duplicate, then you can do:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% arrange(value) %>% slice(2) 

Or using filter() instead of slice():

df %>% group_by(location) %>% arrange(value) %>% filter(row_number() == 2) 

The same considering also possible duplicates:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% arrange(value) %>% filter(row_number() == 2) 

Or using filter() and dense_rank():

df %>% group_by(location) %>% filter(dense_rank(value) == 2) 

The same considering also possible duplicates:

df %>% group_by(location) %>% distinct(value, .keep_all = TRUE) %>% filter(dense_rank(value) == 2) 
Sign up to request clarification or add additional context in comments.

Comments

4

Using summarize to work with group_by:

> dt %>% group_by(location, year) %>% arrange(value) %>% summarize(value = nth(value, 2)) # A tibble: 2 x 3 # Groups: location [2] location year value <chr> <dbl> <dbl> 1 LA 2026 7 2 NYC 2026 2 

Comments

1

Since you only need the 2nd element, a partial sort shouldn't be out of the question. I don't know if it is supported by dplyr or data.table, but it is with base R sorting (with restricted options), eg. something along the lines of

with(dt, lapply(split(dt, interaction(location, year)), function(x) sort.int(x$value, partial=2))) 

I doubt it would be faster than dplyr or data.table even if they are sorting fully given their optimizations, but maybe it would be worth looking into if efficiency is a concern.

Oh, you could also sort first, then group and simply select the nth value from each grouping saving the multiple sorting routines for each group.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.