Applying group_by and summarise(sum) but keep a large number of additional columns

Question

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.

In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (i.e. min, max) but this doesn't work if you would like to use a function that generates a new answer (i.e. sum, mean).

In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the the columns you would like to keep as part of the grouping variable. But this seems like an ineffective solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.

The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.

Example:

location <- c("A", "A", "B", "B", "C", "C") date <- c("1", "2", "1", "2", "1", "2") count <- c(3, 6, 4, 2, 7, 5) important_1 <- c(1,1,2,2,3,3) important_30 <- c(4,4,5,5,6,6) df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30)

I want to summarize the counts that happened on different dates at the same location. I want to keep all the important (imagine there are 30 instead of 2).

My solution so far:

check <- df %>% group_by(location) %>% summarise(count = sum(count)) add2 <- df %>% select(-count, -date) %>% distinct() results <- merge(check, add2)

Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.

important_2 is not defined

akrun
– akrun

2020-06-22 21:10:23 +00:00
Commented Jun 22, 2020 at 21:10 — akrun
– akrun, Commented Jun 22, 2020 at 21:10

akrun · Accepted Answer · 2020-06-22 21:13:15Z

We can create a column with mutate and then apply distinct

library(dplyr) df %>% group_by(location) %>% mutate(count = sum(count)) %>% select(-date) %>% distinct(location, important_1, important_30, .keep_all = TRUE)

If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)

df %>% group_by(location) %>% mutate(count = sum(count)) %>% select(-date) %>% distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE)

The first solution would work well I think. The second solution wouldn't work well with my actual data because the "important" columns don't actually have the word important in them. I tried to make my example simple, but perhaps I should have made it more similar to the real data
@canderson156 In that case you can use index to subset the names i.e. rlang::syms(names(.)[c(1, 2, 3, ...)]
@canderson156 also, perhaps you can do group_by_at i.e. df %>% group_by_at("location", names(.)[4:5)) %>% summarise(count = sum(count))

Collectives™ on Stack Overflow

Applying group_by and summarise(sum) but keep a large number of additional columns

1 Answer 1

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Linked

Related