3

I would like to group my data frame by a variable, summarize another variable, but keep all other associated columns.

In Applying group_by and summarise on data while keeping all the columns' info the accepted answer is to use filter() or slice(), which works fine if the answer exists in the data already (i.e. min, max) but this doesn't work if you would like to use a function that generates a new answer (i.e. sum, mean).

In Applying group_by and summarise(sum) but keep columns with non-relevant conflicting data? the accepted answer is to use all the the columns you would like to keep as part of the grouping variable. But this seems like an ineffective solution if you have many columns you would like to keep. For example, the data I'm working with has 26 additional columns.

The best solution I've come up with is to split-apply-combine. But this seems clunky - surely there must be a solution that can be done in a single pipeline.

Example:

location <- c("A", "A", "B", "B", "C", "C") date <- c("1", "2", "1", "2", "1", "2") count <- c(3, 6, 4, 2, 7, 5) important_1 <- c(1,1,2,2,3,3) important_30 <- c(4,4,5,5,6,6) df <- data.frame(location = location, date = date, count = count, important_1 = important_1, important_30 = important_30) 

I want to summarize the counts that happened on different dates at the same location. I want to keep all the important (imagine there are 30 instead of 2).

My solution so far:

check <- df %>% group_by(location) %>% summarise(count = sum(count)) add2 <- df %>% select(-count, -date) %>% distinct() results <- merge(check, add2) 

Is there a way I could accomplish this in a single pipeline? I'd rather keep it organized and avoid creating new objects if possible.

1
  • 1
    important_2 is not defined Commented Jun 22, 2020 at 21:10

1 Answer 1

3

We can create a column with mutate and then apply distinct

library(dplyr) df %>% group_by(location) %>% mutate(count = sum(count)) %>% select(-date) %>% distinct(location, important_1, important_30, .keep_all = TRUE) 

If there are multiple column names, we can also use syms to convert to symbol and evaluate (!!!)

df %>% group_by(location) %>% mutate(count = sum(count)) %>% select(-date) %>% distinct(location, !!! rlang::syms(names(.)[startsWith(names(.), 'important')]), .keep_all = TRUE) 
Sign up to request clarification or add additional context in comments.

3 Comments

The first solution would work well I think. The second solution wouldn't work well with my actual data because the "important" columns don't actually have the word important in them. I tried to make my example simple, but perhaps I should have made it more similar to the real data
@canderson156 In that case you can use index to subset the names i.e. rlang::syms(names(.)[c(1, 2, 3, ...)]
@canderson156 also, perhaps you can do group_by_at i.e. df %>% group_by_at("location", names(.)[4:5)) %>% summarise(count = sum(count))

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.