Applying group_by and summarise on data while keeping all the columns' info

Question

I have a large dataset with 22000 rows and 25 columns. I am trying to group the dataset by one column and take the minimum of another column within each group. The problem is that the result contains only two columns, the grouping column and the column with the minimum, but I need the values of all the other columns for the rows that hold those minima. Here is a small reproducible example:

    library(dplyr)

    data <- data.frame(
      a = 1:10,
      b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
      c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
      d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
    )

    # One row per group, but only the grouping column and the minimum survive
    d <- data %>%
      group_by(b) %>%
      summarise(min_values = min(c))
    d

      b min_values
    1 a        1.2
    2 b        1.7
    3 c        3.1
    4 d        2.2

So I also need the information in columns a and d. However, because column c contains duplicated values, I cannot merge the result back onto the original data using the min_values column. Is there a way to keep the other columns' information when using the dplyr package?

I have found some explanations here, "dplyr: group_by, subset and summarise", and here, "Finding percentage in a sub-group using group_by and summarise", but neither of them addresses my problem.

Exactly how do you propose the resulting data.frame would look? How would the other data look when compressed into a single row?
– r2evans, Commented May 4, 2015 at 7:11

bergant · Accepted Answer · 2015-05-04 07:18:39Z

75

You can use group_by without summarize:

    data %>%
      group_by(b) %>%
      mutate(min_values = min(c)) %>%
      ungroup()
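To make the effect of mutate concrete, here is a minimal sketch on the question's sample data. It keeps all 10 rows and simply repeats each group's minimum on every row of that group. The is_min column is a hypothetical helper added only for illustration (it is not part of the answer above); it flags which row actually holds the minimum.

```r
library(dplyr)

data <- data.frame(
  a = 1:10,
  b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
  c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
  d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
)

with_min <- data %>%
  group_by(b) %>%
  mutate(
    min_values = min(c),      # group minimum, repeated on every row of the group
    is_min     = c == min(c)  # hypothetical flag: TRUE on the row(s) holding the minimum
  ) %>%
  ungroup()

dim(with_min)                 # all 10 rows are kept; two columns are added
with_min$a[with_min$is_min]   # values of a on the rows that hold each group's minimum
```

Because nothing is collapsed, the result still has one row per original observation; getting one row per group needs a further filter or slice step.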


6 Comments

Momeneh Foroutan Over a year ago

Thank you so much Bergant, the thing is that your method gives me all the rows... but it is important for me to know for example the min value is related to the number 4 in col "a". Docendo's answer below is exactly what I needed. Thanks anyway for your time on answering this :-)

Brian D Over a year ago

this answer. my 'duh' moment of the week.

Karol Daniluk Over a year ago

So simple, yet so powerful.

Katya Over a year ago

@bergant this didn't work for me, the result still shows only 2 columns after I summarise, even though I included ungroup().

Aaron C Over a year ago

"ungroup()" should be replaced with slice(1), this will reduce the rows to one per group & can also work with multiple summary columns
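One caveat about the slice(1) suggestion, sketched on the question's sample data: slice(1) keeps the first row of each group in its current order, which is not necessarily the row holding the minimum. In group d the first row has c = 4.2 while min_values is 2.2, so columns a, c and d would not describe the minimum row. Using slice(which.min(c)) instead (an alternative, not what the comment proposes) picks the minimum row.

```r
library(dplyr)

data <- data.frame(
  a = 1:10,
  b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
  c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
  d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
)

# First row per group: min_values is right, but the other columns
# come from whichever row happens to be first.
first_rows <- data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  slice(1) %>%
  ungroup()

# Row holding the minimum per group: all columns describe that row.
min_rows <- data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  slice(which.min(c)) %>%
  ungroup()

first_rows$a  # 1 4 6 8   (row 8 is not the minimum of group d)
min_rows$a    # 1 4 6 10
```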


talat · Accepted Answer · 2015-05-04 08:09:30Z

Here are two options using a) filter and b) slice from dplyr. In this case no group has a duplicated minimum in column c, so a) and b) give the same result. If there were duplicated minima, approach a) would return every row that ties for the minimum in its group, while b) would return only one row (the first) per group.

a)

    > data %>% group_by(b) %>% filter(c == min(c))
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med

Or similarly

    > data %>% group_by(b) %>% filter(min_rank(c) == 1L)
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med

b)

    > data %>% group_by(b) %>% slice(which.min(c))
    #Source: local data frame [4 x 4]
    #Groups: b
    #
    #   a b   c     d
    #1  1 a 1.2 small
    #2  4 b 1.7  larg
    #3  6 c 3.1   med
    #4 10 d 2.2   med
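The difference between a) and b) only shows up when a group has a tied minimum, which the sample data does not. A minimal sketch with a small made-up data frame (tied is hypothetical, not from the question) where group x has its minimum twice:

```r
library(dplyr)

# Hypothetical data: group "x" has the minimum 1.5 on two rows
tied <- data.frame(
  a = 1:5,
  b = c("x", "x", "x", "y", "y"),
  c = c(1.5, 1.5, 2.0, 3.0, 4.0)
)

# a) filter keeps every row that equals the group minimum
by_filter <- tied %>%
  group_by(b) %>%
  filter(c == min(c)) %>%
  ungroup()

# b) which.min returns only the first position of the minimum
by_slice <- tied %>%
  group_by(b) %>%
  slice(which.min(c)) %>%
  ungroup()

by_filter$a  # 1 2 4
by_slice$a   # 1 4
```

Which one is appropriate depends on whether tied rows should all be reported or exactly one row per group is required.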

Thanks a million Docendo for the answer. This is exactly what I was looking for :-)
Exactly what I needed! And I discovered the function slice as a bonus, thx!
What if you are trying to use summarize to get information that is not contained in the original data, and therefore cannot be "filtered"? for example, sum or mean?
Late to the party, but you can still filter by the return of functions. For example, you can do df %>% group_by(x) %>% filter(n() > 10) to filter for groups with more than ten observations, without having assigned n() to any previous column.
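As a rough sketch of the two comments above on the question's sample data: a summary that is not a value of any single row (such as a group mean) can be attached with mutate, and filter can test a per-group value such as n() directly. The column names mean_c and n_rows are made up for illustration, and the threshold is 3 rather than 10 only because the sample groups are small.

```r
library(dplyr)

data <- data.frame(
  a = 1:10,
  b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
  c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
  d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
)

result <- data %>%
  group_by(b) %>%
  mutate(
    mean_c = mean(c),  # group mean, attached to every row of the group
    n_rows = n()       # group size, attached to every row of the group
  ) %>%
  filter(n() >= 3) %>% # keep only groups with at least three rows
  ungroup()

unique(result$b)       # groups "a" and "d" have three rows each
nrow(result)           # 6
```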

mpalanco · Accepted Answer · 2015-07-14 12:15:58Z

Using sqldf:

    library(sqldf)

    # Two options:
    sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
    sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output:

       a b   c     d
    1  1 a 1.2 small
    2  4 b 1.7  larg
    3  6 c 3.1   med
    4 10 d 2.2   med

Maël · Accepted Answer · 2022-12-15 11:11:55Z

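With dplyr 1.1.0 or later, you can use .by in mutate, summarize, filter and slice to group temporarily, for that one call only. With mutate, all rows and columns are kept:

    data %>% mutate(min_values = min(c), .by = b)

With filter or slice_min, only the rows holding each group's minimum are kept, along with all columns. Note that the comparison in filter needs ==, and that slice_min (like the other slice_* helpers) spells the argument by rather than .by:

    data %>% slice_min(c, by = b)
    data %>% filter(c == min(c), .by = b)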
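A minimal sketch of per-operation grouping on the question's sample data, assuming dplyr 1.1.0 or later is installed. One detail worth checking against the dplyr documentation for your version: slice() takes .by, while the slice_*() helpers such as slice_min() take by (no dot). The result is never a grouped data frame, so no ungroup() is needed.

```r
library(dplyr)
stopifnot(packageVersion("dplyr") >= "1.1.0")  # .by / by need dplyr 1.1.0+

data <- data.frame(
  a = 1:10,
  b = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
  c = c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2),
  d = c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")
)

# All rows kept, group minimum added as a new column
kept_all <- data %>% mutate(min_values = min(c), .by = b)

# Only the rows holding each group's minimum, all columns kept
min_by_filter <- data %>% filter(c == min(c), .by = b)
min_by_slice  <- data %>% slice_min(c, by = b)

is_grouped_df(kept_all)  # FALSE: the grouping does not persist
min_by_filter$a          # 1 4 6 10
```

Like filter(c == min(c)), slice_min() keeps tied rows by default (with_ties = TRUE); pass with_ties = FALSE to force one row per group.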
