95

I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the problem is that it only gives me two columns containing the grouped column and the column having the min value... but I need all the information of other columns related to the rows with the min values. Here is a simple example just to make it reproducible:

 data<- data.frame(a=1:10, b=c("a","a","a","b","b","c","c","d","d","d"), c=c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d= c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med")) d<- data %>% group_by(b) %>% summarise(min_values= min(c)) d b min_values 1 a 1.2 2 b 1.7 3 c 3.1 4 d 2.2 

So, I need to have also the information related to columns a and d, however, since I have duplications in the values in column c I cannot merge them based on the min_value column... I was wondering if there is any way to keep other columns' information when we are using dplyr package.

I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise" but none of the addresses my problem.

1
  • 2
    Exactly how do you propose the resulting data.frame would look? How would the other data look when compressed into a single row? Commented May 4, 2015 at 7:11

4 Answers 4

75

You can use group_by without summarize:

data %>% group_by(b) %>% mutate(min_values = min(c)) %>% ungroup() 
Sign up to request clarification or add additional context in comments.

6 Comments

Thank you so much Bergant, the thing is that your method gives me all the rows... but it is important for me to know for example the min value is related to the number 4 in col "a". Docendo's answer below is exactly what I needed. Thanks anyway for your time on answering this :-)
this answer. my 'duh' moment of the week.
So simple, yet so powerful.
@bergant this didn't work for me, the result still shows only 2 columns after I summarise, even though I included ungroup().
"ungroup()" should be replaced with slice(1), this will reduce the rows to one per group & can also work with multiple summary columns
|
63

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups and so the results of a) and b) are the same. If there were duplicated minima, approach a) would return each minima per group while b) would only return one minimum (the first) in each group.

a)

> data %>% group_by(b) %>% filter(c == min(c)) #Source: local data frame [4 x 4] #Groups: b # # a b c d #1 1 a 1.2 small #2 4 b 1.7 larg #3 6 c 3.1 med #4 10 d 2.2 med 

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L) #Source: local data frame [4 x 4] #Groups: b # # a b c d #1 1 a 1.2 small #2 4 b 1.7 larg #3 6 c 3.1 med #4 10 d 2.2 med 

b)

> data %>% group_by(b) %>% slice(which.min(c)) #Source: local data frame [4 x 4] #Groups: b # # a b c d #1 1 a 1.2 small #2 4 b 1.7 larg #3 6 c 3.1 med #4 10 d 2.2 med 

4 Comments

Thanks a million Docendo for the answer. This is exactly what I was looking for :-)
Exactly what I needed! And I discovered the function slice as a bonus, thx!
What if you are trying to use summarize to get information that is not contained in the original data, and therefore cannot be "filtered"? for example, sum or mean?
Late to the party, but you can still filter by the return of functions. For example, you can do df %>% group_by(x) %>% filter(n() > 10) to filter for groups with more than ten observations, without having assigned n() to any previous column.
3

Using sqldf:

library(sqldf) # Two options: sqldf('SELECT * FROM data GROUP BY b HAVING min(c)') sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b') 

Output:

 a b c d 1 1 a 1.2 small 2 4 b 1.7 larg 3 6 c 3.1 med 4 10 d 2.2 med 

Comments

2

With dplyr 1.1.0, you can use .by in mutate, summarize, filter and slice to do temporary grouping. With mutate, all rows and columns are kept:

data %>% mutate(min_values = min(c), .by = b) 

With filter, or slice, rows are summarized and all columns are kept:

data %>% slice_min(c, .by = b) data %>% filter(c = min(c), .by = b) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.