Filtering rows in a group based on group properties

Question

Suppose I have a tibble with a grouping variable and a logical variable that indicates whether a row is a primary response for that group.

I want to do the following:

If any row in a group is marked as is_primary, keep that row but none of the others in the group
If no row in a group is marked as is_primary, keep them all
Filter the rows based on the above

Here is some example data:

library(tidyverse)
data <- tibble(
  group = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  is_primary = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

In the above example, I'd like to keep all the A rows, because there is no row with is_primary==TRUE, keep only the second B row, and keep the last two C rows.

I thought the obvious solution would be something like:

data %>%
  group_by(group) %>%
  mutate(keep_row = ifelse(any(is_primary), is_primary, TRUE))

But this results in the following, which doesn't meet the criteria above.

# A tibble: 9 x 4
# Groups:   group [3]
  group is_primary value keep_row
  <chr> <lgl>      <dbl> <lgl>
1 A     FALSE          1 TRUE
2 A     FALSE          2 TRUE
3 A     FALSE          3 TRUE
4 B     FALSE          4 FALSE
5 B     TRUE           5 FALSE
6 C     FALSE          6 FALSE
7 C     FALSE          7 FALSE
8 C     TRUE           8 FALSE
9 C     TRUE           9 FALSE

However, if I make an intermediary variable that indicates whether the group has a primary row, it works.

data %>%
  group_by(group) %>%
  mutate(has_primary = ifelse(any(is_primary), TRUE, FALSE)) %>%
  mutate(keep_row = ifelse(has_primary, is_primary, TRUE))

This results in keep_row being correct:

# A tibble: 9 x 5
# Groups:   group [3]
  group is_primary value has_primary keep_row
  <chr> <lgl>      <dbl> <lgl>       <lgl>
1 A     FALSE          1 FALSE       TRUE
2 A     FALSE          2 FALSE       TRUE
3 A     FALSE          3 FALSE       TRUE
4 B     FALSE          4 TRUE        FALSE
5 B     TRUE           5 TRUE        TRUE
6 C     FALSE          6 TRUE        FALSE
7 C     FALSE          7 TRUE        FALSE
8 C     TRUE           8 TRUE        TRUE
9 C     TRUE           9 TRUE        TRUE

What is going on in ifelse that the first solution doesn't work?

Community · Accepted Answer · 2020-06-20 09:12:55Z

We can use an if/else condition to return all the rows of a group when there is no TRUE element in 'is_primary', or else return only the rows where 'is_primary' is TRUE.

library(dplyr)
data %>%
  group_by(group) %>%
  filter(if(!any(is_primary)) TRUE else is_primary)
# A tibble: 6 x 3
# Groups:   group [3]
#  group is_primary value
#  <chr> <lgl>      <dbl>
#1 A     FALSE          1
#2 A     FALSE          2
#3 A     FALSE          3
#4 B     TRUE           5
#5 C     TRUE           8
#6 C     TRUE           9

It can also be done with a | condition

data %>%
  group_by(group) %>%
  filter(!any(is_primary) | is_primary)
# A tibble: 6 x 3
# Groups:   group [3]
#  group is_primary value
#  <chr> <lgl>      <dbl>
#1 A     FALSE          1
#2 A     FALSE          2
#3 A     FALSE          3
#4 B     TRUE           5
#5 C     TRUE           8
#6 C     TRUE           9

Or another option is

data %>%
  group_by(group) %>%
  filter(sum(is_primary) == 0 | is_primary)
# A tibble: 6 x 3
# Groups:   group [3]
#  group is_primary value
#  <chr> <lgl>      <dbl>
#1 A     FALSE          1
#2 A     FALSE          2
#3 A     FALSE          3
#4 B     TRUE           5
#5 C     TRUE           8
#6 C     TRUE           9

Or using slice

data %>% group_by(group) %>% slice(if(!any(is_primary)) row_number() else which(is_primary))

A data.table option of the above would be

library(data.table)
setDT(data)[data[, .I[!any(is_primary) | is_primary], by = group]$V1]

Or using base R

data[with(data, !ave(is_primary, group, FUN = any) | is_primary),]

The issue with ifelse is that according to ?ifelse

ifelse(test, yes, no)

If yes or no are too short, their elements are recycled. yes will be evaluated if and only if any element of test is true, and analogously for no.

In the OP's code

 ifelse(any(is_primary),TRUE,FALSE)

any returns a logical vector of length 1. According to ?any

The value is a logical vector of length one.

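Because the test has length 1, ifelse returns a result of length 1, and mutate then recycles that single value across the whole group. In the first attempt, ifelse(any(is_primary), is_primary, TRUE), this means only the first element of is_primary in each group is ever used.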
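
The length behaviour can be checked in isolation with base R. This is only a minimal sketch: is_primary_b is an illustrative stand-in for group B's column, and rep_len is used here to mimic the per-group recycling that mutate performs, not to reproduce dplyr's internals.

```r
# Illustrative stand-in for the is_primary values of group B.
is_primary_b <- c(FALSE, TRUE)

# The test any(...) has length 1, so ifelse() returns one element:
# the first element of `yes`.
short <- ifelse(any(is_primary_b), is_primary_b, TRUE)
print(short)

# Mimic the recycling of that single value to every row of the group.
recycled <- rep_len(short, length(is_primary_b))
print(recycled)

# Giving the test the full group length makes ifelse() work row by row.
full_test <- rep(any(is_primary_b), length(is_primary_b))
fixed <- ifelse(full_test, is_primary_b, TRUE)
print(fixed)
```

The last form is effectively what the intermediate has_primary column achieves in the question's second attempt: the test then has one element per row.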

This is demonstrating that either (a) the dplyr::if function is different than the base if function, or (b) that group_by is an implicit looping control construct.
Thanks, this is pretty complete. I guess I was surprised that the condition in ifelse controls length. To me, ifelse(TRUE,c(1,2),c(3,4)) returning [1] 1 is unexpected behavior. So it's not really about ifelse recycling values, but rather truncating to the length of the condition.

MrFlick · Accepted Answer · 2019-07-02 22:07:57Z

Your problem is that ifelse() returns a vector that's the length of the test input. When you pass ifelse(any(),...), that any() will only return a single value, so ifelse() returns a single value that's repeated for the group. You can see that with

x <- c(F, T, F, T, F)
ifelse(any(x), x, TRUE)
# [1] FALSE

Notice how only one value is returned. An ifelse() is not just a shortcut for a proper if/else statement. It is a vectorized function, so be careful not to use it when you are trying to conditionally execute code in a non-vectorized way.
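
As a rough sketch of that distinction, a plain if/else can make one decision per group and then return a whole vector. The example uses base R only; keep_rows is a hypothetical helper name, and the two vectors simply copy the columns from the question.

```r
# Columns copied from the example data in the question.
group      <- c("A", "A", "A", "B", "B", "C", "C", "C", "C")
is_primary <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

# Hypothetical helper: one if/else decision per group, returning a mask
# with one element per row of that group.
keep_rows <- function(p) {
  if (any(p)) p else rep(TRUE, length(p))
}

# ave() applies the helper to each group and puts the results back in
# the original row order.
keep <- ave(is_primary, group, FUN = keep_rows)
print(keep)
print(which(keep))
```

Here the condition is evaluated once per group, while the value returned still covers every row, which is what the single-length ifelse() call could not do.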

Another way to express your filter would be

data %>% group_by(group) %>% filter(any(is_primary) & is_primary | !any(is_primary))
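
Since & binds more tightly than | in R, that condition reads as (any(is_primary) & is_primary) | !any(is_primary). The base R sketch below checks, on the question's data only, that it selects the same rows as the shorter !any(is_primary) | is_primary; long_form and short_form are hypothetical names used just for this comparison.

```r
# Columns copied from the example data in the question.
group      <- c("A", "A", "A", "B", "B", "C", "C", "C", "C")
is_primary <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

# Hypothetical helpers wrapping the two filter conditions.
long_form  <- function(p) any(p) & p | !any(p)
short_form <- function(p) !any(p) | p

# Evaluate each condition group by group, keeping the original row order.
mask_long  <- ave(is_primary, group, FUN = long_form)
mask_short <- ave(is_primary, group, FUN = short_form)

print(identical(mask_long, mask_short))
print(which(mask_long))
```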
