17

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame( col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"), col3 = rep("a", 4L), col4 = c("a", "b", "d", "a"), col5 = c("a", "a", "c", "d"), col6 = rep(c("b", "a"), each = 2L) ) 
col1 col2 col3 col4 col5 col6
a a a a a b
b b a b a b
c d a d c a
d a a a d a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1 col2 col3 col4 col5 col6 assessment
a a a a a b FALSE (because 1-3 are not same as 4-6)
b b a b a b TRUE (because 1-3 are the same as 4-6, if ignore order)
c d a d c a TRUE (because 1-3 are the same as 4-6, if ignore order)
d a a a d a TRUE (because 1-3 are the same as 4-6, if ignore order)
0

8 Answers 8

16

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6]))) # col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE 

Reproducible data:

df <- data.frame( col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"), col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"), col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a") ) 

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE))) bench::mark( apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))), apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600]))) ) # expression min median `itr/sec` mem_alloc # <bch:expr> <bch> <bch:> <dbl> <bch:byt> # 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s 1.68s 0.594 333MB # 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s 9.01s 0.111 191MB 

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a further speedup.

Sign up to request clarification or add additional context in comments.

1 Comment

Why the preference for isTRUE(all.equal(..)) instead of identical? Just curious ... I can see isTRUE(all.equal(.., check.attributes=FALSE)) in general (not relevant for this question), but otherwise won't it return the same with slightly less overhead?
12

Using dplyr you can do the following:

df %>% rowwise() %>% mutate(result = all(sort(c_across(col1:col3)) == sort(c_across(col4:col6)))) # A tibble: 4 × 7 # Rowwise: col1 col2 col3 col4 col5 col6 result <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE 

1 Comment

FWW rowwise() tends to be much slower than purrr:pmap() - see here. You could do use the same approach with pmap(), i.e. df |> mutate( assessment = purrr::pmap(df, \(col1, col2, col3, col4, col5, col6) all(sort(c(col4, col5, col6)) == sort(c(col1, col2, col3))))).
7

In base R you could use vapply and sort:

df$assessment <- vapply(seq_len(nrow(df)), \(x) all(sort(unlist(df[x,1:3])) == sort(unlist(df[x,4:6]))), logical(1)) 

Output:

# col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE 

Comments

7

Avoid iterating over rows

Generally iterating over rows is slow, and I have found dplyr::rowwise() approaches become very slow with more than a few thousand rows. It tends to be faster to use purrr::pmap() and much faster to iterate over columns.

base R approach

You could take the transpose of the relevant columns and iterate over the columns of that.

cols <- paste0("col", 1:6) df$assessment <- df[cols] |> t() |> data.frame() |> sapply(\(x) all(sort(x[1:3]) == sort(x[4:6]))) # col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE 

tidyverse approach: convert from wide to long

Alternatively if you want to remain in the tidyverse you could convert from wide to long:

df %>% mutate( assessment = . |> mutate(id = row_number()) |> tidyr::pivot_longer( -id, names_to = "col", names_transform = readr::parse_number ) |> group_by(id) |> summarise( assessment = all( sort( value[col %in% 1:3] ) == sort( value[col %in% 4:6] ) ) ) |> pull( assessment ) ) 

This is more verbose but I suspect it will be considerably faster with any reasonably-sized dataset.

Comments

7

As has been noted you should avoid rowwise operations. Here's an alternative that compares sets using a helper function that efficiently sorts by row so that comparisons are completely vectorized.

library(dplyr) f <- function(set1, set2) { s1 <- as.matrix(pick({{set1}})) s2 <- as.matrix(pick({{set2}})) row_sort <- function(m) matrix(m[order(row(m), m)], ncol = ncol(m), byrow = TRUE) !rowSums(row_sort(s1) != row_sort(s2)) > 0 } dat %>% mutate(assessment = f(col1:col3, col4:col6)) col1 col2 col3 col4 col5 col6 assessment 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE 

2 Comments

Don't want to post it because it's not a distinct solution, but a slightly more compact solution could be f <- function(set1, set2) { set1_sorted <- apply(as.matrix(pick({{set1}})), 1, sort) set2_sorted <- apply(as.matrix(pick({{set2}})), 1, sort) colMeans(set1_sorted == set2_sorted) == 1 }.
It's more compact but it's sorting each individual row and is much less efficient. My solution sorts the matrix by row in a single call, not a call for each row.
7

A dplyr and vecsets option could be:

df %>% rowwise() %>% mutate(cond = vsetequal(c_across(col1:col3), c_across(col4:col6), multiple = TRUE)) col1 col2 col3 col4 col5 col6 cond <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE 

The same idea with purrr::pmap():

df %>% mutate(cond = pmap_lgl(across(col1:col6), ~ vsetequal(c(...)[1:3], c(...)[4:6], multiple = TRUE))) 

This is an unreasonably inefficient solution, but out of curiosity:

df %>% rowwise() %>% mutate(cond = toString(sort(c_across(col1:col3))) == toString(sort(c_across(col4:col6)))) 

The same idea with purrr:pmap():

df %>% mutate(cond = pmap_lgl(across(col1:col6), ~ toString(sort(c(...)[1:3])) == toString(sort(c(...)[4:6])))) 

Using the transpose logic from @SamR with vecsets:

df %>% mutate(cond = map2_lgl(.x = across(col1:col3) %>% t() %>% data.frame(), .y = across(col4:col6) %>% t() %>% data.frame(), vsetequal)) 

The same approach using data.table::transpose():

df %>% mutate(cond = map2_lgl(.x = data.table::transpose(across(col1:col3)), .y = data.table::transpose(across(col4:col6)), vsetequal)) 

Comments

5

Try the code below with split.default + colMeans

df$assessment <- colMeans( do.call( `==`, lapply(split.default( df, grepl("[1-3]$", names(df)) ), \(d) apply(d, 1, sort)) ) ) == 1 

which should give

> df col1 col2 col3 col4 col5 col6 assessment 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE 

Comments

5

This one is quite verbose, but I can't resist. Here is one with pivoting:

library(dplyr) library(tidyr) df %>% pivot_longer(cols = starts_with("col"), names_to = "col_set") %>% group_by(group = (row_number() - 1) %/% ncol(df) + 1) %>% mutate(x = lead(value, 3)) %>% na.omit() %>% mutate(across(c(value, x), ~sort(.))) %>% summarize(check = all(value == x), .groups = "drop") %>% bind_cols(df) %>% select(-group) # A tibble: 4 × 7 check col1 col2 col3 col4 col5 col6 <lgl> <chr> <chr> <chr> <chr> <chr> <chr> 1 FALSE a a a a a b 2 TRUE b b a b a b 3 TRUE c d a d c a 4 TRUE d a a a d a 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.