Checking whether sets of columns are the same, row wise in R, in any order

Question

I am working in R, and would prefer a dplyr solution if possible.

sample data:

data.frame( col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"), col3 = rep("a", 4L), col4 = c("a", "b", "d", "a"), col5 = c("a", "a", "c", "d"), col6 = rep(c("b", "a"), each = 2L) )

col1	col2	col3	col4	col5	col6
a	a	a	a	a	b
b	b	a	b	a	b
c	d	a	d	c	a
d	a	a	a	d	a

Question

I would like to know for each row, whether col1, col2 and col3 are the same as col4, col5 and col6, but the order of col1 - col3 and col4 - col6 should be ignored.

So for row 1, if col1 - col3 contained a,a,b respectively, and col4 - col6 contained b,a,a respectively, then that would be considered a match.

Desired result

Have put a note on "assessment" column to aid understanding

col1	col2	col3	col4	col5	col6	assessment
a	a	a	a	a	b	FALSE (because 1-3 are not same as 4-6)
b	b	a	b	a	b	TRUE (because 1-3 are the same as 4-6, if ignore order)
c	d	a	d	c	a	TRUE (because 1-3 are the same as 4-6, if ignore order)
d	a	a	a	d	a	TRUE (because 1-3 are the same as 4-6, if ignore order)

s_baldur · Accepted Answer · 2025-11-01 08:45:31Z

Base R:

df$assessment <- apply(df, 1, \(x) identical(table(x[1:3]), table(x[4:6]))) # col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE

Reproducible data:

df <- data.frame( col1 = c("a", "b", "c", "d"), col2 = c("a", "b", "d", "a"), col3 = c("a", "a", "a", "a"), col4 = c("a", "b", "d", "a"), col5 = c("a", "a", "c", "d"), col6 = c("b", "b", "a", "a") )

PS: Why table() and indentical instead of sort(), ==, all()? I would expect it to scale better with the number of columns (given low number of unique values). Example:

df <- as.data.frame(lapply(1:600, \(x) sample(letters, size = 4000, replace = TRUE))) bench::mark( apply(df, 1, \(x) identical(table(x[1:300]), table(x[301:600]))), apply(df, 1, \(x) all(sort(x[1:300]) == sort(x[301:600]))) ) # expression min median `itr/sec` mem_alloc # <bch:expr> <bch> <bch:> <dbl> <bch:byt> # 1 apply(df, 1, function(x) identical(table(x[1:300]), table… 1.68s 1.68s 0.594 333MB # 2 apply(df, 1, function(x) all(sort(x[1:300]) == sort(x[301… 9.01s 9.01s 0.111 191MB

PS 2: Replacing table(x) with collapse::fcount(x, sort = TRUE) gives a further speedup.

Why the preference for isTRUE(all.equal(..)) instead of identical? Just curious ... I can see isTRUE(all.equal(.., check.attributes=FALSE)) in general (not relevant for this question), but otherwise won't it return the same with slightly less overhead?

Isaac · Accepted Answer · 2024-01-08 13:42:55Z

Using dplyr you can do the following:

df %>% rowwise() %>% mutate(result = all(sort(c_across(col1:col3)) == sort(c_across(col4:col6)))) # A tibble: 4 × 7 # Rowwise: col1 col2 col3 col4 col5 col6 result <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE

FWW rowwise() tends to be much slower than purrr:pmap() - see here. You could do use the same approach with pmap(), i.e. df |> mutate( assessment = purrr::pmap(df, \(col1, col2, col3, col4, col5, col6) all(sort(c(col4, col5, col6)) == sort(c(col1, col2, col3))))).

jpsmith · Accepted Answer · 2024-01-08 13:46:34Z

In base R you could use vapply and sort:

df$assessment <- vapply(seq_len(nrow(df)), \(x) all(sort(unlist(df[x,1:3])) == sort(unlist(df[x,4:6]))), logical(1))

Output:

# col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE

SamR · Accepted Answer · 2024-01-08 14:09:01Z

Avoid iterating over rows

Generally iterating over rows is slow, and I have found dplyr::rowwise() approaches become very slow with more than a few thousand rows. It tends to be faster to use purrr::pmap() and much faster to iterate over columns.

base R approach

You could take the transpose of the relevant columns and iterate over the columns of that.

cols <- paste0("col", 1:6) df$assessment <- df[cols] |> t() |> data.frame() |> sapply(\(x) all(sort(x[1:3]) == sort(x[4:6]))) # col1 col2 col3 col4 col5 col6 assessment # 1 a a a a a b FALSE # 2 b b a b a b TRUE # 3 c d a d c a TRUE # 4 d a a a d a TRUE

`tidyverse` approach: convert from wide to long

Alternatively if you want to remain in the tidyverse you could convert from wide to long:

df %>% mutate( assessment = . |> mutate(id = row_number()) |> tidyr::pivot_longer( -id, names_to = "col", names_transform = readr::parse_number ) |> group_by(id) |> summarise( assessment = all( sort( value[col %in% 1:3] ) == sort( value[col %in% 4:6] ) ) ) |> pull( assessment ) )

This is more verbose but I suspect it will be considerably faster with any reasonably-sized dataset.

iroha · Accepted Answer · 2024-01-08 14:25:57Z

As has been noted you should avoid rowwise operations. Here's an alternative that compares sets using a helper function that efficiently sorts by row so that comparisons are completely vectorized.

library(dplyr) f <- function(set1, set2) { s1 <- as.matrix(pick({{set1}})) s2 <- as.matrix(pick({{set2}})) row_sort <- function(m) matrix(m[order(row(m), m)], ncol = ncol(m), byrow = TRUE) !rowSums(row_sort(s1) != row_sort(s2)) > 0 } dat %>% mutate(assessment = f(col1:col3, col4:col6)) col1 col2 col3 col4 col5 col6 assessment 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE

Don't want to post it because it's not a distinct solution, but a slightly more compact solution could be f <- function(set1, set2) { set1_sorted <- apply(as.matrix(pick({{set1}})), 1, sort) set2_sorted <- apply(as.matrix(pick({{set2}})), 1, sort) colMeans(set1_sorted == set2_sorted) == 1 }.
It's more compact but it's sorting each individual row and is much less efficient. My solution sorts the matrix by row in a single call, not a call for each row.

tmfmnk · Accepted Answer · 2024-01-08 16:25:29Z

A dplyr and vecsets option could be:

df %>% rowwise() %>% mutate(cond = vsetequal(c_across(col1:col3), c_across(col4:col6), multiple = TRUE)) col1 col2 col3 col4 col5 col6 cond <chr> <chr> <chr> <chr> <chr> <chr> <lgl> 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE

The same idea with purrr::pmap():

df %>% mutate(cond = pmap_lgl(across(col1:col6), ~ vsetequal(c(...)[1:3], c(...)[4:6], multiple = TRUE)))

This is an unreasonably inefficient solution, but out of curiosity:

df %>% rowwise() %>% mutate(cond = toString(sort(c_across(col1:col3))) == toString(sort(c_across(col4:col6))))

The same idea with purrr:pmap():

df %>% mutate(cond = pmap_lgl(across(col1:col6), ~ toString(sort(c(...)[1:3])) == toString(sort(c(...)[4:6]))))

Using the transpose logic from @SamR with vecsets:

df %>% mutate(cond = map2_lgl(.x = across(col1:col3) %>% t() %>% data.frame(), .y = across(col4:col6) %>% t() %>% data.frame(), vsetequal))

The same approach using data.table::transpose():

df %>% mutate(cond = map2_lgl(.x = data.table::transpose(across(col1:col3)), .y = data.table::transpose(across(col4:col6)), vsetequal))

ThomasIsCoding · Accepted Answer · 2024-01-08 14:33:51Z

Try the code below with split.default + colMeans

df$assessment <- colMeans( do.call( `==`, lapply(split.default( df, grepl("[1-3]$", names(df)) ), \(d) apply(d, 1, sort)) ) ) == 1

which should give

> df col1 col2 col3 col4 col5 col6 assessment 1 a a a a a b FALSE 2 b b a b a b TRUE 3 c d a d c a TRUE 4 d a a a d a TRUE

TarJae · Accepted Answer · 2024-01-08 15:41:33Z

This one is quite verbose, but I can't resist. Here is one with pivoting:

library(dplyr) library(tidyr) df %>% pivot_longer(cols = starts_with("col"), names_to = "col_set") %>% group_by(group = (row_number() - 1) %/% ncol(df) + 1) %>% mutate(x = lead(value, 3)) %>% na.omit() %>% mutate(across(c(value, x), ~sort(.))) %>% summarize(check = all(value == x), .groups = "drop") %>% bind_cols(df) %>% select(-group) # A tibble: 4 × 7 check col1 col2 col3 col4 col5 col6 <lgl> <chr> <chr> <chr> <chr> <chr> <chr> 1 FALSE a a a a a b 2 TRUE b b a b a b 3 TRUE c d a d c a 4 TRUE d a a a d a

Collectives™ on Stack Overflow

Checking whether sets of columns are the same, row wise in R, in any order

8 Answers 8

1 Comment

1 Comment

Comments

Avoid iterating over rows

base R approach

`tidyverse` approach: convert from wide to long

Comments

2 Comments

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

8 Answers 8

1 Comment

1 Comment

Comments

Avoid iterating over rows

base R approach

tidyverse approach: convert from wide to long

Comments

2 Comments

Comments

Comments

Comments

Linked

Related

`tidyverse` approach: convert from wide to long