Mathematical operation for removing duplicate rows in a matrix

Question

I am using the GraphBLAS C API (https://graphblas.org/) which provides an interface for performing mathematical operations on sparse matrices. Given an adjacency matrix $\mathbf{A}: \mathbb{R}^{n \times n}$, I would like to remove duplicate rows from the matrix. Instead of implementing this in the standard $\mathcal{O}(n^3)$ way where each row must be checked against every subsequent row, I would like to try and take advantage of built-in mathematical operations from the API.

For instance in the matrix

$$ \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} $$ either the first or third row should be deleted (which one matters not).

Does anyone know of such a mathematical operation that can take $\mathbf{A}$ and multiply/divide/<operate> with either itself or another $n \times n$ matrix (or vector) such that duplicate rows are deleted (but one of the duplicates is kept)?

If this is off-topic, let me know and I can move over to Math SE (although they will probably tell me to bug off :)

The naive approach described is $O(n^3)$. You could insert each row to hash map to find duplicates, this would be $O(n^2)$. — Kenneth Kho
– Kenneth Kho, Commented Jan 19, 2024 at 18:27
Right, but I am really trying to find a way to capture the essence of this within a linear algebraic formulation. — codeing_monkey
– codeing_monkey, Commented Jan 19, 2024 at 18:35

D.W. · Accepted Answer · 2024-01-19 21:08:32Z

A more efficient approach is to hash each row, insert them into a hash table, and find duplicates in that way. The running time will be $O(n^2)$.

You can implement something like this using only linear algebra operations. In particular, pick a random column vector $v$. Multiply using your BLAS API to obtain the product $w=Av$. Now if $A_i=A_j$ (i.e., if the $i$th and $j$th rows of $A$ are duplicates), then you will have $w_i=w_j$. So, it suffices to sort the entries of $w$ (which brings all duplicates next to each other), find all duplicates, and check whether the corresponding rows of $A$ are duplicates. This basically uses a linear function as the hash function, so that you can hash all the rows efficiently using your BLAS API.

I suggest that one reasonable way to instantiate this scheme is to pick $k \ge 2\lg n$, choose $v$ so that each entry of $v$ is chosen uniformly at random from $0,1,2,\dots,2^k-1$, and perform all arithmetic modulo $2^k$. With this instantiation, you can expect very few "false alarms" (i.e., very few cases where $w_i=w_j$ but $A_i\ne A_j$). Or, you can pick $k$ in a way that is convenient, and deal with the false alarms. Or, you can replace $v$ with a $n \times 2$ or $n \times 3$ matrix, compute a matrix product, and look for duplicate rows in the resulting $n\times 2$ or $n \times 3$ matrix. There are many possibilities which you can experiment with, depending on the typical size of $n$ and the performance characteristics of your BLAS.

Stack Exchange Network

Mathematical operation for removing duplicate rows in a matrix

1 Answer 1

Hot Network Questions

Mathematical operation for removing duplicate rows in a matrix

1 Answer 1

Related

Hot Network Questions