3
$\begingroup$

I am using the GraphBLAS C API (https://graphblas.org/) which provides an interface for performing mathematical operations on sparse matrices. Given an adjacency matrix $\mathbf{A}: \mathbb{R}^{n \times n}$, I would like to remove duplicate rows from the matrix. Instead of implementing this in the standard $\mathcal{O}(n^3)$ way where each row must be checked against every subsequent row, I would like to try and take advantage of built-in mathematical operations from the API.

For instance in the matrix

$$ \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} $$ either the first or third row should be deleted (which one matters not).

Does anyone know of such a mathematical operation that can take $\mathbf{A}$ and multiply/divide/<operate> with either itself or another $n \times n$ matrix (or vector) such that duplicate rows are deleted (but one of the duplicates is kept)?

If this is off-topic, let me know and I can move over to Math SE (although they will probably tell me to bug off :)

$\endgroup$
2
  • 2
    $\begingroup$ The naive approach described is $O(n^3)$. You could insert each row to hash map to find duplicates, this would be $O(n^2)$. $\endgroup$ Commented Jan 19, 2024 at 18:27
  • 1
    $\begingroup$ Right, but I am really trying to find a way to capture the essence of this within a linear algebraic formulation. $\endgroup$ Commented Jan 19, 2024 at 18:35

1 Answer 1

1
$\begingroup$

A more efficient approach is to hash each row, insert them into a hash table, and find duplicates in that way. The running time will be $O(n^2)$.

You can implement something like this using only linear algebra operations. In particular, pick a random column vector $v$. Multiply using your BLAS API to obtain the product $w=Av$. Now if $A_i=A_j$ (i.e., if the $i$th and $j$th rows of $A$ are duplicates), then you will have $w_i=w_j$. So, it suffices to sort the entries of $w$ (which brings all duplicates next to each other), find all duplicates, and check whether the corresponding rows of $A$ are duplicates. This basically uses a linear function as the hash function, so that you can hash all the rows efficiently using your BLAS API.

I suggest that one reasonable way to instantiate this scheme is to pick $k \ge 2\lg n$, choose $v$ so that each entry of $v$ is chosen uniformly at random from $0,1,2,\dots,2^k-1$, and perform all arithmetic modulo $2^k$. With this instantiation, you can expect very few "false alarms" (i.e., very few cases where $w_i=w_j$ but $A_i\ne A_j$). Or, you can pick $k$ in a way that is convenient, and deal with the false alarms. Or, you can replace $v$ with a $n \times 2$ or $n \times 3$ matrix, compute a matrix product, and look for duplicate rows in the resulting $n\times 2$ or $n \times 3$ matrix. There are many possibilities which you can experiment with, depending on the typical size of $n$ and the performance characteristics of your BLAS.

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.