
In numpy, is there a nice idiomatic way of testing if all rows are distinct in a 2d array?

I thought I could do

len(np.unique(arr)) == len(arr) 

but this doesn't work at all. For example,

arr = np.array([[1,2,3],[1,2,4]])

np.unique(arr)
Out[4]: array([1, 2, 3, 4])
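For reference, newer NumPy releases (1.13 and up) add an axis argument to np.unique, so a row-wise version of the same test should work; a minimal sketch, assuming NumPy >= 1.13:

import numpy as np

arr = np.array([[1, 2, 3], [1, 2, 4]])
# axis=0 makes np.unique treat each row as one unit instead of flattening
len(np.unique(arr, axis=0)) == len(arr)   # True: both rows are distinct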
  • Note: stackoverflow.com/questions/16970982/… is about FINDING the unique row, OP is about TESTING if the rows are all unique. Different questions. Commented Oct 2, 2014 at 17:10
  • Several interesting answers to how to drop nonunique rows/columns: mail.scipy.org/pipermail/scipy-user/2011-December/031193.html. You can then just see if the reduced array is the same as the original. If you use pandas, there is an efficient implementation to do such a thing. Commented Oct 2, 2014 at 17:15
  • @GWW Isn't the question different as CT Zhu pointed out? Commented Oct 2, 2014 at 17:53
  • Finding Unique rows would essentially be the same thing as seeing if each row is unique. Commented Oct 2, 2014 at 19:00
  • @GWW I think the point is the answer in the linked question might be overkill for the testing problem. In other words there might be a simpler and faster solution to this problem. Commented Oct 3, 2014 at 15:17

1 Answer


You can calculate the correlation matrix and ask if only the diagonal elements are 1:

(np.corrcoef(M)==1).sum()==M.shape[0]

In [66]: M = np.random.random((5,8))

In [72]: (np.corrcoef(M)==1).sum()==M.shape[0]
Out[72]: True

Do this if you want a similar test for the columns:

(np.corrcoef(M, rowvar=0)==1).sum()==M.shape[1]

or without numpy at all:

len(set(map(tuple,M)))==len(M) 

Filtering out the unique rows and then testing whether the result has the same shape as M is overkill:

In [99]: %%timeit
b = np.ascontiguousarray(M).view(np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_M = M[idx]
unique_M.shape==M.shape

10000 loops, best of 3: 54.6 µs per loop

In [100]: %timeit len(set(map(tuple,M)))==len(M)
10000 loops, best of 3: 24.9 µs per loop
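For completeness, the void-view trick timed above can be wrapped in a small helper; a minimal sketch, assuming a 2d array (the name rows_are_unique is just illustrative):

import numpy as np

def rows_are_unique(M):
    # View each row as a single opaque void scalar so that np.unique
    # compares whole rows (byte-wise) rather than individual elements.
    b = np.ascontiguousarray(M).view(
        np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
    return len(np.unique(b)) == len(M)

M = np.array([[1, 2, 3], [1, 2, 4]])
rows_are_unique(M)   # True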

2 Comments

Thank you very much for this. It's surprising that a non-numpy way is the fastest. Doesn't it have to convert numpy array -> tuple -> set?
Pure python FTW! If there are many more rows than cols, you can try len(set(tuple(zip(*M.T)))) == len(M); it might be faster.
