
In numpy, is there a nice idiomatic way of testing if all rows are distinct in a 2d array?

I thought I could do

len(np.unique(arr)) == len(arr) 

but this doesn't work at all. For example,

arr = np.array([[1,2,3],[1,2,4]])

np.unique(arr)
Out[4]: array([1, 2, 3, 4])
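For reference, newer NumPy releases (1.13 and up) add an axis argument to np.unique, so a row-wise version of the same test should work; a minimal sketch, assuming NumPy >= 1.13:

import numpy as np

arr = np.array([[1, 2, 3], [1, 2, 4]])
# axis=0 makes np.unique treat each row as one unit instead of flattening
len(np.unique(arr, axis=0)) == len(arr)   # True: both rows are distinct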
  • Note: stackoverflow.com/questions/16970982/… is about FINDING the unique row, OP is about TESTING if the rows are all unique. Different questions. Commented Oct 2, 2014 at 17:10
  • Several interesting answers to how to drop nonunique rows/columns: mail.scipy.org/pipermail/scipy-user/2011-December/031193.html. You can then just see if the reduced array is the same as the original. If you use pandas, there is an efficient implementation to do such a thing. Commented Oct 2, 2014 at 17:15
  • @GWW Isn't the question different as CT Zhu pointed out? Commented Oct 2, 2014 at 17:53
  • Finding Unique rows would essentially be the same thing as seeing if each row is unique. Commented Oct 2, 2014 at 19:00
  • @GWW I think the point is the answer in the linked question might be overkill for the testing problem. In other words there might be a simpler and faster solution to this problem. Commented Oct 3, 2014 at 15:17

1 Answer


You can calculate the correlation matrix and ask if only the diagonal elements are 1:

(np.corrcoef(M)==1).sum()==M.shape[0]

In [66]: M = np.random.random((5,8))

In [72]: (np.corrcoef(M)==1).sum()==M.shape[0]
Out[72]: True

Do this if you want a similar test for the columns:

(np.corrcoef(M, rowvar=0)==1).sum()==M.shape[1]

or without numpy at all:

len(set(map(tuple,M)))==len(M) 

Filtering out the unique rows and then testing whether the result has the same shape as M is overkill:

In [99]: %%timeit
b = np.ascontiguousarray(M).view(np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_M = M[idx]
unique_M.shape==M.shape

10000 loops, best of 3: 54.6 µs per loop

In [100]: %timeit len(set(map(tuple,M)))==len(M)
10000 loops, best of 3: 24.9 µs per loop
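For completeness, the void-view trick timed above can be wrapped in a small helper; a minimal sketch, assuming a 2d array (the name rows_are_unique is just illustrative):

import numpy as np

def rows_are_unique(M):
    # View each row as a single opaque void scalar so that np.unique
    # compares whole rows (byte-wise) rather than individual elements.
    b = np.ascontiguousarray(M).view(
        np.dtype((np.void, M.dtype.itemsize * M.shape[1])))
    return len(np.unique(b)) == len(M)

M = np.array([[1, 2, 3], [1, 2, 4]])
rows_are_unique(M)   # True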

2 Comments

Thank you very much for this. It's surprising that a non-numpy way is the fastest. Doesn't it have to convert numpy array -> tuple -> set?
Pure python FTW! If there are many more rows than cols, you can try len(set(tuple(zip(*M.T)))) == len(M); it might be faster.
