
I'm using Python 2.7. I have two arrays, A and B. To find the indices of the elements in A that are present in B, I can do

A_inds = np.in1d(A,B) 

I also want to get the indices of the elements in B that are present in A, i.e. the indices in B of the same overlapping elements I found using the above code.

Currently I just run the same operation again with the arguments swapped:

B_inds = np.in1d(B,A) 

but this second pass seems redundant. Is there a more computationally efficient way of obtaining both A_inds and B_inds?

I am open to using either list or array methods.
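To be precise, np.in1d returns a boolean mask rather than integer positions; if the actual positions are needed, they can be recovered from the mask, for example (variable name just for illustration):

A_positions = np.flatnonzero(A_inds)  # integer indices into A of elements also in B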

  • What are the input array sizes? Are they 1D? Commented Sep 18, 2015 at 14:15
  • Large. Of the order of 10^6 or 10^7. Commented Sep 18, 2015 at 14:16
  • Do those arrays have unique elements? Are they sorted? Commented Sep 18, 2015 at 14:28
  • Unfortunately, no. There are a number of duplicate elements - about 5-10% of the array. And yes, they are 1D. Commented Sep 18, 2015 at 14:42
  • The elements aren't strictly sorted. In fact, they are tuples. Perhaps I should have mentioned that earlier. Commented Sep 18, 2015 at 14:51

2 Answers


np.unique and np.searchsorted could be used together to solve it -

def unq_searchsorted(A,B):
    # Get unique elements of A and B and the indices based on the uniqueness
    unqA,idx1 = np.unique(A,return_inverse=True)
    unqB,idx2 = np.unique(B,return_inverse=True)

    # Create mask equivalent to np.in1d(A,B) and np.in1d(B,A) for unique elements
    mask1 = (np.searchsorted(unqB,unqA,'right') - np.searchsorted(unqB,unqA,'left'))==1
    mask2 = (np.searchsorted(unqA,unqB,'right') - np.searchsorted(unqA,unqB,'left'))==1

    # Map back to all non-unique indices to get equivalent of np.in1d(A,B),
    # np.in1d(B,A) results for non-unique elements
    return mask1[idx1],mask2[idx2]
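The searchsorted trick works because np.searchsorted(x, v, 'right') - np.searchsorted(x, v, 'left') counts how many times each value of v occurs in the sorted array x; since unqA and unqB contain only unique values, that count is either 0 or 1, so comparing it to 1 is a membership test. A minimal usage sketch (the sample arrays and the flatnonzero step are just for illustration):

import numpy as np

A = np.array([3, 1, 4, 1, 5, 9, 2, 6])
B = np.array([5, 3, 5, 8, 9, 7])

maskA, maskB = unq_searchsorted(A, B)   # boolean masks, same lengths as A and B
A_inds = np.flatnonzero(maskA)          # positions in A of elements also present in B
B_inds = np.flatnonzero(maskB)          # positions in B of elements also present in A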

Runtime tests and verification of results -

In [233]: def org_app(A,B):
     ...:     return np.in1d(A,B), np.in1d(B,A)
     ...:

In [234]: A = np.random.randint(0,10000,(10000))
     ...: B = np.random.randint(0,10000,(10000))
     ...:

In [235]: np.allclose(org_app(A,B)[0],unq_searchsorted(A,B)[0])
Out[235]: True

In [236]: np.allclose(org_app(A,B)[1],unq_searchsorted(A,B)[1])
Out[236]: True

In [237]: %timeit org_app(A,B)
100 loops, best of 3: 7.69 ms per loop

In [238]: %timeit unq_searchsorted(A,B)
100 loops, best of 3: 5.56 ms per loop

If the two input arrays are already sorted and unique, the performance boost would be substantial, since the np.unique calls (which sort internally) can be skipped entirely. The solution function then simplifies to -

def unq_searchsorted_v1(A,B):
    out1 = (np.searchsorted(B,A,'right') - np.searchsorted(B,A,'left'))==1
    out2 = (np.searchsorted(A,B,'right') - np.searchsorted(A,B,'left'))==1
    return out1,out2

Subsequent runtime tests -

In [275]: A = np.random.randint(0,100000,(20000))
     ...: B = np.random.randint(0,100000,(20000))
     ...: A = np.unique(A)
     ...: B = np.unique(B)
     ...:

In [276]: np.allclose(org_app(A,B)[0],unq_searchsorted_v1(A,B)[0])
Out[276]: True

In [277]: np.allclose(org_app(A,B)[1],unq_searchsorted_v1(A,B)[1])
Out[277]: True

In [278]: %timeit org_app(A,B)
100 loops, best of 3: 8.83 ms per loop

In [279]: %timeit unq_searchsorted_v1(A,B)
100 loops, best of 3: 4.94 ms per loop

2 Comments

Could this be expanded to 3 arrays? (or n arrays, even?)
@hm8 I think a new question would be better suited, as it doesn't look like an easy extension.

A simple multiprocessing implementation will get you a little more speed:

import time
import numpy as np
from multiprocessing import Process, Queue

a = np.random.randint(0, 20, 1000000)
b = np.random.randint(0, 20, 1000000)

def original(a, b, q):
    q.put(np.in1d(a, b))

if __name__ == '__main__':
    t0 = time.time()
    q = Queue()
    q2 = Queue()
    p = Process(target=original, args=(a, b, q,))
    p2 = Process(target=original, args=(b, a, q2))
    p.start()
    p2.start()
    res = q.get()
    res2 = q2.get()
    print time.time() - t0

>>> 0.21398806572

Divakar's unq_searchsorted(A,B) method took 0.271834135056 seconds on my machine.
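A multiprocessing.Pool variant of the same idea might look like this (a sketch only; it assumes the helper is defined at module level so it can be pickled, and reuses the a and b arrays from above):

from multiprocessing import Pool

def in1d_pair(args):
    x, y = args
    return np.in1d(x, y)

if __name__ == '__main__':
    pool = Pool(processes=2)
    # compute np.in1d(a, b) and np.in1d(b, a) in parallel worker processes
    res, res2 = pool.map(in1d_pair, [(a, b), (b, a)])
    pool.close()
    pool.join()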

1 Comment

Thank you for this - it will certainly be useful. For now, though, I am looking for the fastest method on a single core, because I will be distributing the whole code over several cores later on.
