2

Suppose you have two 2D arrays A and B, and you want to check, where a row of A is contained in B. How do you do this most efficiently using numpy?

E.g.

a = np.array([[1,2,3], [4,5,6], [9,10,11]]) b = np.array([[4,5,6], [4,3,2], [1,2,3], [4,8,9]]) map = [[0,2], [1,0]] # row 0 of a is at row index 2 of array B 

I know how to check if a row of A is in B using in1d (test for membership in a 2d numpy array), but this does not yield the indices map.

The purpose of this map is to (finally) merge the two arrays together based on some columns.
Of course one could do this row by row, but this gets very inefficient, since my arrays have the shape (50 Mio., 20).

An alternative would be to use the pandas merge function, but I'd like to do this using numpy only.

4
  • 1
    Are the numbers in the array ints? If so, are they positive? If so, do they have a range within which they might be limited to? Commented Apr 10, 2019 at 13:00
  • Unfortunately not, they are float64 without any limits. Commented Apr 10, 2019 at 14:02
  • Could there by duplicates (identical rows) within an array? Commented Apr 10, 2019 at 14:07
  • No, in my case not. Commented Apr 10, 2019 at 14:10

2 Answers 2

3

Approach #1

Here's one based on views. Makes use of np.argwhere (docs) to return the indices of an element that meet a condition, in this case, membership. -

def view1D(a, b): # a, b are arrays a = np.ascontiguousarray(a) b = np.ascontiguousarray(b) void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1])) return a.view(void_dt).ravel(), b.view(void_dt).ravel() def argwhere_nd(a,b): A,B = view1D(a,b) return np.argwhere(A[:,None] == B) 

Approach #2

Here's another that would be O(n) and hence much better on performance, especially on large arrays -

def argwhere_nd_searchsorted(a,b): A,B = view1D(a,b) sidxB = B.argsort() mask = np.isin(A,B) cm = A[mask] idx0 = np.flatnonzero(mask) idx1 = sidxB[np.searchsorted(B,cm, sorter=sidxB)] return idx0, idx1 # idx0 : indices in A, idx1 : indices in B 

Approach #3

Another O(n) one using argsort() -

def argwhere_nd_argsort(a,b): A,B = view1D(a,b) c = np.r_[A,B] idx = np.argsort(c,kind='mergesort') cs = c[idx] m0 = cs[:-1] == cs[1:] return idx[:-1][m0],idx[1:][m0]-len(A) 

Sample runs with same inputs as earlier -

In [650]: argwhere_nd_searchsorted(a,b) Out[650]: (array([0, 1]), array([2, 0])) In [651]: argwhere_nd_argsort(a,b) Out[651]: (array([0, 1]), array([2, 0])) 
Sign up to request clarification or add additional context in comments.

Comments

1

You can take advantage of the automatic broadcasting:

np.argwhere(np.all(a.reshape(3,1,-1) == b,2)) 

which results in

array([[0, 2], [1, 0]]) 

Note for floats you might want to replace the == with np.islclose()

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.