
I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step I will implement multiprocessing, but the current speed is too slow even if I split it across 40 cores. Therefore I'm trying to optimize the code first.

import numpy as np

def loop(x, y, q, z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A, B)

# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X, Y, Q, Z)

I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).

I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.

Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't, e.g. index 11):

[Image: Example output of matchlist]

The code works fine so far, but it would take quite a long time to execute with my final dataset, where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29 s to execute on my laptop:

[Image: timeit test of loop()]

Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.

[Image: timeit test of find_match()]

A first approach would be to combine x with y (and q with z) so that I only need to run np.where() once, but I couldn't get that to work yet.
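For illustration, here is one possible sketch of that "combine first" idea: join each (x, y) and each (q, z) pair into a single key, so one equality test replaces the two np.where() calls plus the intersect. The '|' separator is my own assumption (it cannot occur in the digit/DNA strings here); this is not the asker's working code.

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# combine each pair into one key; '|' is assumed never to occur in the data
XY = np.char.add(np.char.add(X, '|'), Y)   # combined query keys
QZ = np.char.add(np.char.add(Q, '|'), Z)   # combined reference keys

# one np.where() per row instead of two plus np.intersect1d
matchlist = [np.where(QZ == key)[0] for key in XY]
```

Note that this roughly halves the per-row work but keeps the O(N x 120000) scan structure, so it alone would not reach a 4x speedup.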

I've tried looping and looking up within Pandas (.loc), but this was about 4x slower than np.where().

The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.

  • how are you using matchlist in follow-on processes? As indices? len? Just a record? Commented Jan 25, 2021 at 15:30
  • Basically, would a boolean array with the same data (i.e. True at those indices) be equivalent? Commented Jan 25, 2021 at 15:36
  • The indices of matches for each row would be nice to have. But I could also live with a "True" (= there is at least one match) or "False" (= no match) if that considerably increases speed. Commented Jan 25, 2021 at 16:41

2 Answers


Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the keys.

lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
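Since the comments note that a plain True/False per row may be enough, a set-based variant of the same idea is even simpler. This sketch (using the question's random test data, not the answer's own code) only records whether any reference pair matches:

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# build a set of all reference (q, z) pairs, then test membership per row
pair_set = set(zip(Q, Z))
bool_matchlist = [(x, y) in pair_set for x, y in zip(X, Y)]
```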

If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True    # note: Q, Z (and X, Y below) must be integer arrays
matchlist = lookup[X, Y]
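The string-to-integer preprocessing isn't spelled out in the answer; one possible way (my sketch, not the answerer's) is to map each column's strings to dense integer IDs with np.unique(return_inverse=True):

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# map strings to dense integer ids; X/Q share one id space, Y/Z the other
ux, inv = np.unique(np.concatenate([X, Q]), return_inverse=True)
Xi, Qi = inv[:len(X)], inv[len(X):]
uy, inv = np.unique(np.concatenate([Y, Z]), return_inverse=True)
Yi, Zi = inv[:len(Y)], inv[len(Y):]

lookup = np.zeros((len(ux), len(uy)), dtype=bool)
lookup[Qi, Zi] = True
matchlist = lookup[Xi, Yi]   # True where row (x, y) occurs somewhere in (Q, Z)
```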

You typically won't want to use this method to replace the former jagged case, as dense variants (e.g. Daniel F's solution) will be memory-inefficient, and numpy does not support sparse arrays well. However, if more speed is needed, then a sparse solution is certainly possible.
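The sparse solution isn't shown in the answer; one possible sketch uses scipy.sparse (assuming integer-converted data and that SciPy is available, neither of which the answer states):

```python
import numpy as np
from scipy import sparse

M = 300
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
X = np.random.randint(M, size=1000)
Y = np.random.randint(M, size=1000)

# CSR matrix with a nonzero wherever some (Q[i], Z[i]) pair exists;
# duplicate pairs accumulate into counts, hence the > 0 test below
lookup = sparse.csr_matrix(
    (np.ones(len(Q), dtype=np.int32), (Q, Z)), shape=(M, M)
)
matchlist = np.asarray(lookup[X, Y] > 0).ravel()
```

This stores only the occupied (q, z) cells, which matters once the value ranges grow far beyond 300.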


9 Comments

Your first suggestion is 20x faster than my solution! (115 ms vs. 2.29s) Thanks a lot, I think this should be already sufficient for my needs. Note to the second approach: I could not get this to work in my script since it throws out the error "lookup[q, z] = True IndexError: arrays used as indices must be of integer (or boolean) type"...
@crisprog Yes, the second solution requires converting the strings to unique integer IDs.
I have now implemented your (dictionary) solution on my original dataset and it scales up incredibly well. The script now finishes within minutes instead of days!
I'm currently trying to further increase the performance by making use of multiprocessing (e.g. splitting up my lists first and distribute the workload on several cpus). If you know an approach to do that it would be nice to hear it. :) stackoverflow.com/questions/66286370/…
@crisprog Don't bother with multiple threads. If you need more speed, use a more efficient format. You're making a huge array when you almost certainly don't just want a huge pile of data. Do something that makes more sense holistically.

You only have 300*300 = 90000 unique answers. Pre-compute.

Q_ = np.arange(300)[:, None] == Q
Z_ = np.arange(300)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)

lookup.shape
Out[]: (300, 300, 120000)

Then the result is just:

out = lookup[X, Y] 

If you really want the indices you can do:

i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0])) + 1)

You'll have to parallelize by chunking with this method, since a boolean array of shape (1000000000, 120000) will throw a MemoryError.
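The chunking itself isn't shown; a possible sketch (with sizes reduced from the question's so it runs quickly, and reducing each chunk to a per-row "any match" boolean, which is my own simplification):

```python
import numpy as np

# reduced sizes for the demo: M=30 instead of 300, R=1200 instead of 120000
M = 30
R = 1200
N = 5000
chunk = 1000
rng = np.random.default_rng(0)
Q = rng.integers(M, size=R)
Z = rng.integers(M, size=R)
X = rng.integers(M, size=N)
Y = rng.integers(M, size=N)

Q_ = np.arange(M)[:, None] == Q
Z_ = np.arange(M)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)   # shape (M, M, R)

# process X/Y in slices so the (N, R) boolean result never exists all at once
any_match = np.concatenate(
    [lookup[X[s:s + chunk], Y[s:s + chunk]].any(axis=-1)
     for s in range(0, N, chunk)]
)
```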

1 Comment

Somehow your solution gives me the error "lookup = np.logical_and(Q_[:, None, :], Z_) TypeError: 'bool' object is not subscriptable"?
