
I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step I will implement multiprocessing, but the current speed is too slow even if I split it across 40 cores. Therefore I'm trying to optimize the code first.

import numpy as np

def loop(x, y, q, z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A, B)

# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X, Y, Q, Z)

I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).

I also have two 'reference arrays' (Q and Z) which are identical in length and I want to find the occurrence (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.

Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't, e.g. index 11):

[Image: Example output of matchlist]

The code works fine so far, but it would take quite a long time to execute with my final dataset, where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y need about 2.29 s to execute on my laptop:

[Image: timeit test of loop()]

Every find_match() takes about 2.48 ms to execute which is roughly 1/1000 of the final loop.

[Image: timeit test of find_match()]

A first approach would be to combine x with y (and q with z) so that I only need to run np.where() once, but I couldn't get that to work yet.
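For illustration, here is one possible sketch of that "combine first" idea: join each (x, y) and each (q, z) pair into a single key, so one equality test replaces the two np.where() calls plus the intersect. The '|' separator is my own assumption (it cannot occur in the digit/DNA strings here); this is not the asker's working code.

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# combine each pair into one key; '|' is assumed never to occur in the data
XY = np.char.add(np.char.add(X, '|'), Y)   # combined query keys
QZ = np.char.add(np.char.add(Q, '|'), Z)   # combined reference keys

# one np.where() per row instead of two plus np.intersect1d
matchlist = [np.where(QZ == key)[0] for key in XY]
```

Note that this roughly halves the per-row work but keeps the O(N x 120000) scan structure, so it alone would not reach a 4x speedup.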

I've tried looping and looking up within Pandas (.loc), but this was about 4x slower than np.where().

The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.

  • how are you using matchlist in follow-on processes? As indices? len? Just a record? Commented Jan 25, 2021 at 15:30
  • Basically, would a boolean array with the same data (i.e. True at those indices) be equivalent? Commented Jan 25, 2021 at 15:36
  • The indices of matches for each row would be nice to have. But I could also live with a "True" (= there is at least one match) or "False" (= no match) if that considerably increases speed. Commented Jan 25, 2021 at 16:41

2 Answers


Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the keys.

lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
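Since the comments note that a plain True/False per row may be enough, a set-based variant of the same idea is even simpler. This sketch (using the question's random test data, not the answer's own code) only records whether any reference pair matches:

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# build a set of all reference (q, z) pairs, then test membership per row
pair_set = set(zip(Q, Z))
bool_matchlist = [(x, y) in pair_set for x, y in zip(X, Y)]
```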

If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True    # note: Q, Z (and X, Y below) must be integer arrays
matchlist = lookup[X, Y]
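The string-to-integer preprocessing isn't spelled out in the answer; one possible way (my sketch, not the answerer's) is to map each column's strings to dense integer IDs with np.unique(return_inverse=True):

```python
import numpy as np

N = 1000
M = 300
X = np.char.mod('%d', np.random.randint(M, size=N))
Y = np.char.mod('%d', np.random.randint(M, size=N))
Q = np.char.mod('%d', np.random.randint(M, size=120000))
Z = np.char.mod('%d', np.random.randint(M, size=120000))

# map strings to dense integer ids; X/Q share one id space, Y/Z the other
ux, inv = np.unique(np.concatenate([X, Q]), return_inverse=True)
Xi, Qi = inv[:len(X)], inv[len(X):]
uy, inv = np.unique(np.concatenate([Y, Z]), return_inverse=True)
Yi, Zi = inv[:len(Y)], inv[len(Y):]

lookup = np.zeros((len(ux), len(uy)), dtype=bool)
lookup[Qi, Zi] = True
matchlist = lookup[Xi, Yi]   # True where row (x, y) occurs somewhere in (Q, Z)
```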

You typically won't want to use this method to replace the former jagged case, as dense variants (e.g. Daniel F's solution) will be memory-inefficient, and numpy does not support sparse arrays well. However, if more speed is needed, then a sparse solution is certainly possible.
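The sparse solution isn't shown in the answer; one possible sketch uses scipy.sparse (assuming integer-converted data and that SciPy is available, neither of which the answer states):

```python
import numpy as np
from scipy import sparse

M = 300
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)
X = np.random.randint(M, size=1000)
Y = np.random.randint(M, size=1000)

# CSR matrix with a nonzero wherever some (Q[i], Z[i]) pair exists;
# duplicate pairs accumulate into counts, hence the > 0 test below
lookup = sparse.csr_matrix(
    (np.ones(len(Q), dtype=np.int32), (Q, Z)), shape=(M, M)
)
matchlist = np.asarray(lookup[X, Y] > 0).ravel()
```

This stores only the occupied (q, z) cells, which matters once the value ranges grow far beyond 300.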


9 Comments

Your first suggestion is 20x faster than my solution! (115 ms vs. 2.29s) Thanks a lot, I think this should be already sufficient for my needs. Note to the second approach: I could not get this to work in my script since it throws out the error "lookup[q, z] = True IndexError: arrays used as indices must be of integer (or boolean) type"...
@crisprog Yes, the second solution requires converting the strings to unique integer IDs.
I have now implemented your (dictionary) solution on my original dataset and it scales up incredibly well. The script now finishes within minutes instead of days!
I'm currently trying to further increase the performance by making use of multiprocessing (e.g. splitting up my lists first and distribute the workload on several cpus). If you know an approach to do that it would be nice to hear it. :) stackoverflow.com/questions/66286370/…
@crisprog Don't bother with multiple threads. If you need more speed, use a more efficient format. You're making a huge array when you almost certainly don't just want a huge pile of data. Do something that makes more sense holistically.

You only have 300*300 = 90000 unique answers. Pre-compute.

Q_ = np.arange(300)[:, None] == Q
Z_ = np.arange(300)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)

lookup.shape
Out[]: (300, 300, 120000)

Then the result is just:

out = lookup[X, Y] 

If you really want the indices you can do:

i = np.where(out)
out2 = np.split(i[1], np.flatnonzero(np.diff(i[0])) + 1)

You'll have to parallelize by chunking with this method, since a boolean array of shape (1000000000, 120000) will throw a MemoryError.
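The chunking itself isn't shown; a possible sketch (with sizes reduced from the question's so it runs quickly, and reducing each chunk to a per-row "any match" boolean, which is my own simplification):

```python
import numpy as np

# reduced sizes for the demo: M=30 instead of 300, R=1200 instead of 120000
M = 30
R = 1200
N = 5000
chunk = 1000
rng = np.random.default_rng(0)
Q = rng.integers(M, size=R)
Z = rng.integers(M, size=R)
X = rng.integers(M, size=N)
Y = rng.integers(M, size=N)

Q_ = np.arange(M)[:, None] == Q
Z_ = np.arange(M)[:, None] == Z
lookup = np.logical_and(Q_[:, None, :], Z_)   # shape (M, M, R)

# process X/Y in slices so the (N, R) boolean result never exists all at once
any_match = np.concatenate(
    [lookup[X[s:s + chunk], Y[s:s + chunk]].any(axis=-1)
     for s in range(0, N, chunk)]
)
```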

1 Comment

Somehow your solution gives me the error "lookup = np.logical_and(Q_[:, None, :], Z_) TypeError: 'bool' object is not subscriptable"?
