
I've tried to run this little piece of code: it takes random points (here 50k, close to what I have in reality) and, for each randomly selected point, returns the 10 nearest points.

But unfortunately this is really slow, no doubt because of the loop.

As I'm pretty new to code optimization, is there a trick to make this much faster? (Faster at Python scale, I know; I'm not coding in C++.)

Here is a reproducible example with data size close to what I have:

import time

import numpy as np
from numpy import random
from scipy.spatial import distance

# USEFUL FUNCTION
start_time = time.time()

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    # Squared Euclidean distance from `node` to every point in `nodes`.
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    ndx = dist_2.argsort()
    return data[ndx[:10]]

# REPRODUCIBLE DATA
mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, -0.5, 0.8],
                [-0.5, 1.1, 0.0],
                [0.8, 0.0, 1.0]])
data = np.random.multivariate_normal(mean, cov, 500000)

# START RUNNING
points = data[np.random.choice(data.shape[0], int(np.round(0.1 * len(data), 0)))]
print(len(points))

for w in points:
    closest_node(w, data)

print("--- %s seconds ---" % (time.time() - start_time))
  • Hi and welcome to SO! Nice first question. It is well contained and you have a working example. However, it is not clear to me exactly what you are asking. Is it simply why the code is slow? Commented Jul 6, 2021 at 11:24
  • Your code combines two algorithmic complexities: the first is the loop over w in points, which is obviously linear in len(points). But the second is dist_2.argsort, which is typically O(N log(N)) ... with N = len(data) = 500000 in your case; this is the expensive part! (See the timing sketch below.) Commented Jul 6, 2021 at 11:46
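
To get a rough sense of that cost, here is a minimal timing sketch comparing a full argsort of a single 500,000-element distance array with a partial selection of its 10 smallest entries via np.argpartition; exact numbers will vary by machine:

import timeit

import numpy as np

# One simulated squared-distance array of the same size as the data set.
dist_2 = np.random.rand(500_000)

# Full sort of all 500,000 entries versus partial selection of the 10 smallest.
t_sort = timeit.timeit(lambda: dist_2.argsort(), number=10)
t_part = timeit.timeit(lambda: np.argpartition(dist_2, 10), number=10)
print("argsort      :", t_sort, "s for 10 runs")
print("argpartition :", t_part, "s for 10 runs")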

2 Answers


The time it takes to run argsort on your 500,000-element array in every loop iteration is huge. The only improvement I can think of is to use something that returns the 10 smallest elements without fully sorting the whole array.

A fast way to find the largest N elements in a numpy array

So instead of

ndx = dist_2.argsort()
return data[ndx[:10]]

It would be

ndx = np.argpartition(dist_2, 10)[:10]
return data[ndx[:10]]

I only benchmarked on 500 points because it already took quite some time to run on my PC.

N = 500
Using argsort:      25.625439167022705 seconds
Using argpartition:  6.637120485305786 seconds
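
Putting the change back into the question's function, a minimal sketch of the modified loop (same data setup as the question; note that np.argpartition does not return the 10 neighbours in sorted order, so sort those 10 afterwards if the order matters):

import numpy as np

def closest_node(node, nodes):
    deltas = nodes - node
    # Squared Euclidean distance from `node` to every row of `nodes`.
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    # Indices of the 10 smallest distances, without sorting all 500,000 entries.
    ndx = np.argpartition(dist_2, 10)[:10]
    return nodes[ndx]

mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, -0.5, 0.8], [-0.5, 1.1, 0.0], [0.8, 0.0, 1.0]])
data = np.random.multivariate_normal(mean, cov, 500000)

points = data[np.random.choice(data.shape[0], 500)]  # 500 queries, as in the benchmark
neighbours = [closest_node(w, data) for w in points]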



You would probably be best off finding the slowest parts with a profiler: How do I find out what parts of my code are inefficient in Python
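
For example, a minimal profiling sketch using the standard library's cProfile and pstats modules; it assumes closest_node, points and data are defined as in the question, and the file name loop.prof is arbitrary:

import cProfile
import pstats

def run_queries():
    # The question's main loop, wrapped so it can be profiled as one call.
    for w in points:
        closest_node(w, data)

cProfile.run("run_queries()", "loop.prof")
# Show the ten calls with the largest cumulative time.
pstats.Stats("loop.prof").sort_stats("cumulative").print_stats(10)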

One thing that stands out at first glance is that you should move as much work as possible outside the loop. If you are going to convert the points via np.asarray(), it might be better to do it once for all points before the loop and use the result inside the function, rather than calling np.asarray() on each loop iteration, as sketched below.
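
A minimal sketch of that hoisting, assuming the question's data and points are already defined (as the comment below notes, the gain is small here because the input is already an ndarray):

import numpy as np

# Convert once, before the loop, instead of on every call inside closest_node.
nodes = np.asarray(data)

def closest_node(node, nodes):
    deltas = nodes - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    ndx = dist_2.argsort()
    return nodes[ndx[:10]]

for w in points:
    closest_node(w, nodes)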

1 Comment

np.asarray should be a no-op if its argument is already a numpy.ndarray, which seems to be the case here
