I've been running this little piece of code: it takes random points (here 50k, close to what I have in reality) and, for each randomly selected point, returns its 10 nearest points.
Unfortunately it is (really!) slow, no doubt because of the loop.
As I'm pretty new to code optimization, is there a trick to make this much faster? (Faster at Python scale, I know; I'm not coding in C++.)
Here is a reproducible example with a data size close to what I have:
```python
import time

import numpy as np

# USEFUL FUNCTION
start_time = time.time()

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    # squared Euclidean distance from `node` to every row of `nodes`
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    ndx = dist_2.argsort()
    return nodes[ndx[:10]]

# REPRODUCIBLE DATA
mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, -0.5, 0.8],
                [-0.5, 1.1, 0.0],
                [0.8, 0.0, 1.0]])
data = np.random.multivariate_normal(mean, cov, 500000)

# START RUNNING
points = data[np.random.choice(data.shape[0], int(np.round(0.1 * len(data), 0)))]
print(len(points))

for w in points:
    closest_node(w, data)

print("--- %s seconds ---" % (time.time() - start_time))
```
Two things dominate the cost. The first is the loop `for w in points`, which is obviously linear in `len(points)`. The second is `dist_2.argsort()`, which is typically O(N log N), and with N = len(data) = 500000 in your case it runs once per point; this is the expensive part!
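Since only the 10 smallest distances are needed, a full sort is wasted work. One fix (a sketch, not from the original post) is to swap `argsort` for `np.argpartition`, which runs in O(N) and only guarantees that the 10 smallest distances land in the first 10 slots:

```python
import numpy as np

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)
    # argpartition is O(N): the 10 smallest distances end up in the
    # first 10 positions, in no particular order
    ndx = np.argpartition(dist_2, 10)[:10]
    return nodes[ndx]
```

If the neighbours must come back ordered, sort just those 10 afterwards (e.g. `ndx[np.argsort(dist_2[ndx])]`), which costs O(10 log 10) rather than O(N log N). The other standard route, since every query runs against the same `data`, is to build a k-d tree once and answer all queries in a single call; a minimal sketch with `scipy.spatial.cKDTree`, reusing `data` and `points` from your example:

```python
from scipy.spatial import cKDTree

tree = cKDTree(data)                  # built once, O(N log N)
dist, idx = tree.query(points, k=10)  # all 50k queries at once
neighbors = data[idx]                 # shape (len(points), 10, 3)
```

Recent SciPy versions also accept `workers=-1` in `query` to spread the lookups across all cores.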