I have two numpy arrays, X and Y, with shapes (n,d) and (m,d), respectively. Assume that we want to compute the Euclidean distances between each row of X and each row of Y and store the result in array Z with shape (n,m). I have two implementations for this. The first implementation uses two for loops as follows:
```python
for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))
```

The second implementation uses only one loop, vectorizing over the rows of Y:
```python
for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i] - Y), axis=1))
```

When I run these on a particular X and Y, the first implementation takes nearly 30 seconds while the second takes nearly 60 seconds. I expected the second implementation to be faster since it is vectorized. What is the reason for its slowness? I know that a fully vectorized implementation is faster still, but I don't understand why the partially vectorized code is slower than the non-vectorized version.
Here is the complete code:
```python
import time
import numpy as np

n, m, d = 5000, 500, 3000
X = np.random.rand(n, d)
Y = np.random.rand(m, d)
Z = np.zeros((n, m))

tic = time.time()
for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))
print('Elapsed time 1: ', time.time()-tic)

tic = time.time()
for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i]-Y), axis=1))
print('Elapsed time 2: ', time.time()-tic)

tic = time.time()
train_squared = np.square(X).sum(axis=1).reshape((1, n))
test_squared = np.square(Y).sum(axis=1).reshape((m, 1))
test_train = -2*np.matmul(Y, X.T)
dists = np.sqrt(test_train + train_squared + test_squared)
print('Elapsed time 3: ', time.time()-tic)
```

And this is the output:
```
Elapsed time 1:  35.659096002578735
Elapsed time 2:  65.57051086425781
Elapsed time 3:  0.3912069797515869
```
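As an aside (a sanity check not in the original post), shrinking `n, m, d` shows that the loop version and the fully vectorized version agree numerically; note that the fully vectorized version as written produces an `(m, n)` array, the transpose of `Z`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 40, 30, 20
X = rng.random((n, d))
Y = rng.random((m, d))

# Double-loop reference implementation
Z = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        Z[i, j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))

# Fully vectorized version from the post:
# ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, expanded via broadcasting
train_squared = np.square(X).sum(axis=1).reshape((1, n))
test_squared = np.square(Y).sum(axis=1).reshape((m, 1))
test_train = -2 * np.matmul(Y, X.T)
dists = np.sqrt(test_train + train_squared + test_squared)

# dists has shape (m, n), so compare against Z transposed
print(np.allclose(Z, dists.T))  # True
```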
What are `n, m, d` in your testing?

`n = 5000, m = 500, d = 3000`.

In the first version, the inner expression `X[i] - Y[j]` only creates temporaries of shape `(d,)`; the second version's `X[i] - Y` creates temporaries of shape `(m, d)`, much bigger. I don't know what your fully vectorized solution looks like, so I don't know what size of temporaries that's producing.
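To put rough numbers on that (the cache-size remarks in the comments are typical figures, not measurements of your machine): with float64 data, the double-loop version's per-iteration temporary fits easily in cache, while the single-loop version's temporary is megabytes in size, so each of the n iterations streams a large array through memory:

```python
n, m, d = 5000, 500, 3000

# Double-loop version: each X[i] - Y[j] temporary has shape (d,)
inner_bytes = d * 8            # 24,000 bytes (~23 KiB) -- cache-friendly

# Single-loop version: each X[i] - Y temporary has shape (m, d)
outer_bytes = m * d * 8        # 12,000,000 bytes (~11 MiB) -- larger than most L2 caches

print(inner_bytes, outer_bytes)
```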