
First, thanks for reading and taking the time to respond.

Second, the question:

I have a PxN matrix X where P is in the order of 10^6 and N is in the order of 10^3. So, X is relatively large and is not sparse. Let's say each row of X is an N-dimensional sample. I want to construct a PxP matrix of pairwise distances between these P samples. Let's also say I am interested in Hellinger distances.
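For reference, the Hellinger distance between two discrete distributions p and q is (1/√2)·‖√p − √q‖₂. A minimal sketch of that definition for a single pair of rows (the values here are just illustrative):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions (1-D arrays):
    # Euclidean distance between the element-wise square roots, over sqrt(2)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
d = hellinger(p, q)
```

The 1/√2 factor normalizes the distance into [0, 1] for probability vectors.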

So far I am relying on sparse dok matrices:

```python
import math
import numpy as np
import scipy as sp
import scipy.sparse  # sp.sparse is not loaded automatically

def hellinger_distance(X):
    P = X.shape[0]
    H1 = sp.sparse.dok_matrix((P, P))
    for i in range(P):
        if i % 100 == 0:
            print(i)
        x1 = X[i]
        X2 = X[i:P]
        # Hellinger distance between row i and every row from i onward
        h = np.sqrt(((np.sqrt(x1) - np.sqrt(X2)) ** 2).sum(1)) / math.sqrt(2)
        H1[i, i:P] = h
    H = H1 + H1.T
    return H
```

This is super slow. Is there a more efficient way of doing this? Any help is much appreciated.

  • Do you really need to compute every pairwise distance? Your PxP matrix would contain 1E12 elements, so assuming you are dealing with double floats, you're looking at about 8000 GB just to store the output. If you just computed the upper triangle, that would be (P^2 - P) / 2 ~= 5E11 elements, which would be about 4000 GB assuming double floats. Even a binary matrix of that size would be about 500 GB. Commented Oct 7, 2015 at 18:23
  • You are correct. I learned that with dense matrices SciPy can compute the condensed form (upper triangle only) directly, but not with sparse matrices. Commented Aug 16, 2016 at 22:09
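The storage estimate in the comment above is simple arithmetic, assuming 8-byte double floats:

```python
P = 10 ** 6

# full P x P matrix of float64, in GB
full_gb = P * P * 8 / 1e9

# strict upper triangle only (no diagonal), in GB
upper_gb = (P * P - P) // 2 * 8 / 1e9
```

`full_gb` comes out at 8000 GB and `upper_gb` at roughly 4000 GB, matching the comment.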

2 Answers


You can use pdist and squareform from scipy.spatial.distance:

```python
from scipy.spatial.distance import pdist, squareform

# pairwise Euclidean distances on sqrt(X), scaled by 1/sqrt(2),
# expanded from condensed form to the full P x P Hellinger matrix
out = squareform(pdist(np.sqrt(X))) / np.sqrt(2)
```

Or use cdist from the same module:

```python
from scipy.spatial.distance import cdist

sX = np.sqrt(X)
out = cdist(sX, sX) / np.sqrt(2)
```
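A quick sanity check on a small random matrix (non-negative entries, as Hellinger requires) confirms both routes agree with the direct definition:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, cdist

rng = np.random.RandomState(0)
X = rng.rand(5, 4)  # tiny stand-in for the P x N data
sX = np.sqrt(X)

H1 = squareform(pdist(sX)) / np.sqrt(2)  # condensed -> square
H2 = cdist(sX, sX) / np.sqrt(2)          # square directly

# direct definition for one pair of rows
d01 = np.sqrt(((sX[0] - sX[1]) ** 2).sum()) / np.sqrt(2)
```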

2 Comments

Thanks for the quick and clean response. I am still running this and the computation has not finished yet. I will update as I get the results.
@JRun Awesome! Let me know how that goes! And the performance boost (hoping there is some!).

In addition to Divakar's response, I realized that there is an implementation of this in sklearn which allows parallel processing:

```python
import math
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

njobs = 3
# Euclidean distances on sqrt(X), scaled by 1/sqrt(2), split across 3 jobs
H = pairwise_distances(np.sqrt(X), n_jobs=njobs, metric='euclidean') / math.sqrt(2)
```

I will do some benchmarking and post the results later.
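One way to benchmark is to time the SciPy and scikit-learn routes on a small subsample before committing to the full data; this is a sketch, with the subsample size chosen arbitrarily:

```python
import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.rand(2000, 100)  # small stand-in for the real P x N data
sX = np.sqrt(X)

t0 = time.time()
H_scipy = squareform(pdist(sX)) / np.sqrt(2)
t_scipy = time.time() - t0

t0 = time.time()
H_sklearn = pairwise_distances(sX, metric='euclidean', n_jobs=2) / np.sqrt(2)
t_sklearn = time.time() - t0
```

Both results should match; whether the parallel version wins depends on the machine and the subsample size.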

1 Comment

Good to see alternatives, nice!
