
First, thanks for reading and taking the time to respond.

Second, the question:

I have a PxN matrix X where P is in the order of 10^6 and N is in the order of 10^3. So, X is relatively large and is not sparse. Let's say each row of X is an N-dimensional sample. I want to construct a PxP matrix of pairwise distances between these P samples. Let's also say I am interested in Hellinger distances.
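For reference, the Hellinger distance between two discrete distributions p and q is (1/√2)·‖√p − √q‖₂. A minimal sketch of that definition for a single pair of rows (the values here are just illustrative):

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions (1-D arrays):
    # Euclidean distance between the element-wise square roots, over sqrt(2)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
d = hellinger(p, q)
```

The 1/√2 factor normalizes the distance into [0, 1] for probability vectors.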

So far I am relying on sparse dok matrices:

```python
import math
import numpy as np
import scipy as sp
import scipy.sparse  # sp.sparse is not loaded automatically

def hellinger_distance(X):
    P = X.shape[0]
    H1 = sp.sparse.dok_matrix((P, P))
    for i in range(P):
        if i % 100 == 0:
            print(i)
        x1 = X[i]
        X2 = X[i:P]
        # Hellinger distance between row i and every row from i onward
        h = np.sqrt(((np.sqrt(x1) - np.sqrt(X2)) ** 2).sum(1)) / math.sqrt(2)
        H1[i, i:P] = h
    H = H1 + H1.T
    return H
```

This is super slow. Is there a more efficient way of doing this? Any help is much appreciated.

  • Do you really need to compute every pairwise distance? Your PxP matrix would contain 1E12 elements, so assuming you are dealing with double floats, you're looking at about 8000 GB just to store the output. If you just computed the upper triangle, that would be (P^2 - P) / 2 ~= 5E11 elements, which would be about 4000 GB assuming double floats. Even a binary matrix of that size would be about 500 GB. Commented Oct 7, 2015 at 18:23
  • You are correct. I learned that with dense matrices SciPy can compute the condensed form (upper triangle only) directly, but not with sparse matrices. Commented Aug 16, 2016 at 22:09
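The storage estimate in the comment above is simple arithmetic, assuming 8-byte double floats:

```python
P = 10 ** 6

# full P x P matrix of float64, in GB
full_gb = P * P * 8 / 1e9

# strict upper triangle only (no diagonal), in GB
upper_gb = (P * P - P) // 2 * 8 / 1e9
```

`full_gb` comes out at 8000 GB and `upper_gb` at roughly 4000 GB, matching the comment.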

2 Answers


You can use pdist and squareform from scipy.spatial.distance:

```python
from scipy.spatial.distance import pdist, squareform

# pairwise Euclidean distances on sqrt(X), scaled by 1/sqrt(2),
# expanded from condensed form to the full P x P Hellinger matrix
out = squareform(pdist(np.sqrt(X))) / np.sqrt(2)
```

Or use cdist from the same module:

```python
from scipy.spatial.distance import cdist

sX = np.sqrt(X)
out = cdist(sX, sX) / np.sqrt(2)
```
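A quick sanity check on a small random matrix (non-negative entries, as Hellinger requires) confirms both routes agree with the direct definition:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, cdist

rng = np.random.RandomState(0)
X = rng.rand(5, 4)  # tiny stand-in for the P x N data
sX = np.sqrt(X)

H1 = squareform(pdist(sX)) / np.sqrt(2)  # condensed -> square
H2 = cdist(sX, sX) / np.sqrt(2)          # square directly

# direct definition for one pair of rows
d01 = np.sqrt(((sX[0] - sX[1]) ** 2).sum()) / np.sqrt(2)
```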

2 Comments

Thanks for the quick and clean response. I am still running this and the computation has not finished yet. I will update as I get the results.
@JRun Awesome! Let me know how that goes! And the performance boost (hoping there is some!).

In addition to Divakar's response, I realized that there is an implementation of this in sklearn which allows parallel processing:

```python
import math
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

njobs = 3
# Euclidean distances on sqrt(X), scaled by 1/sqrt(2), split across 3 jobs
H = pairwise_distances(np.sqrt(X), n_jobs=njobs, metric='euclidean') / math.sqrt(2)
```

I will do some benchmarking and post the results later.
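One way to benchmark is to time the SciPy and scikit-learn routes on a small subsample before committing to the full data; this is a sketch, with the subsample size chosen arbitrarily:

```python
import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.rand(2000, 100)  # small stand-in for the real P x N data
sX = np.sqrt(X)

t0 = time.time()
H_scipy = squareform(pdist(sX)) / np.sqrt(2)
t_scipy = time.time() - t0

t0 = time.time()
H_sklearn = pairwise_distances(sX, metric='euclidean', n_jobs=2) / np.sqrt(2)
t_sklearn = time.time() - t0
```

Both results should match; whether the parallel version wins depends on the machine and the subsample size.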

1 Comment

Good to see alternatives, nice!
