Reverse sort and argsort in python

Question

I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:

Compute vector of inner products between doc idx and all other documents
Sort in descending order
Return the "scores" and indices from the second one to the end (i.e. not itself)

The code I have at the moment is:

import h5py import numpy as np def get_related(tfidf, idx) : ''' return the top documents ''' # calculate inner product v = np.inner(tfidf, tfidf[idx].transpose()) # sort vs = np.sort(v.toarray(), axis=0)[::-1] scores = vs[1:,] # sort indices vi = np.argsort(v.toarray(), axis=0)[::-1] idxs = vi[1:,] return (scores, idxs)

where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.

This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.

Can this be done more efficiently?
Can this be done without converting the sparse matrix using toarray()?

Fred Foo · Accepted Answer · 2012-09-28 08:22:31Z

I don't think there's any real need to skip the toarray. The v array will be only n_docs long, which is dwarfed by the size of the n_docs × n_terms tf-idf matrix in practical situations. Also, it will be quite dense since any term shared by two documents will give them a non-zero similarity. Sparse matrix representations only pay off when the matrix you're storing is very sparse (I've seen >80% figures for Matlab and assume that Scipy will be similar, though I don't have an exact figure).

The double sort can be skipped by doing

v = v.toarray() vi = np.argsort(v, axis=0)[::-1] vs = v[vi]

Btw., your use of np.inner on sparse matrices is not going to work with the latest versions of NumPy; the safe way of taking an inner product of two sparse matrices is

v = (tfidf * tfidf[idx, :]).transpose()

Thanks for the swift response. Just wondering, do you know how the toarray() function works - I take it that it doesn't make a copy of the data
@tdc: it does make a copy. And it fills in the zero positions.
@tdc: I just realised that there's one more important optimization to make: you should be using CSR sparse matrices. In any other representation, the inner product computation will be suboptimal.
1) can I do things like sorting without making a copy? 2) how expensive is the translation from csc to csr?
1) Not that I know. 2) Very cheap. I believe it's just a matter of rearranging some indices, without the data being actually copied.

Collectives™ on Stack Overflow

Reverse sort and argsort in python

1 Answer 1

6 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Related