I'm trying to write a function in Python (still a noob!) which returns indices and scores of documents ordered by the inner products of their tfidf scores. The procedure is:
- Compute the vector of inner products between doc idx and all other documents
- Sort in descending order
- Return the "scores" and indices from the second one to the end (i.e. not itself)
The code I have at the moment is:

    import h5py
    import numpy as np

    def get_related(tfidf, idx):
        ''' return the top documents '''
        # calculate inner product
        v = np.inner(tfidf, tfidf[idx].transpose())
        # sort
        vs = np.sort(v.toarray(), axis=0)[::-1]
        scores = vs[1:,]
        # sort indices
        vi = np.argsort(v.toarray(), axis=0)[::-1]
        idxs = vi[1:,]
        return (scores, idxs)

where tfidf is a sparse matrix of type '<type 'numpy.float64'>'.
This seems inefficient, as the sort is performed twice (sort() then argsort()), and the results have to then be reversed.
- Can this be done more efficiently?
- Can this be done without converting the sparse matrix using toarray()?
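
To make the two questions concrete, here is a minimal sketch of the kind of thing I have in mind, not a tested solution. It assumes tfidf is a scipy.sparse.csr_matrix with one row per document. The function name get_related_single_sort and the small demo matrix are made up for illustration. It calls argsort once and reuses the result to index the scores, and it only densifies the single column of inner products rather than the whole matrix.

```python
import numpy as np
from scipy import sparse


def get_related_single_sort(tfidf, idx):
    """Return (scores, idxs) for every document except idx, best match first.

    Assumes tfidf is a scipy.sparse.csr_matrix of shape (n_docs, n_terms).
    """
    # Sparse product of all rows with the query row: shape (n_docs, 1).
    v = tfidf.dot(tfidf[idx].T)
    # Only this one column is converted to a dense 1-D array.
    scores = v.toarray().ravel()
    # One argsort on the negated scores yields descending order directly.
    order = np.argsort(-scores, kind="stable")
    # Drop the query document explicitly rather than assuming it ranks first
    # (with unnormalised rows another document could score higher).
    order = order[order != idx]
    return scores[order], order


# Hypothetical 4-document, 3-term matrix, for illustration only.
demo = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]))
demo_scores, demo_idxs = get_related_single_sort(demo, 0)
```

The column of inner products still gets converted to a dense array, but it has only one entry per document, so the full tf-idf matrix stays sparse throughout.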