I am having some issues computing values from one csr_matrix and creating a new one from the results. To track the problem down, I wrote some simple code that just copies the original matrix, and the copy is not the same. I have tried it on very small matrices (as given in the documentation), but on a real-world matrix (about 2.5M entries, all of which are non-zero), the results are bizarre. Here is the test code:

    print type(X_ngrams)
    tst = csr_matrix((X_ngrams.data, X_ngrams.nonzero()))
    print "Original:"
    print "shape ", X_ngrams.shape
    r1, c1 = X_ngrams.nonzero()
    print "rows, cols", r1[:10], c1[:10]
    print "indptr ", X_ngrams.indptr[:10]
    print "indices ", X_ngrams.indices[:10]
    print "data[:10] ", X_ngrams.data[:10]
    # print
    print "Copy:"
    print "shape ", tst.shape
    r2, c2 = tst.nonzero()
    print "rows, cols", r2[:10], c2[:10]
    print "indptr ", tst.indptr[:10]
    print "indices ", tst.indices[:10]
    print "data[:10] ", tst.data[:10]

Here are the results:

    <class 'scipy.sparse.csr.csr_matrix'>
    Original:
    shape  (2257, 202262)
    rows, cols [0 0 0 0 0 0 0 0 0 0] [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
    indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
    indices  [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
    data[:10]  [ 2 1 23 1 35 1 11 1 8 1]
    Copy:
    shape  (2257, 202262)
    rows, cols [0 0 0 0 0 0 0 0 0 0] [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
    indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
    indices  [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
    data[:10]  [20 1 1 1 1 1 1 1 1 1]

Why is the structure of the copy different? The matrix I need to create should have exactly the same structure, just a different number in each position.
tst.data[:10] after the label "indptr ". X_ngrams is a csr_matrix?
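
On the stated goal (same structure, a different number in each position), here is a minimal sketch. It assumes the difference comes from the original matrix having unsorted column indices within each row, which is what the `indices` output in the question suggests. Building from `(data, (rows, cols))` goes through COO format and comes back with sorted indices, so the stored order changes even though the matrix values do not. Reusing `indices` and `indptr` directly keeps the layout. The small matrix `X` and the array `new_values` are hypothetical stand-ins for `X_ngrams` and the computed results.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for X_ngrams: row 0 stores columns 3 and 1,
# deliberately out of order, as in the question's output.
data = np.array([2, 1, 23, 5])
indices = np.array([3, 1, 0, 2])
indptr = np.array([0, 2, 4])
X = csr_matrix((data, indices, indptr), shape=(2, 4))

# The route used in the question: rebuild from coordinates.
# The result holds the same values, but its column indices are
# sorted within each row and the data array is reordered to match.
rows, cols = X.nonzero()
rebuilt = csr_matrix((X.data, (rows, cols)), shape=X.shape)

# Alternative: reuse the CSR structure arrays so the layout is kept
# exactly, and only the stored values change.
new_values = X.data * 10  # hypothetical: one new value per stored entry
same = csr_matrix((new_values, X.indices.copy(), X.indptr.copy()),
                  shape=X.shape)

# Quick check of whether the original is in sorted order at all.
originally_sorted = bool(X.has_sorted_indices)
```

If sorted indices are acceptable, calling `X.sort_indices()` on the original first should make both routes produce the same layout.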