
I am having some issues calculating from one csr_matrix and creating a new one from the results. To track it down, I wrote some simple code that copies the original matrix, and the copy is not the same. I have tried it on very small matrices (as given in the documentation), but on a real-world matrix (about 2.5M entries, all of which are non-zero), the results are bizarre. Here is the test code:

print type(X_ngrams)
tst = csr_matrix( (X_ngrams.data,X_ngrams.nonzero()))
print "Original:"
print "shape ", X_ngrams.shape
r1,c1=X_ngrams.nonzero()
print "rows, cols", r1[:10],c1[:10]
print "indptr ", X_ngrams.indptr[:10]
print "indices ", X_ngrams.indices[:10]
print "data[:10] ", X_ngrams.data[:10]
#
print
print "Copy:"
print "shape ", tst.shape
r2,c2=tst.nonzero()
print "rows, cols", r2[:10],c2[:10]
print "indptr ", tst.indptr[:10]
print "indices ", tst.indices[:10]
print "data[:10] ", tst.data[:10]

Here are the results:

<class 'scipy.sparse.csr.csr_matrix'>
Original:
shape  (2257, 202262)
rows, cols [0 0 0 0 0 0 0 0 0 0] [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
indices  [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
data[:10]  [ 2 1 23 1 35 1 11 1 8 1]

Copy:
shape  (2257, 202262)
rows, cols [0 0 0 0 0 0 0 0 0 0] [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
indices  [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
data[:10]  [20 1 1 1 1 1 1 1 1 1]

Why is the structure of the copy different? The matrix I need to create should have exactly the same structure, just a different number in each position.

  • Did you notice that you mislabeled your output of the copy? You print tst.data[:10] after the label "indptr ". Commented Jul 23, 2015 at 3:38
  • Since we don't have the full code, I have to ask: are you sure X_ngrams is a csr_matrix? Commented Jul 23, 2015 at 3:40
  • Warren, thanks for catching the typos. I have updated the code. I also added the type of X_ngrams to show that it is, in fact, a csr_matrix Commented Jul 23, 2015 at 21:50

1 Answer

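I can't replicate your problem with the data you give, but I suspect the problem is that the indices of X_ngrams are not sorted, while those of the copy are. The sorting happened when the copy was built: constructing a csr_matrix from the (data, (row, col)) form goes through COO format, and the conversion back to CSR sorts the column indices within each row. The data array is reordered along with them, so the two matrices can hold the same values even though their internal arrays differ.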

Compare the 2 indices. Both are a small sample of the 500+ values in the first row:

indices [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
indices [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]

The values in the 2nd list are smaller, and they are sorted. What's the value of X_ngrams.has_sorted_indices?

What you really need to compare is nonzero of both.

One solution is to sort X_ngrams first

 X_ngrams.sort_indices() # sort in place 
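
To make the effect concrete, here is a minimal sketch on a made-up 2x4 matrix (the values and the names X and rebuilt are illustrative, not taken from the question's data). It is written in Python 3 syntax, unlike the question's code, and assumes a reasonably recent SciPy. It builds a CSR matrix whose first row stores its columns out of order, rebuilds it from nonzero() the way the question does, and then sorts the original in place:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up example: row 0 stores columns 3, 0, 2 (unsorted); row 1 stores column 1.
data = np.array([10, 20, 30, 40])
indices = np.array([3, 0, 2, 1])
indptr = np.array([0, 3, 4])
X = csr_matrix((data, indices, indptr), shape=(2, 4))

sorted_before = bool(X.has_sorted_indices)

# Rebuild the matrix the same way the question does.
rows, cols = X.nonzero()
rebuilt = csr_matrix((X.data, (rows, cols)), shape=X.shape)

print("original indices:", X.indices, "data:", X.data)
print("rebuilt  indices:", rebuilt.indices, "data:", rebuilt.data)

# The internal arrays differ, but the matrices hold the same values.
same_values = np.array_equal(X.toarray(), rebuilt.toarray())

# After sorting in place, the internal arrays line up as well.
X.sort_indices()
same_layout = (np.array_equal(X.indices, rebuilt.indices)
               and np.array_equal(X.data, rebuilt.data))
print("same values:", same_values, "same layout after sort:", same_layout)
```

Comparing dense arrays is only practical for a toy matrix like this one; on a matrix the size of X_ngrams a sparse comparison would be needed instead.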

You might also consider using M.copy() or M.tocsr(copy=True). M.sorted_indices() returns a copy with sorted indices.
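
A small sketch of the difference between these calls, using a made-up matrix M (Python 3 syntax; the values are illustrative only). As far as I can tell, copy() keeps whatever storage order M has, while sorted_indices() returns a sorted copy and leaves M untouched:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up matrix: row 0 stores columns 2, 0 (unsorted); row 1 stores column 1.
M = csr_matrix((np.array([10, 20, 30]),
                np.array([2, 0, 1]),
                np.array([0, 2, 3])), shape=(2, 3))

plain_copy = M.copy()              # same storage order as M
sorted_copy = M.sorted_indices()   # new matrix with sorted column indices

print("M          :", M.indices, M.data)
print("plain copy :", plain_copy.indices, plain_copy.data)
print("sorted copy:", sorted_copy.indices, sorted_copy.data)
```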

This format:

sparse.csr_matrix((M.data, M.indices, M.indptr)) 

makes a copy that uses the same arrays as M. Or if you want the arrays to be copies as well:

sparse.csr_matrix((M.data.copy(), M.indices.copy(), M.indptr.copy())) 
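
The same pattern covers the goal stated in the question, a matrix with exactly the same structure but a different number in each position. The sketch below is one way to do it, under some assumptions: the matrix M and the element-wise function (np.log1p here) are placeholders, the syntax is Python 3, and shape=M.shape is passed explicitly because otherwise the number of columns is inferred from the largest stored index.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Placeholder matrix standing in for the real data.
M = csr_matrix(np.array([[0, 2, 0, 23],
                         [35, 0, 0, 1]]))

# Placeholder element-wise evaluation, applied to the stored values only.
new_data = np.log1p(M.data.astype(float))

# Reuse M's index arrays so the result has exactly M's sparsity structure.
# Copying them keeps the new matrix independent of M.
result = csr_matrix((new_data, M.indices.copy(), M.indptr.copy()),
                    shape=M.shape)

print(result.toarray())
```

Because new_data is computed from M.data in storage order, it lines up with M.indices whether or not the indices are sorted, as long as all three arrays come from the same matrix.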

2 Comments

Interesting. The value of X_ngrams.has_sorted_indices = 0.
Thank you. Sorting the indices on X_ngrams and then doing the copy based on the indices and indptrs seems to work. I am not really clear I understand why that is necessary, but I will use that approach. My real goal is not actually to copy the matrix, but to perform an element-by-element evaluation that does not seem to be possible with the built-in sparse array operators. By doing what you suggest, I can make a modified copy of the original data and build a new matrix with exactly the same elements. Thank you.
