Scipy csr_matrix does not copy correctly

Question

I am having some issues calculating from one csr_matrix and creating a new one from the results. In trying to track it down, I wrote some simple code to copy the original matrix, and the copy is not the same. The copy works on very small matrices (as given in the documentation), but on a real-world matrix (about 2.5M entries, all of which are non-zero), the results are bizarre. Here is the test code:

    print type(X_ngrams)
    tst = csr_matrix( (X_ngrams.data,X_ngrams.nonzero()))
    print "Original:"
    print "shape ", X_ngrams.shape
    r1,c1=X_ngrams.nonzero()
    print "rows, cols", r1[:10],c1[:10]
    print "indptr ", X_ngrams.indptr[:10]
    print "indices ", X_ngrams.indices[:10]
    print "data[:10] ", X_ngrams.data[:10]
    #
    print
    print "Copy:"
    print "shape ", tst.shape
    r2,c2=tst.nonzero()
    print "rows, cols", r2[:10],c2[:10]
    print "indptr ", tst.indptr[:10]
    print "indices ", tst.indices[:10]
    print "data[:10] ", tst.data[:10]

Here are the results:

    <class 'scipy.sparse.csr.csr_matrix'>
    Original:
    shape  (2257, 202262)
    rows, cols [0 0 0 0 0 0 0 0 0 0] [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
    indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
    indices  [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
    data[:10]  [ 2 1 23 1 35 1 11 1 8 1]

    Copy:
    shape  (2257, 202262)
    rows, cols [0 0 0 0 0 0 0 0 0 0] [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
    indptr  [ 0 518 1247 3156 3634 4368 5594 6670 8540 9257]
    indices  [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]
    data[:10]  [20 1 1 1 1 1 1 1 1 1]

Why is the structure of the copy different? The matrix I need to create should have exactly the same structure, just a different number in each position.

Did you notice that you mislabeled your output of the copy? You print tst.data[:10] after the label "indptr ".
– Warren Weckesser, Commented Jul 23, 2015 at 3:38
Since we don't have the full code, I have to ask: are you sure X_ngrams is a csr_matrix?
– Warren Weckesser, Commented Jul 23, 2015 at 3:40
Warren, thanks for catching the typos. I have updated the code. I also added the type of X_ngrams to show that it is, in fact, a csr_matrix.
– user1427057, Commented Jul 23, 2015 at 21:50

hpaulj · Accepted Answer · 2015-07-24 14:51:51Z

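I can't replicate your problem with the data you give, but I suspect the problem is that X_ngrams does not have sorted indices, while the copy does. Building a csr_matrix from the (data, (rows, cols)) form sorts the column indices within each row; the cols you print from X_ngrams.nonzero() are still in the original, unsorted order.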

Compare the 2 indices. Both are a small sample of the 500+ values in the first row:

    indices [ 69627 70494 168418 174006 157892 161787 146945 148354 51951 53422]
    indices [1439 2461 2561 2683 2748 4279 6212 6275 6332 6611]

The values in the 2nd list are smaller, and sorted: they are the ten lowest column indices in the row. What's the value of X_ngrams.has_sorted_indices?

What you really need to compare is nonzero of both.
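A minimal sketch of the effect, using a small made-up matrix rather than the asker's X_ngrams (the array values and the names X and tst here are illustrative assumptions). It builds a CSR matrix whose column indices are unsorted within each row, rebuilds it the way the question does, and checks that the two hold the same values even though their indices and data arrays are ordered differently:

```python
import numpy as np
from scipy import sparse

# Hypothetical 2x5 matrix; column indices are deliberately unsorted per row.
data = np.array([2, 1, 23, 8, 5])
indices = np.array([4, 0, 2, 3, 1])
indptr = np.array([0, 3, 5])
X = sparse.csr_matrix((data, indices, indptr), shape=(2, 5))

# Rebuild from the (data, (rows, cols)) form, as in the question.
tst = sparse.csr_matrix((X.data, X.nonzero()), shape=X.shape)

print(X.has_sorted_indices)    # flag for the original layout
print(tst.has_sorted_indices)  # flag for the rebuilt layout
print(X.indices, X.data)       # storage order of the original
print(tst.indices, tst.data)   # storage order of the rebuilt matrix

# The values at each (row, col) position agree, whatever the storage order.
same_values = np.array_equal(X.toarray(), tst.toarray())
print(same_values)
```

For a matrix the size of X_ngrams, calling toarray() may be too expensive; it is used here only because the example is tiny.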

One solution is to sort X_ngrams first

 X_ngrams.sort_indices() # sort in place

You might also consider using M.copy() or M.tocsr(copy=True). M.sorted_indices() returns a copy with sorted indices.
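The difference between these options can be sketched on a small made-up matrix (M and its values are assumptions for illustration). sorted_indices() leaves M alone and returns a sorted copy, copy() keeps whatever ordering M has, and sort_indices() reorders M itself:

```python
import numpy as np
from scipy import sparse

# Hypothetical 2x5 matrix with unsorted column indices in each row.
M = sparse.csr_matrix(
    (np.array([2, 1, 23, 8, 5]), np.array([4, 0, 2, 3, 1]), np.array([0, 3, 5])),
    shape=(2, 5),
)
was_sorted = M.has_sorted_indices

S = M.sorted_indices()  # new matrix with sorted indices; M is untouched
C = M.copy()            # independent copy that keeps M's current ordering

M.sort_indices()        # reorders M.indices and M.data in place

print(was_sorted, S.has_sorted_indices, M.has_sorted_indices)
print(C.indices)        # still in the original, unsorted order
print(M.indices, M.data)
```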

This format:

sparse.csr_matrix((M.data, M.indices, M.indptr))

makes a new matrix that uses the same arrays as M. Or, if you want them to be copies:

sparse.csr_matrix((M.data.copy(), M.indices.copy(), M.indptr.copy()))
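A short sketch of what sharing means in practice, on a made-up matrix (the names shared and independent are illustrative). Passing shape explicitly is an addition here: when it is omitted, the number of columns is inferred from the largest column index, which can drop trailing empty columns.

```python
import numpy as np
from scipy import sparse

# Hypothetical small matrix; M.data is [2, 1, 23].
M = sparse.csr_matrix(np.array([[2, 0, 1], [0, 23, 0]]))

# Same underlying arrays as M (no copy is requested).
shared = sparse.csr_matrix((M.data, M.indices, M.indptr), shape=M.shape)

# Fully independent arrays.
independent = sparse.csr_matrix(
    (M.data.copy(), M.indices.copy(), M.indptr.copy()), shape=M.shape
)

print(np.shares_memory(shared.data, M.data))
print(np.shares_memory(independent.data, M.data))

# Writing into M.data is visible through the shared matrix only.
M.data[0] = 100
print(shared[0, 0], independent[0, 0])
```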

Thank you. Sorting the indices on X_ngrams and then doing the copy based on the indices and indptrs seems to work. I am not really clear I understand why that is necessary, but I will use that approach. My real goal is not actually to copy the matrix, but to perform an element-by-element evaluation that does not seem to be possible with the built-in sparse array operators. By doing what you suggest, I can make a modified copy of the original data and build a new matrix with exactly the same elements. Thank you.
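The element-by-element approach described in this comment might look like the sketch below. The transform (np.log1p) and the matrix values are placeholders, not the asker's actual evaluation. Each entry of data stays paired with the same entry of indices, so the new matrix has the same structure as the original:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for X_ngrams.
M = sparse.csr_matrix(
    (np.array([2, 1, 23, 8, 5]), np.array([4, 0, 2, 3, 1]), np.array([0, 3, 5])),
    shape=(2, 5),
)
M.sort_indices()  # optional here; kept to mirror the approach in the comment

# Placeholder element-wise evaluation applied to the stored values only.
new_data = np.log1p(M.data)

# Same structure (indices, indptr), different number in each position.
N = sparse.csr_matrix((new_data, M.indices.copy(), M.indptr.copy()), shape=M.shape)

print(N.toarray())
```

This only touches stored entries, so it suits evaluations that leave implicit zeros at zero.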
