In R, I'm trying to work with a large matrix (39,146,166 rows by 127 columns) and I'm running into memory problems with a number of operations on it. I've determined that about 35% of the entries in the matrix are non-zero and the rest are zeros. Is this sparse enough that I would save memory by representing the matrix with one of R's sparse matrix classes? What is a good rule of thumb for deciding when a matrix is worth representing sparsely?
1 Answer
I don't think a sparse representation will be that much more compact here. In a triplet layout you need three numbers for each entry other than an implicit zero: a row index, a column index, and the value itself. Even if the two indices are 4-byte integers, each stored nonzero costs 16 bytes, versus the 8 bytes per entry of a "serial" dense storage strategy.
By this reasoning, anything above 50% density will take more space as triplets than as a dense matrix, and at your 35% you would save only about 30%. But I'm posting from an iPhone under SF Bay, so I can't test with `object.size`.
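Here is a back-of-the-envelope version of that estimate (a sketch, assuming the 4-byte-index, 8-byte-double triplet layout described above; R's actual per-object overhead will differ a little):

```r
# Rough size comparison for the matrix in the question:
# dense doubles versus a (row, col, value) triplet layout.
n_rows  <- 39146166
n_cols  <- 127
density <- 0.35
n_entries <- n_rows * n_cols

dense_bytes   <- n_entries * 8                     # one 8-byte double per cell
triplet_bytes <- n_entries * density * (4 + 4 + 8) # two ints + one double per nonzero

dense_bytes   / 2^30  # ~37 GiB
triplet_bytes / 2^30  # ~26 GiB, roughly 70% of dense
```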
4 Comments
Ryan C. Thompson
There are a number of sparse matrix formats, and not all of them require three numbers per nonzero entry. For example, this format requires only about two per entry in my case: netlib.org/linalg/html_templates/node92.html
IRTFM
@RyanThompson: That format requires three vectors, not two.
Ryan C. Thompson
Only two of those vectors have an entry for each nonzero data point. The third has just one element per column, which is negligible in my case.
IRTFM
If that were one of the representations in the R Matrix package, you could use it as a basis for estimation. But as far as I can tell, neither the triplet ("T") nor the compressed-column ("C") classes use such a method.
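For readers who want to check this concretely: the sketch below assumes the Matrix package's compressed-column class (dgCMatrix) with its @x/@i/@p slots, and uses rsparsematrix() to build a smaller matrix with the question's column count and density so that object.size() can be compared directly, as the answer suggested:

```r
# Inspect what the Matrix package actually stores for a
# compressed-column sparse matrix with the question's column
# count and density (fewer rows, so it fits in memory).
library(Matrix)

m <- rsparsematrix(10000, 127, density = 0.35)  # a dgCMatrix

length(m@x)  # values: one double per stored nonzero
length(m@i)  # row indices: one integer per stored nonzero
length(m@p)  # column pointers: ncol + 1 = 128, negligible

object.size(m)            # sparse representation
object.size(as.matrix(m)) # dense equivalent
```

On that layout each nonzero costs about 12 bytes (an 8-byte double plus a 4-byte index), so the break-even point is roughly 2/3 density rather than 1/2; at 35% the sparse matrix should use a bit over half the memory of the dense one.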