
I'm reading "Matrix decompositions and latent semantic indexing" (online edition © 2009 Cambridge UP).

I'm trying to understand how you reduce the number of dimensions in a matrix. There's an example on page 13 which I'm trying to replicate using Python's numpy.

Let's call the original occurrence matrix "a" and the three matrices of its singular value decomposition (SVD) "U", "S" and "V".

The trouble I'm having is that after I zero out the smaller singular values in "S", multiplying "U", "S" and "V" together in numpy does not give the answer in the PDF: the bottom three rows are not all zeros. The funny thing is that when I just multiply "S" and "V" I get the right answer.

This is sort of surprising, but multiplying "S" and "V" is actually what Manning and Schütze's book Foundations of Statistical Natural Language Processing says you have to do. It is not, however, what the PDF says to do on page 10.
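Concretely, here is a minimal sketch of what I'm doing (the matrix is the occurrence matrix from the book's example, if I've copied it right; note that numpy returns the singular values as a 1-d vector and returns V already transposed):

>>> import numpy as np
>>> a = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]], dtype=float)
>>> U, S, V = np.linalg.svd(a, full_matrices=False)  # S is a vector; V is really V-transpose
>>> S[2:] = 0                            # zero out the smaller singular values
>>> np.dot(U, np.dot(np.diag(S), V))     # U*S*V: the bottom 3 rows are NOT all zeros
>>> np.dot(np.diag(S), V)                # S*V alone: the bottom 3 rows ARE all zeros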

So what's going on here?

  • Reducing the number of dimensions is a common applied mathematics problem, so you might get a better answer on one of the maths or programming sites if you can get an answer in plain English. I personally could never grok matrices. Commented Jan 3, 2014 at 1:41
  • @hippietrail after seeing Russell's link I guess that truncated SVD is a mathematical operation, so it should be relevant on a maths Stack Exchange. Commented Jan 3, 2014 at 8:45
  • This question appears to be off-topic because it is about a math problem encountered as an adjunct to NLP. It should be moved to one of the math or programming SE sites. I've voted to close, but I'm not sure which site is best to migrate it to. Commented Jan 3, 2014 at 9:11
  • @mtanti: It sounds like a debugging problem, if not a matrix multiplication/decomposition problem. On SO I saw some other questions about different tools giving different results for SVD. Sadly, even though I'm a former pro programmer and current hobby linguist, I can't follow that PDF )-: I'm just trying to get you the best answer whether it's on-topic here or not (-: Commented Jan 3, 2014 at 11:00
  • I have no problem with that @hippietrail Commented Jan 3, 2014 at 16:02

1 Answer


Multiplying S and V is exactly what you have to do to perform dimensionality reduction with SVD/LSA.

>>> import numpy as np
>>> C = np.array([[1, 0, 1, 0, 0, 0],
...               [0, 1, 0, 0, 0, 0],
...               [1, 1, 0, 0, 0, 0],
...               [1, 0, 0, 1, 1, 0],
...               [0, 0, 0, 1, 0, 1]])
>>> from scipy.linalg import svd
>>> U, s, VT = svd(C, full_matrices=False)
>>> s[2:] = 0
>>> np.dot(np.diag(s), VT)
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,  0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,  0.64674677],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

This gives a matrix in which all but the first two rows are zeros, so the zero rows can be removed. In practice, this truncated matrix is the one you would use in applications:

>>> np.dot(np.diag(s[:2]), VT[:2])
array([[ 1.61889806,  0.60487661,  0.44034748,  0.96569316,  0.70302032,  0.26267284],
       [-0.45671719, -0.84256593, -0.29617436,  0.99731918,  0.35057241,  0.64674677]])

What the PDF describes on page 10 is the recipe for getting a low-rank reconstruction of the input C. Rank != dimensionality, and the sheer size and density of the reconstruction matrix make it impractical to use in LSA; its purpose is mostly mathematical. One thing you can do with it is check how good the reconstruction is for various values of k:

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> from scipy.spatial.distance import euclidean
>>> euclidean(C2.ravel(), C.ravel())  # Frobenius norm of C2 - C
1.6677932876555255
>>> C3 = np.dot(U[:, :3], np.dot(np.diag(s[:3]), VT[:3]))
>>> euclidean(C3.ravel(), C.ravel())
1.0747879905228703

Sanity check against scikit-learn's TruncatedSVD (full disclosure: I wrote that):

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=2).fit_transform(C.T)
array([[ 1.61889806, -0.45671719],
       [ 0.60487661, -0.84256593],
       [ 0.44034748, -0.29617436],
       [ 0.96569316,  0.99731918],
       [ 0.70302032,  0.35057241],
       [ 0.26267284,  0.64674677]])
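Note that, up to the arbitrary signs of the singular vectors, this is just the transpose of the two-row matrix np.dot(np.diag(s[:2]), VT[:2]) computed above: fitting on C.T treats the documents as samples, so each row here is one document's two-dimensional representation.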

3 Comments

  • @mtanti: which one, UΣVᵀ?
  • The part where they find what C2 is and the bottom rows are all zeros. That's not UΣVᵀ but ΣVᵀ, right? It's a mistake in the PDF, right?
  • @mtanti: Example 18.4, right? Yes, that seems to be a mistake. What they compute is not C₂, the low-rank reconstruction, but the compressed matrix Σ₂Vᵀ. The true reconstruction does not have these zero rows.
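(A minimal sketch for anyone double-checking, reusing C and svd from the answer above; it confirms that the true rank-2 reconstruction is dense rather than having zero rows:)

>>> U, s, VT = svd(C, full_matrices=False)
>>> C2 = np.dot(U[:, :2], np.dot(np.diag(s[:2]), VT[:2]))
>>> np.count_nonzero(np.all(C2 == 0, axis=1))  # number of all-zero rows in C2
0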
