
I am having trouble understanding the concept of Sum of Squares in the context of distance matrices (Studer et al. 2010).

The Sum of Squares I am familiar with is the classical $SS$ from ANOVA, performed on a contingency table such as

 sex  FE  employment  joblessness  school
   1  16           4            0       0
   2   8           3            1       8

From which I can quickly compute the ANOVA Sum of Squares with

dtm = melt(dt, id.vars = c('sex'))
dtmcount = count(dtm, sex, value)

dtmcount %>%
  group_by() %>%
  mutate(grandmean = mean(n)) %>%
  group_by(sex) %>%
  mutate(SumSqtTotal = (n - grandmean)^2) %>%
  group_by(sex) %>%
  mutate(groupmean = mean(n)) %>%
  mutate(SSW = (n - groupmean)^2) %>%
  group_by(sex) %>%
  mutate(SSB = (grandmean - groupmean)^2) %>%
  group_by() %>%
  summarise(SST = sum(SumSqtTotal), SSW = sum(SSW), SSB = sum(SSB))

# results:
#   SST  SSW  SSB
#    53   20   33

The Sum of Squares makes sense to me in that setting because I understand that we are comparing means and decomposing the squared deviations around them.

However, when it comes to distance matrices, I don't understand what the "mean" becomes.

Consider the same data, but this time we are comparing sequences.

    sex     Sep.93     Oct.93     Nov.93     Dec.93
 1    1     school     school     school     school
 2    1   training   training   training   training
 3    1     school     school     school     school
 4    1   training   training   training   training
 5    1     school     school     school     school
 6    2         FE         FE         FE         FE
 7    2         FE         FE         FE         FE
 8    2   training   training   training   training
 9    2     school     school     school     school
10    2 employment employment employment employment

In a classical sequence analysis, we would use a distance algorithm to compute pairwise dissimilarities and end up with a distance matrix like this one (here, from the Hamming distance):

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    4    0    4    0    4    4    4    0     4
 [2,]    4    0    4    0    4    4    4    0    4     4
 [3,]    0    4    0    4    0    4    4    4    0     4
 [4,]    4    0    4    0    4    4    4    0    4     4
 [5,]    0    4    0    4    0    4    4    4    0     4
 [6,]    4    4    4    4    4    0    0    4    4     4
 [7,]    4    4    4    4    4    0    0    4    4     4
 [8,]    4    0    4    0    4    4    4    0    4     4
 [9,]    0    4    0    4    0    4    4    4    0     4
[10,]    4    4    4    4    4    4    4    4    4     0

The analysis of variance for sequence analysis was developed by Studer et al. (2010).

From the paper (2010), I quote the following:

According to Batagelj (1988), the notion of a gravity center holds for any kind of distances and objects, even though it is not clearly defined for complex nonnumeric objects such as sequences. It is likely that the gravity center does not itself belong to the object space, exactly as the mean of integer values may be a real noninteger value. [...] Even though the gravity center may not be observable, equation (4) provides a comprehensive way to compute the most central sequence, the medoid, of a set using weights. Searching the x that minimizes equation (4) is equivalent to minimizing the sum of the weighted distances from x to all other sequences. (p.8).
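
(To make the last point of the quote concrete: given a pairwise distance matrix, the medoid is simply the observed sequence whose summed, possibly weighted, distances to all the others is smallest. A small R illustration of my own, not taken from the paper; `medoid_index` is a hypothetical helper, and `Ham()` and `dt` are defined under "Data and code" below.)

# medoid: index of the sequence minimizing the (weighted) sum of distances to all others
medoid_index <- function(dm, w = rep(1, nrow(dm))) {
  which.min(dm %*% w)        # row index of the most central observed sequence
}
medoid_index(Ham(dt[, -1]))  # applied to the question's Hamming matrix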

They developed an R function, dissassoc(), in the TraMineR library. It also makes it possible to obtain p-values for covariates using permutation tests.

The output looks like this:

Pseudo ANOVA table:
         SS  df      MSE
Exp     1.7   1 1.700000
Res     9.6   8 1.200000
Total  11.3   9 1.255556

However, I fail to fully understand how to compute the Sum of Squares for such a matrix (manually, if possible), for both the Total and the Explained part. What does the "mean" or "center" become in this context?

Thank you.

Data and code

library(dplyr)
library(reshape2)
library(TraMineR)

# data from TraMineR #
dt = structure(list(
  sex    = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
                     .Label = c("1", "2"), class = "factor"),
  Sep.93 = structure(c(3L, 4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L),
                     .Label = c("FE", "employment", "school", "training"), class = "factor"),
  Oct.93 = structure(c(3L, 4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L),
                     .Label = c("FE", "employment", "school", "training"), class = "factor"),
  Nov.93 = structure(c(3L, 4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L),
                     .Label = c("FE", "employment", "school", "training"), class = "factor"),
  Dec.93 = structure(c(3L, 4L, 3L, 4L, 3L, 1L, 1L, 4L, 3L, 2L),
                     .Label = c("FE", "employment", "school", "training"), class = "factor")),
  .Names = c("sex", "Sep.93", "Oct.93", "Nov.93", "Dec.93"),
  row.names = c(NA, -10L), class = "data.frame")

# To transform the data into COUNT data #
dtm = melt(dt, id.vars = c('sex'))
dtmcount = count(dtm, sex, value)

# The `SS` is easily computed with #
dtmcount %>%
  group_by() %>%
  mutate(grandmean = mean(n)) %>%
  group_by(sex) %>%
  mutate(SumSqtTotal = (n - grandmean)^2) %>%
  group_by(sex) %>%
  mutate(groupmean = mean(n)) %>%
  mutate(SSW = (n - groupmean)^2) %>%
  group_by(sex) %>%
  mutate(SSB = (grandmean - groupmean)^2) %>%
  group_by() %>%
  summarise(SST = sum(SumSqtTotal), SSW = sum(SSW), SSB = sum(SSB))

# Hamming distance function #
Ham = function(d){
  len = nrow(d)
  mat = matrix(0, len, len)
  for(k in 1:len){
    for(i in 1:len){
      mat[k, i] = sum(ifelse(as.numeric(d[k, ] == d[i, ]) == 1, 0, 1))
    }
  }
  return(mat)
}
# used in the example like this:
Ham(dt[, -1])

# The TraMineR Example #
data(mvad)
set.seed(10)
mv = mvad %>% group_by(male) %>% sample_n(5)

# build the state sequence object (this line was missing in the original snippet;
# columns 17:86 are assumed to hold the monthly states, as in the TraMineR mvad examples)
mvad.seq <- seqdef(mv, var = 17:86)

# compute the dissimilarity matrix
mvad.ham <- seqdist(mvad.seq, method = "HAM")

# compute the discrepancy analysis
d = dissassoc(mvad.ham, group = mv$male, R = 10)

References:

Studer, Matthias, et al. "Discrepancy analysis of state sequences." Sociological Methods & Research 40.3 (2011): 471-510.

Gabadinho, Alexis, et al. "Analyzing and visualizing state sequences in R with TraMineR." Journal of Statistical Software 40.4 (2011): 1-37.

  • I understood it that you've got an n cases x variables dataset, and the cases are partitioned into several groups. So you can compute the SStotal, SSbetween, SSwithin quantities. Now, I'd claim that if you compute the n x n distance matrix between the cases and those distances are squared Euclidean, then you can also obtain those three quantities. If that is what you are asking, I could show you - tell me. Hamming distance for binary data is known to be identical to squared Euclidean distance. Commented Sep 30, 2016 at 17:22
  • I can't, however, comment on your example of a distance matrix. It looks a bit strange (all values equal), and I'm probably not familiar with the algorithm you mention for computing the pairwise dissimilarities. Commented Sep 30, 2016 at 17:27
  • Can you show me how to compute it, in the Euclidean and Hamming cases? Thank you very much. Can you explain to me in detail what the center is, and what the SST means here? Commented Sep 30, 2016 at 17:27
  • Only the Euclidean case is theoretically correct. If Hamming or whatever other distance turns out to actually be a synonym (and I know that Hamming is), then only the Euclidean case is needed. Commented Sep 30, 2016 at 17:29
  • I'm not an R user and cannot explain R code to you. I can explain with words and formulas. Commented Sep 30, 2016 at 17:33

2 Answers


Instructions for computing the sums of squares SSt, SSb, SSw out of a matrix of (Euclidean) distances between the cases (data points), without having the cases x variables dataset at hand. You don't need to know the centroids' coordinates (the group means) - they pass invisibly "in the background": the laws of Euclidean geometry allow it.

Let $\bf D$ be the N x N matrix of squared Euclidean distances between the cases, and let $g$ be the N x 1 column of group labels (k groups). Create binary dummy variables, i.e. the N x k design matrix: $\mathbf G = design(g)$.

[I'll accompany the formulas with the example data shared in this answer. The code is SPSS MATRIX session syntax - almost pseudocode, easy to understand.]

This is the raw data X (p=2 variables, columns), with N=6 cases: n(1)=3, n(2)=2, n(3)=1

  V1    V2   Group
 2.06  7.73    1
  .67  5.27    1
 6.62  9.36    1
 3.16  5.23    2
 7.66  1.27    2
 5.59  9.83    3
------------------------------------
comp X= {2.06, 7.73; .67, 5.27; 6.62, 9.36; 3.16, 5.23; 7.66, 1.27; 5.59, 9.83}.
comp g= {1;1;1;2;2;3}.
!seuclid(X%D). /*This function to compute squared euclidean distances is taken from my web-page; it is technically more convenient here than calling the regular SPSS command to do it
print D.
D
    .0000    7.9837   23.4505    7.4600   73.0916   16.8709
   7.9837     .0000   52.1306    6.2017   64.8601   45.0000
  23.4505   52.1306     .0000   29.0285   66.5297    1.2818
   7.4600    6.2017   29.0285     .0000   35.9316   27.0649
  73.0916   64.8601   66.5297   35.9316     .0000   77.5585
  16.8709   45.0000    1.2818   27.0649   77.5585     .0000
comp G= design(g).
print G.
G
  1  0  0
  1  0  0
  1  0  0
  0  1  0
  0  1  0
  0  0  1
comp Nt= nrow(G).
comp n= csum(G).
print Nt. /*This is total N
print n.  /*Group frequencies
Nt
  6
n
  3  2  1

Quick method. Use it if you want just the three scalars. As mentioned here and here, the sum of squared deviations from the centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. From that it follows:
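
(A short justification of that identity, added here for completeness: for points $x_1,\dots,x_n$ with centroid $\bar x$,
$$\sum_{i<j}\|x_i-x_j\|^2=\tfrac{1}{2}\sum_i\sum_j\|x_i-x_j\|^2=n\sum_i\|x_i\|^2-\Big\|\sum_i x_i\Big\|^2=n\sum_i\|x_i-\bar x\|^2,$$
so dividing the sum over all pairs by $n$ gives the sum of squared deviations from the centroid.)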

Total sum-of-squares (of deviations from the grand centroid): $SS_t= \frac{\sum \bf D}{2N}$, where $\sum$ denotes the sum over the entire matrix.

Pooled within-group sum-of-squares (of deviations from the group centroids): $SS_w= \sum \frac{diag(\bf G'DG)}{2\bf n'}$, where $\bf n$ is the k-length row vector of within-group frequencies, i.e. the column sums of $\bf G$. Without the outer summation $\sum$ you have the k-length column vector of within-group $SS_w$, one value per group.

Between-group sum-of-squares is, of course, $SS_b=SS_t-SS_w$.

comp SSt= msum(D)/(2*Nt).
print SSt.
SSt
  89.07401667
comp SSw= diag(t(G)*D*G)/(2*t(n)).
print SSw. /*By groups
SSw
  27.85493333
  17.96580000
   .00000000
comp SSw= csum(SSw).
print SSw. /*And summed (pooled SSw)
SSw
  45.82073333
comp SSb= SSt-SSw.
print SSb.
SSb
  43.25328333
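
(For readers who, like the OP, work in R rather than SPSS: here is a minimal R transcription of the quick method - my own sketch, not part of the original answer - using the same example data; dist() and model.matrix() stand in for the SPSS functions above.)

# same example data as above
X <- matrix(c(2.06, 7.73,  0.67, 5.27,  6.62, 9.36,
              3.16, 5.23,  7.66, 1.27,  5.59, 9.83),
            ncol = 2, byrow = TRUE)
g <- factor(c(1, 1, 1, 2, 2, 3))

D  <- as.matrix(dist(X))^2        # N x N matrix of squared Euclidean distances
G  <- model.matrix(~ g - 1)       # N x k design (dummy) matrix
Nt <- nrow(D)                     # total N
n  <- colSums(G)                  # group frequencies

SSt <- sum(D) / (2 * Nt)                           # total SS: 89.074...
SSw_by_group <- diag(t(G) %*% D %*% G) / (2 * n)   # per group: 27.855, 17.966, 0
SSw <- sum(SSw_by_group)                           # pooled within SS: 45.821...
SSb <- SSt - SSw                                   # between SS: 43.253...

These reproduce the SSt, SSw and SSb printed by the SPSS session above.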

Slower method. Use it if you also need to know some multivariate properties of the data, such as the eigenvalues of the principal directions spanned by the data cloud (needed, for example, in multidimensional scaling).

Convert $\bf D$ into its double-centered matrix $\bf S$, explained here and here. As noted in the 2nd link, one of its properties is that $trace(\mathbf S)=trace(\mathbf {T})=SS_t$, and the (first nonzero) eigenvalues of $\bf S$ are the same as those of $\bf T$, the scatter matrix of the original (or implied, hypothetical) cases x variables data.

comp rmean= (rsum(D)/ncol(D))*make(1,ncol(D),1).
comp S= (rmean+t(rmean)-D-msum(D)/ncol(D)**2)/2.
print S.
S
    6.63045    6.58188   -1.46444     .96961  -14.15579    1.43828
    6.58188   14.51701  -11.86120    5.54205   -6.09675   -8.68299
   -1.46444  -11.86120   13.89118   -6.18427   -7.24447   12.86320
     .96961    5.54205   -6.18427    2.76878    2.49338   -5.58955
  -14.15579   -6.09675   -7.24447    2.49338   38.14958  -13.14595
    1.43828   -8.68299   12.86320   -5.58955  -13.14595   13.11701
comp SSt= trace(S).
print SSt.
SSt
  89.07401667

Of course, you can likewise apply the double centration to each group's distance submatrix; the traces of the resulting $\bf S$ matrices are each group's within SS, and summing them yields the pooled $SS_w$.
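
(The same double centration in R, again a sketch of my own, reusing D, Nt and g from the R block above:)

dc <- function(D2) {                     # Gower double centration of a squared-distance matrix
  J <- diag(nrow(D2)) - 1 / nrow(D2)     # centering matrix I - 11'/n
  -0.5 * J %*% D2 %*% J
}
S <- dc(D)
sum(diag(S))                             # trace(S) = SSt = 89.074...
eigen(S)$values                          # first nonzero eigenvalues = those of the scatter matrix T

# per group: the trace of each group's double-centered submatrix is that group's within SS
sapply(split(seq_len(Nt), g),
       function(i) sum(diag(dc(D[i, i, drop = FALSE]))))
# 27.855 17.966 0.000 ; their sum is the pooled SSw = 45.821...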

If you need to compute the matrix $\bf C$ of squared Euclidean distances between the group centroids, compute these ingredients:

$\bf E= G'DG$;

$\mathbf Q= \frac{diag(\mathbf E)\,\mathbf n^2}{2}$, where $\mathbf n^2$ is $\mathbf n$ squared elementwise ($\mathbf n$ is a row vector and $diag(\mathbf E)$ a column, so the product is a k x k matrix);

$\bf F=n'n$.

Then $\bf C= (E-\frac{Q+Q'}{F})/F$, with the divisions taken elementwise.

comp E= t(G)*D*G.
print E.
E
  167.1296  247.1716   63.1527
  247.1716   71.8632  104.6234
   63.1527  104.6234     .0000
comp Q= (diag(E)*n&**2)/2.
print Q.
Q
  752.0832  334.2592   83.5648
  323.3844  143.7264   35.9316
     .0000     .0000     .0000
comp F= sscp(n).
print F.
F
  9.0000  6.0000  3.0000
  6.0000  4.0000  2.0000
  3.0000  2.0000  1.0000
comp C= (E-(Q+t(Q))/F)/F.
print C.
C
    .0000  22.9274  11.7659
  22.9274    .0000  43.3288
  11.7659  43.3288    .0000

Of course, $SS_b$ - if you do not know it already - can be obtained from $\bf C$ (with the within-group frequencies as weights).
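
(And the centroid-distance matrix in R, a sketch under the same assumptions as the earlier R blocks; the last line, expressing $SS_b$ as a frequency-weighted sum of the centroid distances divided by N, is my own reading of the sentence above:)

E  <- t(G) %*% D %*% G                   # k x k sums of squared distances within/between groups
Q  <- outer(diag(E), n^2) / 2
Fm <- outer(n, n)                        # named Fm to avoid masking R's F (FALSE) alias
C  <- (E - (Q + t(Q)) / Fm) / Fm         # squared Euclidean distances between group centroids
sum(C[upper.tri(C)] * Fm[upper.tri(Fm)]) / Nt   # frequency-weighted pairs: SSb = 43.253...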


Bonus instructions. How to compute SSt, SSb, SSw when you do have the original N cases x p variables data X. Many ways are possible. One of the most efficient (fast) matrix ways, for data of typical size, is as follows.

Matrix of group means (centroids), k groups x p variables: $\bf M= \frac{G'X}{n'[1]}$, where $[1]$ is the p-length row of ones; $\bf n$ is the row vector defined earlier; $\bf G$ is as above; $\bf X$ is the data with its columns (variables) centered about their grand means.

Total scatter matrix $\bf T=X'X$, and $SS_t= trace(\mathbf T)$.

Between-group scatter matrix $\bf B=(GM)'(GM)$, and $SS_b= trace(\mathbf B)$.

Pooled within-group scatter matrix $\bf W=T-B$, and $SS_w= trace(\mathbf W)= SS_t-SS_b$.

X with columns centered
  -2.2333   1.2817
  -3.6233  -1.1783
   2.3267   2.9117
  -1.1333  -1.2183
   3.3667  -5.1783
   1.2967   3.3817
comp M= (t(G)*X)/(t(n)*make(1,ncol(X),1)).
print M.
M
  -1.1767   1.0050
   1.1167  -3.1983
   1.2967   3.3817
comp Tot= sscp(X). /*T scatter matrix
print Tot.
print trace(Tot).
Tot
  37.8299  -3.4865
  -3.4865  51.2441
TRACE(Tot)
  89.07401667
comp GM= G*M.
comp B= sscp(GM). /*B scatter matrix
print B.
print trace(B).
B
   8.3289  -6.3057
  -6.3057  34.9244
TRACE(B)
  43.25328333
comp W= Tot-B. /*W scatter matrix
print W.
print trace(W).
W
  29.5011   2.8192
   2.8192  16.3197
TRACE(W)
  45.82073333

(If you do not center $\bf X$ initially, $\bf W=T-B$ still holds and gives the same $\bf W$ as before; however, the $\bf M$, $\bf T$, and $\bf B$ matrices will differ from the above.)
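
(The R equivalent of the bonus computation, once more a hedged sketch of my own, reusing X, G and n from the earlier R block:)

Xc <- scale(X, center = TRUE, scale = FALSE)   # center columns at their grand means
M  <- (t(G) %*% Xc) / n                        # k x p matrix of group centroids
Tt <- t(Xc) %*% Xc                             # total scatter matrix T
B  <- crossprod(G %*% M)                       # between-group scatter matrix (GM)'(GM)
W  <- Tt - B                                   # pooled within-group scatter matrix
c(SSt = sum(diag(Tt)), SSb = sum(diag(B)), SSw = sum(diag(W)))  # 89.074, 43.253, 45.821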

  • Thank you for your answer. Just one point. The grand centroid and the group centroids are actually not "observed", and not computed. Unlike ANOVA, where we compare values to the mean(s), here we don't do that. Right? We are simply "summing up" the matrices? I am not sure I fully understand the concept of "centroids". Commented Oct 1, 2016 at 9:52
  • The word "centroid" in multivariate statistics and in Euclidean geometry has the precise meaning of "arithmetic mean" (multivariate, in a multivariate situation). So any plain arithmetic mean is a particular case of a centroid. When you have cases by variables data, you can compute these centroids (means) explicitly as coordinates - i.e. as values on the variables. When you don't have that data but have, say, Euclidean distances instead, you can't compute the centroids as coordinates; still, you can compute the deviations of the data points from the centroids, as well as the distances between the latter. Commented Oct 1, 2016 at 10:24
  • I looked at your example here (stats.stackexchange.com/questions/158210/…) and I notice that the SSt computed from k-means is not the same as the SSt from the raw distance matrix. Why is that? Commented Oct 1, 2016 at 14:44
  • @giacomoV, 1) What is a "raw" distance matrix? 2) How do you define SSt - sums of squares, or sums of squares of deviations (from the centroid)? If you have comments on my other answer, please post a comment there, not here. Commented Oct 1, 2016 at 14:52
  • The sum of the entire matrix (as you explained in the quick method). Is the SSt different in the k-means case? I am looking for references that go into detail with code and examples, but I can't really find any. Do you have any recommendation? Don't you have a full example, step by step, with code that I can reproduce by hand? I ran the Euclidean distance on the data points you gave in the post just mentioned, but I don't get the same results as your matrix of squared Euclidean distances. I also can't reproduce its sum/2, the sum of the distances = 534.4441000, as you gave it. Commented Oct 1, 2016 at 14:58

Sum of squares is closely tied to Euclidean distance. Hamming (on bits) is a special case, since on 0/1 data it coincides with squared Euclidean distance, but you cannot conclude from an arbitrary distance matrix what the SSQ etc. are.
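
(A quick R check of that equivalence, my own illustration: on 0/1 vectors, each mismatching position contributes exactly 1 to both the Hamming count and the squared Euclidean distance.)

a <- c(1, 0, 1, 1, 0, 0, 1, 0)
b <- c(1, 1, 0, 1, 0, 1, 1, 1)
sum(a != b)       # Hamming distance: 4
sum((a - b)^2)    # squared Euclidean distance: also 4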

Recall how the sum-of-squares is usually defined, compared to Euclidean distance: $$ SSQ(A,B) = \sum_{a\in A} \sum_{b\in B} \sum_i (a_i-b_i)^2 \\ = \sum_{a\in A} \sum_{b\in B} d^2_{\text{Euclidean}}(a, b) $$

So if your distance matrix stores Euclidean distances (or squared Euclidean), then you can use that second line to compute SSQ.

If you have a different distance function, the result will usually be wrong (unless you use a different definition of "sum of squares" than the usual one; I believe you could use the second line, but this may cause trouble in other situations such as k-means).

  • Your second line $\sum_{a\in A} d^2_{\text{Euclidean}}(a, \mu)$ looks (tautologically with the 1st one) like a distance from point a to the centroid mu (the i-th one?). But the OP's question is about pairwise distances between data points. Commented Sep 30, 2016 at 22:21
  • Yes, I used the center as an example so as not to have another sum in there. But I will change it to a second set. Commented Oct 1, 2016 at 5:17
