Revisions to PCA on correlation or covariance?

added 104 characters in body

edited Oct 10, 2018 at 11:27

You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales.

Using the correlation matrix is equivalent to standardizing each of the variables (to mean 0 and standard deviation 1). In general, PCA with and without standardizing will give different results, especially when the scales are different.
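As a quick check of that equivalence, here is a minimal sketch on made-up data. The matrix `X` and its two columns are hypothetical and only mimic variables on very different scales; they are not part of the heptathlon example.

```r
# Made-up data: two variables on very different scales (illustration only)
set.seed(1)
X <- cbind(small = rnorm(50, mean = 1.8, sd = 0.1),
           large = rnorm(50, mean = 120, sd = 8))

Z <- scale(X)   # centre each column and divide it by its standard deviation

# The covariance of the standardized data equals the correlation of the raw data
all.equal(cov(Z), cor(X), check.attributes = FALSE)

# So PCA with scaling matches an eigendecomposition of the correlation matrix
pc  <- prcomp(X, scale. = TRUE)
eig <- eigen(cor(X))
all.equal(pc$sdev^2, eig$values)
```

Note that the argument of `prcomp` is spelled `scale.`; writing `scale=TRUE` also works, through R's partial argument matching.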

As an example, take a look at this R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (run 800m) are around 120.

    library(HSAUR)
    heptathlon[,-8]   # look at heptathlon data (excluding 'score' variable)

This outputs:

                        hurdles highjump  shot run200m longjump javelin run800m
    Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
    John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
    Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
    Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
    Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
    ...
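What drives a covariance-based PCA is each variable's spread (its variance) rather than its average, although the two often go together for raw measurements. A rough sketch of that comparison, using only the five rows printed above typed in by hand (the name `hep5` is made up here, and the full data set will give different numbers):

```r
# The five rows printed above, typed in by hand (not the full data set)
hep5 <- data.frame(
  hurdles  = c(12.69, 12.85, 13.20, 13.61, 13.51),
  highjump = c(1.86, 1.80, 1.83, 1.80, 1.74),
  shot     = c(15.80, 16.23, 14.20, 15.23, 14.76),
  run200m  = c(22.56, 23.65, 23.10, 23.92, 23.93),
  longjump = c(7.27, 6.71, 6.68, 6.25, 6.32),
  javelin  = c(45.66, 42.56, 44.54, 42.78, 47.46),
  run800m  = c(128.51, 126.12, 124.20, 132.24, 127.90)
)

means <- colMeans(hep5)      # typical size of each variable
vars  <- sapply(hep5, var)   # spread: the diagonal of the covariance matrix

# Side-by-side view of means and variances
round(rbind(mean = means, variance = vars), 3)
```

Even in this small excerpt, the variance of `run800m` is several orders of magnitude larger than that of `highjump`, which is the kind of imbalance that lets one variable dominate an unscaled PCA.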

Now let's do PCA on covariance and on correlation:

    # scale=T bases the PCA on the correlation matrix
    hep.PC.cor = prcomp(heptathlon[,-8], scale=TRUE)
    hep.PC.cov = prcomp(heptathlon[,-8], scale=FALSE)

    biplot(hep.PC.cov)
    biplot(hep.PC.cor)

Notice that PCA on covariance is dominated by run800m and javelin: PC1 is almost equal to run800m (and explains $82\%$ of the variance) and PC2 is almost equal to javelin (together they explain $97\%$). PCA on correlation is much more informative and reveals some structure in the data and relationships between variables (but note that the explained variances drop to $64\%$ and $71\%$).
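The explained-variance figures and the dominance of single variables can be read off the fitted objects. The following is a sketch that assumes the `HSAUR` package is installed; the helper `explained` and the object names `pc.cov` and `pc.cor` are made up for this illustration.

```r
library(HSAUR)                          # same package as in the answer
data("heptathlon", package = "HSAUR")   # make sure the data set is loaded

X <- heptathlon[, -8]                   # drop the 'score' column, as above
pc.cov <- prcomp(X, scale. = FALSE)     # PCA on the covariance matrix
pc.cor <- prcomp(X, scale. = TRUE)      # PCA on the correlation matrix

# Share of the total variance carried by each component
explained <- function(pc) pc$sdev^2 / sum(pc$sdev^2)

round(100 * explained(pc.cov), 1)           # per-component percentages
round(100 * cumsum(explained(pc.cov)), 1)   # running totals
round(100 * explained(pc.cor), 1)
round(100 * cumsum(explained(pc.cor)), 1)

# Loadings of the first two covariance-based components:
# a loading near +1 or -1 means the component is almost a single variable
round(pc.cov$rotation[, 1:2], 2)
```

`summary(pc.cov)` prints the same proportions in a table, should that be more convenient.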

Notice also that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.

fixed my mistake in the previous edit (inserted [,-8] back)

edited Sep 30, 2015 at 12:58

amoeba

You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales.

Using the correlation matrix standardises the data. In general, the two approaches give different results, especially when the scales are different.

As an example, take a look at this R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (run 800m) are around 120.

    library(HSAUR)
    heptathlon[,-8]   # look at heptathlon data (excluding 'score' variable)

This outputs:

                        hurdles highjump  shot run200m longjump javelin run800m
    Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
    John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
    Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
    Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
    Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
    ...

Now let's do PCA on covariance and on correlation:

    # scale=T bases the PCA on the correlation matrix
    hep.PC.cor = prcomp(heptathlon[,-8], scale=TRUE)
    hep.PC.cov = prcomp(heptathlon[,-8], scale=FALSE)

    biplot(hep.PC.cov)
    biplot(hep.PC.cor)

Notice that PCA on covariance is dominated by run800m and javelin: PC1 is almost equal to run800m (and explains $82\%$ of the variance) and PC2 is almost equal to javelin (together they explain $97\%$). PCA on correlation is much more informative and reveals some structure in the data and relationships between variables (but note that the explained variances drop to $64\%$ and $71\%$).

Notice also that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.

note about the explained variance when PCA is done on covariance

edited Sep 29, 2015 at 11:28

amoeba

You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales.

Using the correlation matrix standardises the data. In general, the two approaches give different results, especially when the scales are different.

As an example, take a look at this R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (run 800m) are around 120.

    library(HSAUR)
    heptathlon   # look at heptathlon data

This outputs:

                        hurdles highjump  shot run200m longjump javelin run800m
    Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
    John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
    Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
    Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
    Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
    ...

Now let's do PCA on covariance and on correlation:

    # PCA
    # scale=T bases the PCA on the correlation matrix
    hep.PC.cor = prcomp(heptathlon, scale=TRUE)
    hep.PC.cov = prcomp(heptathlon, scale=FALSE)

    biplot(hep.PC.cov)
    biplot(hep.PC.cor)

Notice that PCA on covariance is dominated by run800m and javelin: PC1 is almost equal to run800m (and explains $98\%$ of the variance) and PC2 is almost equal to javelin. PCA on correlation looks much more informative and reveals some structure in the data and relationships between variables.

Notice also that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.
