
I'm doing principal component analysis on my dataset and my professor told me that I should normalize the data before doing the analysis. Why?

  • What would happen if I did PCA without normalization?
  • Why do we normalize data in general?
  • Could someone give a clear and intuitive example demonstrating the consequences of not normalizing the data before analysis?
  • If some variables have a large variance and some small, PCA (maximizing variance) will load on the large variances. For example, if you change one variable from km to cm (increasing its variance), it may go from having little impact to dominating the first principle component. If you want your PCA to be independent of such rescaling, standardizing the variables will do that. On the other hand, if the specific scale of your variables matters (in that you want your PCA to be in that scale), maybe you don't want to standardize. Commented Sep 4, 2013 at 9:20
  • Watch out: normalize in statistics sometimes carries the meaning of transform to be closer to a normal or Gaussian distribution. As @Glen_b exemplifies, it is better to talk of standardizing when what is meant is scaling by (value - mean)/SD (or some other specified standardization). Commented Sep 4, 2013 at 9:37
  • Ouch, that 'principle' instead of 'principal' in my comment up there is going to drive me crazy every time I look at it. Commented Sep 4, 2013 at 9:59
  • @Glen_b In principle, you do know how to spell it. Getting it right all the time is the principal difficulty. Commented Sep 4, 2013 at 10:07
  • These are multiple questions, so there is no one exact duplicate, but every one of them is extensively and well discussed elsewhere on this site. A good search to begin with is on pca correl* covariance. Commented Sep 4, 2013 at 17:50

2 Answers


Normalization is important in PCA because PCA is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance. The first plot below shows the amount of total variance explained by the different principal components when we have not normalized the data. As you can see, component one appears to explain most of the variance in the data.

[Plot: proportion of variance explained by each principal component, without normalization]

If you look at the second plot, where we have normalized the data first, it is clear that the other components contribute as well. The reason is that PCA seeks to maximize the variance of each component, and the covariance matrix of this particular dataset is:

```
             Murder   Assault   UrbanPop      Rape
Murder    18.970465  291.0624   4.386204  22.99141
Assault  291.062367 6945.1657 312.275102 519.26906
UrbanPop   4.386204  312.2751 209.518776  55.76808
Rape      22.991412  519.2691  55.768082  87.72916
```

Given this structure, PCA will project as much as possible onto the direction of Assault, since its variance is so much larger than the others. So for finding features usable in any kind of model, a PCA without normalization would perform worse than one with normalization.

[Plot: proportion of variance explained by each principal component, with normalization]
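
For concreteness, here is a minimal sketch of the comparison above using scikit-learn (which a commenter below confirms reproduces these results) and fetching USArrests through statsmodels; both libraries are my choice of tooling, not something the answer itself specifies.

```python
# Minimal sketch: PCA on USArrests with and without standardization.
# scikit-learn and statsmodels are assumed to be installed.
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# USArrests: Murder, Assault, UrbanPop, Rape for the 50 US states.
X = sm.datasets.get_rdataset("USArrests").data

print(X.cov().round(2))  # Assault's variance dwarfs the others, as shown above

# PCA on the raw (only centered) data: the first component is dominated
# by the Assault direction and soaks up almost all of the variance.
raw = PCA().fit(X)
print("raw:         ", raw.explained_variance_ratio_.round(3))

# PCA after standardizing each column to zero mean and unit variance:
# the explained variance spreads across several components.
std = PCA().fit(StandardScaler().fit_transform(X))
print("standardized:", std.explained_variance_ratio_.round(3))
```

Note that standardizing every column to unit variance is equivalent to running PCA on the correlation matrix rather than the covariance matrix, which is why no single variable can dominate after the rescaling.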

  • You explain standardizing, not normalization, but anyway, good stuff here :) Commented Nov 8, 2014 at 19:09
  • @Erogol that is true. Commented Nov 18, 2014 at 22:09
  • Great post! Perfectly reproducible with sklearn. BTW, the USArrests dataset can be downloaded from here: vincentarelbundock.github.io/Rdatasets/datasets.html Commented Apr 27, 2017 at 12:23
  • @gary this is a covariance matrix, not a correlation matrix, therefore the diagonal elements are not necessarily equal to 1. Commented Aug 7, 2019 at 19:45
  • For some reason normalization and standardization are used quite interchangeably. Good explanation. Short story: just subtract the mean and we're good to go! Commented Dec 8, 2021 at 10:11

The term normalization is used in many contexts, with distinct but related meanings. Basically, normalizing means transforming so as to render something normal. When data are seen as vectors, normalizing means transforming the vector so that it has unit norm. When data are thought of as random variables, normalizing means transforming to a normal distribution. When the data are hypothesized to be normal, normalizing means transforming to unit variance.
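
As a concrete illustration of the vector and unit-variance senses, here is a small sketch (my example on a toy vector, not part of the answer; numpy is assumed):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 10.0])

# Vector sense: rescale to unit Euclidean norm.
unit = x / np.linalg.norm(x)
print(unit, np.linalg.norm(unit))      # second value is 1.0

# Standardizing sense: shift to zero mean, scale to unit variance.
z = (x - x.mean()) / x.std()
print(z, z.mean(), z.std())            # mean 0.0, std 1.0
```

The distributional sense (transforming toward a Gaussian shape, e.g. via a rank or power transform) is a different operation again, which is why it pays to say explicitly which one is meant.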
