UNTRANSFORMED (RAW) DATA: If you have raw, untransformed variables with widely varying scales -- e.g., caloric intake per day, gene expression, or ELISA/Luminex measurements in units of µg/dl or ng/dl spanning several orders of magnitude of protein expression -- then use correlation as the input to PCA. However, if all of your data share a similar range and scale, e.g., gene expression from the same platform, or log equity asset returns, then using correlation will throw out a tremendous amount of information.

You actually don't need to agonize over the choice between the correlation matrix $\mathbf{R}$ and the covariance matrix $\mathbf{C}$ as the input to PCA; rather, look at the diagonal values of $\mathbf{C}$ and $\mathbf{R}$. On the diagonal of $\mathbf{C}$ you may observe a variance of $100$ for one variable and $10$ for another. The diagonal of $\mathbf{R}$, however, contains all ones, so the variance of each variable is essentially rescaled to $1$ when you use the $\mathbf{R}$ matrix.
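
A quick numpy check of this diagnostic; the two simulated variables (with variances of roughly $100$ and $10$) are assumptions chosen to mirror the example, not data from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example: one variable with variance ~100, another with
# variance ~10, mirroring the diagonal values described above.
x = rng.normal(scale=10.0, size=500)            # variance ~100
y = 0.1 * x + rng.normal(scale=3.0, size=500)   # variance ~10

X = np.column_stack([x, y])
C = np.cov(X, rowvar=False)        # covariance matrix C
R = np.corrcoef(X, rowvar=False)   # correlation matrix R

print(np.diag(C))  # roughly [100, 10]: the raw variances
print(np.diag(R))  # [1, 1]: every variable rescaled to unit variance
```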

TRANSFORMED DATA: If the data have been transformed via normalization, percentiles, or mean-zero standardization (i.e., $Z$-scores), so that the range and scale of all the continuous variables is the same, then you could use the covariance matrix $\mathbf{C}$ without any problems (using the correlation matrix amounts to mean-zero standardizing the variables). Recall, however, that these transformations will not remove skewness (i.e., left or right tails in histograms) in your variables prior to running PCA. Typical PCA analysis does not involve removal of skewness; however, some readers may need to remove skewness to meet strict normality constraints.
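
To make the parenthetical claim concrete, a short sketch (the simulated mixed-scale data are an illustrative assumption) verifying that the covariance matrix of $Z$-scored data equals the correlation matrix of the raw data, so PCA on either input yields the same decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])  # three scales

# Mean-zero standardize (Z-scores); ddof=1 matches np.cov's normalization.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# cov(Z) reproduces corr(X), so PCA on C of standardized data and PCA on
# R of raw data give the same eigenvalues and eigenvectors.
print(np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False)))  # True
```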

In summary, use the correlation matrix $\mathbf{R}$ when the range and scale of the variables differ widely, and use the covariance matrix $\mathbf{C}$ to preserve variance when the variables share a similar range and scale or the same units of measure.

SKEWED VARIABLES: If any of the variables are skewed with left or right tails in their histograms, e.g., the Shapiro-Wilk or Lilliefors normality test is significant $(P<0.05)$, then there may be some issues if you need to apply the normality assumption. In this case, use the van der Waerden scores (transforms) determined from each variable. The van der Waerden (VDW) score for a single observation is merely the inverse cumulative (standard) normal mapping of the observation's percentile value. For example, suppose you have $n=100$ observations of a continuous variable; you can determine the VDW scores as follows:

  1. First, sort the values in ascending order, then assign ranks, so you would obtain ranks of $R_i=1,2,\ldots,100.$
  2. Next, determine the percentile for each observation as $pct_i=R_i/(n+1)$.
  3. Once the percentile values are obtained, input them into the inverse CDF of the standard normal distribution, i.e., $N(0,1)$, to obtain the $Z$-score for each observation: $Z_i=\Phi^{-1}(pct_i)$.

For example, plugging in a $pct_i$ value of $0.025$ gives $-1.96=\Phi^{-1}(0.025)$; likewise, for a plug-in value of $pct_i=0.975$, you'll get $1.96=\Phi^{-1}(0.975)$.
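
A minimal numpy/scipy sketch of the three steps above; the `vdw_scores` helper name and the use of average ranks for ties are my assumptions:

```python
import numpy as np
from scipy.stats import norm, rankdata

def vdw_scores(x):
    """Van der Waerden scores: Phi^{-1}(R_i / (n + 1)) for each observation."""
    n = len(x)
    ranks = rankdata(x)      # step 1: ranks 1..n (ties receive average ranks)
    pct = ranks / (n + 1)    # step 2: percentile pct_i = R_i / (n + 1)
    return norm.ppf(pct)     # step 3: Z_i = Phi^{-1}(pct_i)

print(norm.ppf([0.025, 0.975]))  # [-1.96, 1.96], matching the example above

rng = np.random.default_rng(0)
x = rng.lognormal(size=100)      # a right-skewed variable, n = 100
z = vdw_scores(x)                # scores span Phi^{-1}(1/101) to Phi^{-1}(100/101)
```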

Use of VDW scores is very popular in genetics, where many variables are transformed into VDW scores and then input into analyses. The advantage of using VDW scores is that skewness and outlier effects are removed from the data; they can be used when the goal is to perform an analysis under the constraints of normality -- and every variable needs to be standard normally distributed with no skewness or outliers.
