Is this an example of where I shouldn't scale before doing PCA / PLS?

Question

I'm working with NMR spectra (it's a common chemical test).

There are various peaks of the signal across a range of ppm values. I'm trying to relate the NMR spectra of various samples to a measured property of the samples.

PCA and PLS seem like valuable tools. I'll just talk about PCA for this example, but my question also applies to PLS.

The data that I send to PCA has this structure: each column is a ppm value, each row is a sample, and the value is the sample's signal at that ppm.

For this data that I've made up, it seems like there should be one dominant principal component that describes the size of the peak. This is indeed the case if I don't scale the columns. But if I do scale the columns (which is always what I've done with PCA), the Scree plot makes no sense to me. Why are so many principal components needed to describe the variance?

Is this a problem where I shouldn't scale the columns? What about when I apply to real data, where there are multiple peaks of widely varying heights? Is there a "best practice" for doing PCA and PLS on spectral data? (I don't have a background in biostatistics or chemometrics).

Spectra

The matrix for PCA

         3            3.01          3.02          3.03         3.04         ...
sample1  0.002798792  0.0031477185  0.0090656919  0.008150071  0.005397167  ...
sample2  0.009681631  0.0006409005  0.0003475848  0.009898928  0.003882373  ...
sample3  0.001149334  0.0020842757  0.0057974388  0.008381035  0.004349227  ...
sample4  0.009143676  0.0003381718  0.0078882525  0.009483161  0.007021517  ...

R code

library(dplyr)
library(reshape2)
library(ggplot2)
library(factoextra)

# make up some spectral data
D <- tibble(ppm = seq(3, 4, 0.01),
            sample1 = dnorm(ppm, 3.2, 0.03),
            sample2 = 0.9 * sample1,
            sample3 = 0.8 * sample1,
            sample4 = 0.5 * sample1)
Noise <- runif(nrow(D) * (ncol(D) - 1), 0, 0.01)
D[, -1] <- D[, -1] + Noise  # add a little noise so things aren't perfectly correlated

# plot the spectra
Dm <- melt(D, id.var = "ppm", value.name = "signal", variable.name = "sample")  # long format
Dm %>%
  ggplot() +
  aes(ppm, signal, color = sample) +
  geom_line() +
  geom_point()

# Prepare data for PCA
Dc <- dcast(Dm, sample ~ ppm, value.var = "signal")  # each ppm's signal is a variable
Dmat <- as.matrix(select(Dc, -c(sample)))  # drop the sample column; only want to work on the signals
rownames(Dmat) <- Dc$sample

# PCA with scaling
Ps <- prcomp(Dmat, scale. = TRUE, center = TRUE)
fviz_screeplot(Ps) + ggtitle("Scree plot with scaling")

# PCA without scaling
P <- prcomp(Dmat, scale. = FALSE, center = TRUE)
fviz_screeplot(P) + ggtitle("Scree plot without scaling")

Scaling is recommended when the variables have significantly different ranges, for instance when the first variable ranges between 1 and 10 and the second between 0 and 1e6. This is usually not the case for NMR data. Scaling also makes the regression coefficients comparable among variables. But I feel that scaling adds another layer of modeling by adding a standard-deviation assumption on top of the mean assumption. Most of the time, applying scaling to spectral data slightly decreased my models' performance.
– gunakkoc, Commented May 10, 2019 at 13:03
If you are after a predictive model, I would recommend testing both with and without scaling to see what works better for your case.
– gunakkoc, Commented May 10, 2019 at 13:05
Arthur, please check my edit. You can revert it if my suggested correction is actually wrong. And welcome to Cross Validated.
– cbeleites, Commented May 10, 2019 at 22:49
The main issue here is that this is an $n \leq p$ PCA. Just by looking at the spectra it is clear that there is only a single PC. We need more PCs if we scale the data because we artificially inflate their variability when centring, and the peak at approximately 3.2 no longer dominates the variability of the sample.
– usεr11852, Commented May 10, 2019 at 23:12

cbeleites · Accepted Answer · 2019-05-10 23:12:02Z

As @theGD already pointed out in the comment, scaling is often not needed for spectroscopic data as the features already have a common intensity axis.

Here's my guess at what's happening when you scale:

You have spectra with very nice zero baselines. In other words, all those features outside your analyte signal are constant mean + some noise.

If you scale such a feature, this noise will be inflated/amplified until it has unit variance. Just like the scaled feature(s) of your analyte signal.

After scaling, you'll then have lots of noise features of similar variance (i.e. importance) to your analyte signal features. The only thing that can now make your analyte signal stick out of this "forest of noise" is if your analyte signal covers a sufficiently large number of features. But then, if you have many noise features and not that many samples, the chance of having accidentally correlated noise features increases. This will lead to noise-only PCs that nevertheless have substantial eigenvalues (if you have many noise-only features and few analyte features, your analyte signal can actually end up in a higher PC!).
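The inflation described above can be checked with a minimal sketch, assuming toy spectra similar to those in the question (one Gaussian peak at 3.2 ppm plus uniform baseline noise). The data, seed, and variable names here are illustrative assumptions, not the original poster's data; only base R is used.

```r
# Toy spectra, assumed to mimic the question's setup (not the original data).
set.seed(1)
ppm    <- seq(3, 4, by = 0.01)
peak   <- dnorm(ppm, mean = 3.2, sd = 0.03)
height <- c(1, 0.9, 0.8, 0.5)                 # relative peak size per sample
X <- height %o% peak                          # 4 samples x 101 ppm features
X <- X + matrix(runif(length(X), 0, 0.01), nrow = nrow(X))
dimnames(X) <- list(paste0("sample", 1:4), as.character(ppm))

# Per-feature standard deviation before and after autoscaling
sd_raw    <- apply(X, 2, sd)
sd_scaled <- apply(scale(X, center = TRUE, scale = TRUE), 2, sd)

i_peak <- which.min(abs(ppm - 3.2))           # a feature on the analyte peak
i_base <- which.min(abs(ppm - 3.8))           # a baseline (noise-only) feature
print(sd_raw[c(i_peak, i_base)])              # peak feature varies far more than baseline
print(sd_scaled[c(i_peak, i_base)])           # after scaling, both have unit variance

# Share of total variance captured by PC1, without and with scaling
var_explained <- function(p) p$sdev^2 / sum(p$sdev^2)
pc1_raw    <- var_explained(prcomp(X, center = TRUE, scale. = FALSE))[1]
pc1_scaled <- var_explained(prcomp(X, center = TRUE, scale. = TRUE))[1]
print(c(raw = pc1_raw, scaled = pc1_scaled))
```

In this setup the unscaled PC1 should carry almost all of the variance, while the scaled PC1 has to share it with PCs that are built mostly from baseline noise.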

Unfortunately, there is nothing that forces noise-only features to be accidentally correlated only with other noise-only features: they may also be accidentally correlated with analyte features. In that case the model becomes noisy (unstable).
To be sure, this correlation is also present in the unscaled data, but there only a small amount of noise enters the PC/latent variable with the analyte, whereas after scaling the noise is about as large as the signal from a single analyte feature.
Fortunately, we do have the huge advantage in spectroscopy that we can check the loadings: if they look noisy (instead of looking like nice difference or maybe derivative spectra), they usually did pick up noise. I.e. we can see when/where our models get unstable.
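One rough way to do this loading check in code is sketched below, again on assumed toy spectra rather than real data. It looks at how much of the squared PC1 loading falls inside the region where the analyte peak is known to sit (here assumed to be 3.1 to 3.3 ppm). The window and all names are hypothetical choices for illustration, and a visual inspection of the loadings remains the more informative check.

```r
# Toy spectra, assumed to mimic the question's setup (not the original data).
set.seed(2)
ppm    <- seq(3, 4, by = 0.01)
height <- c(1, 0.9, 0.8, 0.5)
X <- height %o% dnorm(ppm, mean = 3.2, sd = 0.03)
X <- X + matrix(runif(length(X), 0, 0.01), nrow = nrow(X))

# PC1 loadings without and with scaling
load_raw    <- prcomp(X, center = TRUE, scale. = FALSE)$rotation[, 1]
load_scaled <- prcomp(X, center = TRUE, scale. = TRUE)$rotation[, 1]

# Assumed analyte region; loadings are unit-norm, so squared loadings sum to 1
in_peak      <- abs(ppm - 3.2) <= 0.1
share_raw    <- sum(load_raw[in_peak]^2)
share_scaled <- sum(load_scaled[in_peak]^2)
print(c(raw = share_raw, scaled = share_scaled))

# Visual check: a spectrum-like loading vs. a noisy one
if (interactive()) {
  matplot(ppm, cbind(load_raw, load_scaled), type = "l", lty = 1,
          xlab = "ppm", ylab = "PC1 loading")
}
```

A loading that looks like the peak concentrates its weight in the analyte region, whereas a loading that has picked up noise spreads a large part of its weight over the baseline.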

Together, the amplification of noise and the sheer number of noise-only features in relation to the few analyte (signal) features may explain your observation.

I'd expect PLS to be somewhat less disturbed by such scaling than PCA, as PLS looks for variance that is also correlated with the analyte signal, but in general I'd not expect any improvement either for the situation you describe.

If you want to scale your spectra, make sure you first exclude uninformative spectral regions. (But maybe you should do this anyway, as this dimensionality reduction will help the modeling with small sample sizes.)
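As a minimal sketch of that advice, assuming toy spectra like those in the question and assuming the informative region is known in advance to be 3.1 to 3.3 ppm (a hypothetical window chosen for this simulation), the scaled PCA can be run on the full range and on the restricted range and compared:

```r
# Toy spectra, assumed to mimic the question's setup (not the original data).
set.seed(3)
ppm    <- seq(3, 4, by = 0.01)
height <- c(1, 0.9, 0.8, 0.5)
X <- height %o% dnorm(ppm, mean = 3.2, sd = 0.03)
X <- X + matrix(runif(length(X), 0, 0.01), nrow = nrow(X))

# Keep only the region assumed to contain analyte signal
keep <- abs(ppm - 3.2) <= 0.1

var_explained <- function(p) p$sdev^2 / sum(p$sdev^2)
pc1_full   <- var_explained(prcomp(X,          center = TRUE, scale. = TRUE))[1]
pc1_window <- var_explained(prcomp(X[, keep],  center = TRUE, scale. = TRUE))[1]
print(c(full_range = pc1_full, peak_window = pc1_window))
```

With the baseline features removed, scaling has far fewer noise-only columns to inflate, so PC1 should again capture most of the variance. For real spectra the regions to keep would come from knowledge of the chemistry or from inspecting the spectra, not from a fixed window like this one.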
