I'm working with NMR spectra (it's a common chemical test).
Each spectrum has signal peaks at various positions across a range of ppm values. I'm trying to relate the NMR spectra of various samples to a measured property of the samples.
PCA and PLS seem like valuable tools. I'll just talk about PCA for this example, but my question also applies to PLS.
The data that I send to PCA has this structure: each column is a ppm value, each row is a sample, and the value is the sample's signal at that ppm.
For this made-up data, it seems like there should be one dominant principal component that describes the size of the peak. This is indeed the case if I don't scale the columns. But if I do scale the columns (which is what I've always done with PCA), the scree plot makes no sense to me. Why are so many principal components needed to describe the variance?
Is this a case where I shouldn't scale the columns? What about when I apply this to real data, where there are multiple peaks of widely varying heights? Is there a "best practice" for doing PCA and PLS on spectral data? (I don't have a background in biostatistics or chemometrics.)
The matrix for PCA

                      3         3.01         3.02        3.03        3.04 ...
    sample1 0.002798792 0.0031477185 0.0090656919 0.008150071 0.005397167 ...
    sample2 0.009681631 0.0006409005 0.0003475848 0.009898928 0.003882373 ...
    sample3 0.001149334 0.0020842757 0.0057974388 0.008381035 0.004349227 ...
    sample4 0.009143676 0.0003381718 0.0078882525 0.009483161 0.007021517 ...

R code

    library(dplyr)
    library(reshape2)
    library(ggplot2)
    library(factoextra)

    # make up some spectral data
    D <- tibble(ppm = seq(3, 4, 0.01),
                sample1 = dnorm(ppm, 3.2, 0.03),
                sample2 = 0.9 * sample1,
                sample3 = 0.8 * sample1,
                sample4 = 0.5 * sample1)
    Noise <- runif(nrow(D) * (ncol(D) - 1), 0, 0.01)
    D[, -1] <- D[, -1] + Noise  # add a little noise so things aren't perfectly correlated

    # plot the spectra
    Dm <- melt(D, id.var = "ppm", value.name = "signal", variable.name = "sample")  # long format
    Dm %>%
      ggplot() +
      aes(ppm, signal, color = sample) +
      geom_line() +
      geom_point()

    # Prepare data for PCA
    Dc <- dcast(Dm, sample ~ ppm, value.var = "signal")  # each ppm's signal is a variable
    Dmat <- as.matrix(select(Dc, -c(sample)))  # drop the sample column; only want to work on the signals
    rownames(Dmat) <- Dc$sample

    # PCA with scaling
    Ps <- prcomp(Dmat, scale. = TRUE, center = TRUE)
    fviz_screeplot(Ps) + ggtitle("Scree plot with scaling")

    # PCA without scaling
    P <- prcomp(Dmat, scale. = FALSE, center = TRUE)
    fviz_screeplot(P) + ggtitle("Scree plot without scaling")

