$\begingroup$

I recently conducted a Principal Component Analysis (PCA) on a dataset with a four-category target variable. While the PCA score plot revealed excellent separation for one group, the remaining three categories exhibited poor differentiation.

For classification, I used the principal components as features in Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA), and achieved very good accuracy with both models. I am hoping to gain some insights from the community regarding this observation. Can someone explain this apparent contradiction? Why did the PCA plot not show clear separation between the three classes, yet the classification algorithms performed well when using the principal components as features?

Thank you for your time and consideration.

$\endgroup$
  • $\begingroup$ What did you plot in the PCA score plot? $\endgroup$ Commented Jan 28 at 10:53
  • $\begingroup$ The PCA score plot I visualized shows the data projected onto the first two principal components (PC1 and PC2). $\endgroup$ Commented Jan 28 at 11:48
  • $\begingroup$ PCA does not have a target variable. The PCA score plot (as described in your comment) is not intended to find separation on some other variable. Also, your score plot showed only two PCs; you didn't tell us how many PCs you used as features, but if it was more than 2, that isn't captured in the plot. $\endgroup$ Commented Jan 28 at 12:02
  • $\begingroup$ You can, of course, use the PCs to do classification (as you did). $\endgroup$ Commented Jan 28 at 12:03
  • $\begingroup$ How many PCs did you use as features? All of them or just the first two from your visualization? Also, how do you assess the accuracy to be high? Do you, for instance, calculate on some holdout data? If so, how do you calculate the features for the holdout data? $\endgroup$ Commented Jan 28 at 12:14

1 Answer

$\begingroup$

Looking at just the first two PCs discards information present in the features that may relate to the outcome. After all, PCA does not consider the outcome variable.

Below, I give a simulation where the outcome depends on the fourth and fifth principal components. A predictive model that uses all five principal components should be able to perform well in such a situation.

library(ggplot2)
set.seed(2025)

# Build five correlated features
#
N <- 1000
p <- 5
X <- matrix(NA, N, p)
X[, 1] <- rnorm(N)
for (i in 2:p){
  X[, i] <- X[, i - 1] + rnorm(N, 0, 1)
}

# Run PCA and extract the transformed variables
#
pca <- princomp(X)
X_pca <- pca$scores

# Simulate an outcome variable (y) that depends on the last two PCs
#
z <- 5*X_pca[, p] - 5*X_pca[, p - 1]
pr <- 1/(1 + exp(-z))
y <- rbinom(N, 1, pr)

# Data frame for plotting PCs, colored by group membership
#
d <- data.frame(
  y = as.factor(y),
  PC_1 = X_pca[, 1],
  PC_2 = X_pca[, 2],
  PC_3 = X_pca[, 3],
  PC_4 = X_pca[, 4],
  PC_5 = X_pca[, 5]
)

# Plot the variances
#
screeplot(pca)

# Plot the first two PCs, colored by group
# Notice how little separation there is between the groups, despite these
# PCs accounting for so much of the total variance in the original features
#
ggplot(d, aes(x = PC_1, y = PC_2, col = y)) +
  geom_point() +
  theme(legend.position = "bottom")

# Plot the last two PCs, colored by group
# Notice how much separation there is between the groups, despite these
# PCs accounting for so little of the total variance in the original features
#
ggplot(d, aes(x = PC_4, y = PC_5, col = y)) +
  geom_point() +
  theme(legend.position = "bottom")

The first two PCs definitely account for much of the total variance.

[Figure: screeplot of the principal component variances]

The first two PCs do not relate to the outcome, however, as the lack of separation shows.

[Figure: first two PCs, colored by group]

However, there is great separation between the two categories on the last two PCs.

[Figure: last two PCs, colored by group]

That there is such great separation on the last two PCs speaks to how well the features are separated in the original $5$-dimensional space. A model trained on these original features should be able to pick up on that separation and make predictions with high performance metrics.
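To make this concrete, here is a quick sketch that reuses the simulation above, with a plain `glm` logistic regression standing in for the SVM/LDA from the question (the `fit_2`/`fit_all` names and the 0.5 decision threshold are my own choices, not anything from the original post). A model restricted to the first two PCs hovers near chance, while one given all five PCs recovers the signal:

```r
set.seed(2025)

# Recreate the simulated data from the answer above
#
N <- 1000
p <- 5
X <- matrix(NA, N, p)
X[, 1] <- rnorm(N)
for (i in 2:p) X[, i] <- X[, i - 1] + rnorm(N, 0, 1)
pca <- princomp(X)
X_pca <- pca$scores   # columns named Comp.1 ... Comp.5
z <- 5*X_pca[, p] - 5*X_pca[, p - 1]
y <- rbinom(N, 1, 1/(1 + exp(-z)))

# Logistic regression on only the first two PCs vs. on all five
#
d <- data.frame(y = y, X_pca)
fit_2   <- glm(y ~ Comp.1 + Comp.2, data = d, family = binomial)
fit_all <- glm(y ~ ., data = d, family = binomial)

# In-sample accuracy at a 0.5 threshold
#
acc <- function(fit) mean((predict(fit, type = "response") > 0.5) == d$y)
acc(fit_2)    # near chance: PC1/PC2 carry no outcome information
acc(fit_all)  # high: PC4/PC5 carry the signal
```

The same logic explains the question's result: the classifiers were handed PCs beyond the two shown in the plot, so they could use directions that the two-dimensional score plot never displays.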

$\endgroup$
  • $\begingroup$ Dave, I really appreciate your help! Your example was incredibly useful and helped me understand the concepts. I adapted it for classification and achieved high accuracy, around 85% with SVM. However, in my dataset I'm seeing accuracy as high as 99.5% on my test set in some cases, which made me a bit suspicious at first. $\endgroup$ Commented Jan 29 at 8:18
  • $\begingroup$ @MamadFasih I’m glad my example was so helpful! The next piece of advice I have is that accuracy is a worse measure of performance than it first seems to be (even for balanced classes). My profile has a few links about this. Perhaps check them out when you want to dive down that rabbit hole. $\endgroup$ Commented Jan 29 at 10:51
  • $\begingroup$ Thanks so much for the helpful example and the advice! I'll definitely check out the links you've shared. I have already checked some of those links, by the way. $\endgroup$ Commented Jan 29 at 15:28
