FAMD on large mixed dataset: low explained variance, still worth using?

Question

I'm working with a large tabular dataset (~1.2 million rows) that includes 7 qualitative features and 3 quantitative ones. For dimensionality reduction, I'm using FAMD (Factor Analysis of Mixed Data) in R, with the FactoMineR and factoextra packages.

I've tried several encoding strategies and grouped categories to reduce sparsity, but the best I can get is 4.5% variance explained by the first component, and 2.5% by the second.

My main goal is to use the 2D representation for distance-based analysis (e.g., clustering, similarity), though it would be great if it could also support some modeling.

Has anyone here used FAMD in a similar context? Is it normal to get such low explained variance with mixed data? Would you still proceed with it, or consider other approaches?

Thanks!

jarbet · Accepted Answer · 2025-09-05 17:42:45Z

Low % variance explained means there is low correlation between your features, i.e. your features don't share much information. You can investigate this yourself by checking the association between each pair of features, using a standardized effect size so that the values are comparable across your mixed feature types. My guess is that your features are only weakly associated with each other.
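A minimal base-R sketch of such a check, assuming Cramér's V for factor/factor pairs, the correlation ratio (eta) for numeric/factor pairs, and absolute Pearson correlation for numeric/numeric pairs. The helper names (`cramers_v`, `eta`, `assoc_matrix`) and the toy data are hypothetical, not from any package or from your dataset:

```r
# Cramér's V between two categorical vectors (0 = independent, 1 = perfect association)
cramers_v <- function(x, y) {
  tab  <- table(factor(x), factor(y))          # factor() drops unused levels
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(nrow(tab), ncol(tab)) - 1
  sqrt(as.numeric(chi2) / (sum(tab) * k))
}

# Correlation ratio: sqrt(between-group sum of squares / total sum of squares)
eta <- function(num, fac) {
  group_means <- ave(num, fac)                 # each value replaced by its group mean
  sqrt(sum((group_means - mean(num))^2) / sum((num - mean(num))^2))
}

# Pick the effect size according to the types of the two columns
assoc <- function(a, b) {
  if (is.numeric(a) && is.numeric(b)) return(abs(cor(a, b)))
  if (is.numeric(a)) return(eta(a, b))
  if (is.numeric(b)) return(eta(b, a))
  cramers_v(a, b)
}

# Symmetric matrix of pairwise associations, all on a 0..1 scale
assoc_matrix <- function(df) {
  p <- ncol(df)
  m <- diag(1, p)
  dimnames(m) <- list(names(df), names(df))
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      m[i, j] <- m[j, i] <- assoc(df[[i]], df[[j]])
    }
  }
  m
}

# Toy data (made up): u depends on g, everything else is independent
set.seed(42)
n <- 2000
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
toy <- data.frame(
  g = g,
  h = factor(sample(c("x", "y"), n, replace = TRUE)),
  u = rnorm(n) + 2 * (g == "a"),
  v = rnorm(n)
)
A <- assoc_matrix(toy)
print(round(A, 2))
```

Each pair costs one pass over the rows, so this should remain feasible on the full 1.2 million rows. If most off-diagonal entries are near zero, a low % variance per component is what you would expect.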

In any case, if your goal is to make a univariate summary measure of your features (the first component or the second component), then FAMD/PCA are still a great approach, because each component is the linear combination of the features that captures the most remaining variance.
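To put the percentages in context, here is a rough base-R sketch of the FAMD construction as I understand it: standardized numeric columns plus indicator columns divided by the square root of each level's proportion, followed by an ordinary PCA. With this scaling the total inertia is (number of numeric features) + sum over categorical features of (levels - 1), so for fully independent features each component explains roughly 100 / total inertia percent. The 5 levels per categorical feature and the simulated data are assumptions; the question does not give the level counts.

```r
set.seed(1)
n     <- 5000
n_num <- 3   # numeric features, as in the question
n_cat <- 7   # categorical features, as in the question
n_lev <- 5   # ASSUMPTION: levels per categorical feature

# Standardize with the population (1/n) standard deviation
pop_scale <- function(x) {
  x <- x - mean(x)
  x / sqrt(mean(x^2))
}
num_block <- sapply(seq_len(n_num), function(j) pop_scale(rnorm(n)))

# Indicator columns scaled by 1/sqrt(level proportion), then centered
indicator_block <- function(f) {
  G <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  p <- colMeans(G)
  W <- sweep(G, 2, sqrt(p), "/")
  sweep(W, 2, colMeans(W), "-")
}
cat_block <- do.call(cbind, lapply(seq_len(n_cat), function(j) {
  indicator_block(factor(sample(letters[seq_len(n_lev)], n, replace = TRUE)))
}))

# PCA of the combined matrix: eigenvalues are squared singular values / n
X   <- cbind(num_block, cat_block)
eig <- svd(X, nu = 0, nv = 0)$d^2 / n
pct <- 100 * eig / sum(eig)

total_inertia <- n_num + n_cat * (n_lev - 1)
print(sum(eig))              # matches total_inertia up to rounding error
print(100 / total_inertia)   # per-component share if all features were independent
print(round(pct[1:2], 2))    # slightly above that share, due to sampling noise only
```

Under these assumed level counts the independence baseline is already about 3%, so a first component at 4.5% would indicate only modest shared structure. With more levels per feature the baseline drops further.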

Lastly, I don't expect this to increase the % variance explained much, but it could be worth trying the PCAmix method for mixed data (the PCAmixdata package in R) too.

EDIT: since you want to do clustering, why not try a clustering method that supports mixed types of features? You only have 7 features so this should be pretty easy to do.

Hi @jarbet, I tried PCAmixdata too; the results were marginally better on the first component but not on the second.

> since you want to do clustering, why not try a clustering method that supports mixed types of features? You only have 7 features so this should be pretty easy to do.

What methods do you advise? I am also interested in computing similarity, but the data is multidimensional, which is one of the reasons I intended to use FAMD to get a 2D representation.
– Duarte Silva, Commented Sep 5 at 20:21
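
One possible option, as a sketch only: Gower dissimilarity handles numeric and categorical columns together, and partitioning around medoids (PAM) can cluster directly on it, so no 2D reduction is needed to get a similarity. The example below uses `daisy()` and `pam()` from the `cluster` package on made-up toy data. Because the full pairwise matrix grows with the square of the row count, it works on a random subsample; the sample size and `k = 3` are arbitrary assumptions. k-prototypes is another technique in the same spirit that avoids the pairwise matrix.

```r
library(cluster)  # daisy() for Gower dissimilarity, pam() for k-medoids

# Toy mixed data (made up): 'colour' and 'x' follow a hidden 3-group structure
set.seed(7)
n   <- 600
grp <- sample(1:3, n, replace = TRUE)
toy <- data.frame(
  colour = factor(c("red", "green", "blue")[grp]),
  size   = factor(sample(c("S", "M", "L"), n, replace = TRUE)),
  x      = rnorm(n, mean = 3 * grp),
  y      = rnorm(n)
)

# Work on a subsample: a full n x n matrix is not feasible for millions of rows
idx <- sample(nrow(toy), 300)
d   <- daisy(toy[idx, ], metric = "gower")   # pairwise dissimilarities in [0, 1]

# Cluster on the dissimilarities; k = 3 is an assumption for this toy example
fit <- pam(d, k = 3, diss = TRUE)
print(table(fit$clustering))
print(fit$silinfo$avg.width)                 # average silhouette width, to compare values of k

# Similarity between any two sampled rows is 1 - dissimilarity
sim <- 1 - as.matrix(d)
print(round(sim[1:3, 1:3], 2))
```

Rows outside the subsample could then be assigned to the nearest medoid, and the average silhouette width gives a rough way to compare different values of `k`.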
