5
$\begingroup$

My question is about using UMAP as a dimensionality reduction technique before HDBSCAN clustering. I have a dataset of ~5000 observations, each with ~20 descriptors. According to the HDBSCAN guidelines, HDBSCAN clusters decently on fewer than 50 dimensions, so in theory I should be able to cluster directly on the raw data. However, when I do so, HDBSCAN classifies about a quarter of the observations as noise, and the density-based cluster validation (DBCV) implementation in the HDBSCAN API gives a validity index of ~0.2.

In contrast, clustering on a 2-dimensional UMAP embedding yields a validity index of ~0.8, which is much better. Additionally, all observations are clustered (i.e., there is no noise). I've experimented with UMAP embeddings into intermediate dimensions (e.g., 10 or 5 dimensions) and with different HDBSCAN hyperparameters, but clustering on 2 dimensions seems to yield a slightly better DBCV score than anything else.

It just seems a little suspicious to me that there is absolutely no noise when clustering on the 2D UMAP embedding, although the clustering does pass the eye test. Additionally, there is a consistent trend: the lower the minimum cluster size hyperparameter for HDBSCAN (e.g., min_clust_size < 40), the higher the DBCV score. This sort of makes sense, given that smaller minimum cluster sizes allow for more precise clustering, but for data exploration it seems undesirable to end up with 20 different clusters of only 10 members each. Does UMAP artificially eliminate noise? Is there a better metric than DBCV to evaluate cluster quality?

$\endgroup$
1
  • $\begingroup$ 0. Welcome to CV.SE. 1. Good question (+1); your suspicions are fully reasonable. 2. As you recognise, passing an eye test can be helpful, but it does not let us assess "cluster quality" in a meaningful way; please see my answer below, where I expand on this. $\endgroup$ Commented Jun 25, 2023 at 0:07

1 Answer

4
$\begingroup$

The uncomfortable truth is that any "validity index" for clustering is mostly pointless if it does not relate to anything that is actually actionable. Based on our understanding of the data, we need to pragmatically validate our clustering results: do they tell us anything new, or even half-reasonable? CV.SE has some great threads on the matter; e.g., see How to select a clustering method? How to validate a cluster solution (to warrant the method choice)? and Can any dataset be clustered or does there need to be some sort of pattern in the data? for starters.

In general, clustering is used first and foremost for structure discovery, for uncovering hidden patterns. In the particular use case described here, if we cluster a 2D embedding from some dimensionality reduction algorithm (UMAP, t-SNE, LLE, whatever) but have no way to check whether the clustering translates to anything meaningful in the original data domain, then irrespective of what our favourite metric of "clustering validity" suggests, that clustering is pretty useless for EDA purposes. (It might still be good for predictive purposes, but that's not our main point here.) In that sense, DBCV just says that we have clear clusters in terms of their density and shape properties. Whether those clusters reflect anything non-trivial, or just noise, is for the analyst to decide.
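One pragmatic way to do that check is to go back to the original feature space and profile each cluster there. The sketch below assumes a descriptor matrix `X` and HDBSCAN labels; both are replaced here by random stand-ins purely for illustration:

```python
# Profile clusters in the ORIGINAL descriptor space: if clusters found
# in a low-dimensional embedding are meaningful, they should differ on
# at least some of the raw variables, not only in the 2-D picture.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))           # stand-in for raw descriptors
labels = rng.integers(0, 3, size=200)   # stand-in for HDBSCAN labels

df = pd.DataFrame(X, columns=[f"descr_{i}" for i in range(4)])
df["cluster"] = labels

# Per-cluster means of the raw descriptors; domain knowledge then
# decides whether the differences are interesting or trivial.
profile = df.groupby("cluster").mean()
```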

The above being said, and focusing again on the use case described: yes, UMAP does to some extent "artificially eliminate noise". As it tries to retain as much structure as possible, lowering the number of embedding dimensions means that only the strongest signal is retained, as it is the most easily captured. Weaker signal/structure is lost, and whether that weaker structure reflects real information or just background noise is unclear. Notice that almost every dimensionality reduction algorithm has to do this! So in that sense, the fact that we observe better DBCV scores in lower dimensions is expected.

So, all in all, there is no better metric for clustering usefulness. Most of these metrics can tell us whether "a clustering is there", not whether "a reasonable clustering is there". Combining a clustering procedure (e.g. HDBSCAN) with a dimensionality reduction technique (e.g. UMAP) that actively tries to pack points together can give us interesting results, but we need to think about the results' usefulness ourselves, as no domain-agnostic metric (DBCV, Silhouette Width, etc.) will ever provide that.

$\endgroup$
2
  • $\begingroup$ Thanks for the in-depth answer. I understand that using a metric such as DBCV is not sufficient to determine whether a clustering makes sense or not. I've seen it often repeated on other CVSE threads that ultimately what matters is if the clustering "makes sense" or not. However, even if I come up with a reasoning for the clustering which sounds convincing to me, I'm not sure how I could justify it (in a paper, for example) without relying on some objective metric. There's no ground truth to compare my clusterings to, so sometimes it feels like I'm just seeing shapes in clouds. $\endgroup$ Commented Feb 8, 2024 at 1:41
  • 1
    $\begingroup$ Yes, that is correct. Indeed, clustering "sometimes (...) feels like (we are) just seeing shapes in clouds." That's why it is unsupervised rather than supervised learning. Without getting too philosophical, clustering is "good" when it can be used further. If we just create a clustering and say "this is a nicely packed cluster" but derive no further conclusion from it... eh, who cares? Particularly for the paper example you mentioned: if a clustering allows us to locate previously unknown subgroups with some characteristic, that's good; if it is just a nice plot... why publish it? $\endgroup$ Commented Feb 8, 2024 at 1:49
