My question is about using UMAP for dimensionality reduction before HDBSCAN clustering. I have a dataset of ~5000 observations, each with ~20 descriptors. According to the HDBSCAN guidelines, HDBSCAN clusters reasonably well in fewer than 50 dimensions, so in theory I should be able to cluster directly on the raw data. However, when I do so, HDBSCAN classifies about a quarter of the observations as noise, and the density-based clustering validation (DBCV) implementation in the HDBSCAN API gives a validity index of ~0.2.
In contrast, clustering on a 2-dimensional UMAP embedding yields a validity index of ~0.8, which is much better. Additionally, all observations are assigned to a cluster (i.e., there is no noise). I've experimented with different HDBSCAN hyperparameters and with UMAP embeddings into intermediate dimensions, such as 10 or 5, but clustering in 2 dimensions seems to yield a slightly better DBCV score than anything else.
It seems a little suspicious to me that there is absolutely no noise when clustering on the 2D UMAP embedding, although the clustering does pass the eye test. Another consistent trend is that the lower HDBSCAN's minimum cluster size hyperparameter (e.g., min_cluster_size < 40), the higher the DBCV score. That sort of makes sense, given that smaller minimum cluster sizes allow for finer-grained clustering, but having 20 different clusters with only 10 members each seems undesirable for data exploration. Does UMAP artificially eliminate noise? Is there a better metric than DBCV for evaluating cluster quality?
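To make the trade-off between DBCV and cluster granularity visible, a small helper along these lines can be run on each label vector alongside the score. It is only a sketch: the name `summarize_labels` and the `small_cluster_threshold` parameter are made up for illustration, and it relies only on HDBSCAN's convention of labeling noise as -1.

```python
from collections import Counter


def summarize_labels(labels, small_cluster_threshold=10):
    """Summarize a clustering: noise fraction, cluster count, and cluster sizes.

    `labels` is any sequence of integer cluster labels where -1 means noise.
    """
    labels = [int(label) for label in labels]
    n_points = len(labels)

    # Count members per cluster, ignoring noise points.
    sizes = Counter(label for label in labels if label != -1)
    n_noise = n_points - sum(sizes.values())

    return {
        "n_points": n_points,
        "n_clusters": len(sizes),
        "noise_fraction": n_noise / n_points if n_points else 0.0,
        "sizes": sorted(sizes.values(), reverse=True),
        # Clusters at or below the threshold are the ones that are hard
        # to interpret during data exploration.
        "n_small_clusters": sum(
            1 for size in sizes.values() if size <= small_cluster_threshold
        ),
    }
```

Reporting `n_clusters`, `n_small_clusters`, and `noise_fraction` next to the DBCV score for each min_cluster_size shows whether a higher score is coming from many tiny clusters.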