I have a soil microbiome study in which we obtained two profiles (200 samples x 100,000 features and the same 200 samples x 5000 features, respectively), as well as a number of measured soil parameters. We cannot, in principle, reduce the feature sets any further, so we need to work with very high-dimensional tables.
One of the analyses we want to do is a comparison of the two profiles, to see whether one of them is better than the other at detecting differences between samples.
With other analyses we want to answer this question: if our microbiome abundance profiles separate the samples into different groups, do any of these groups contain samples that follow a specific pattern of environmental parameters? Are these samples, for instance, acid + arid + low-phosphate soils? Or are the microbiome groups unrelated to any of our measurements? In other words, it is a kind of multivariate association analysis.
We have tested different methodologies, but now we are trying an approach based on UMAP + HDBSCAN. In summary, I computed UMAP embeddings for many hyperparameter combinations, each with 100 different seeds, and ran HDBSCAN on every resulting embedding. I then calculated the Davies-Bouldin (DB) index for each resulting clustering and, for each microbial profile, picked the clustering with the lowest value, to maximize cluster separation and homogeneity.
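To make the selection criterion concrete, here is a minimal pure-Python sketch of the Davies-Bouldin index (not our actual pipeline). It assumes Euclidean distances on the embedding coordinates and that HDBSCAN noise points (label -1) are dropped before scoring; the function name `davies_bouldin` and the toy coordinates are illustrative only. Note that, under this assumption, a run that marks many samples as noise is scored only on the points it kept.

```python
import math


def davies_bouldin(points, labels, noise_label=-1):
    """Davies-Bouldin index (lower = more compact, better separated clusters)."""
    # Group points by cluster label, skipping noise (assumption: label -1).
    clusters = {}
    for point, label in zip(points, labels):
        if label != noise_label:
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Per cluster: centroid and mean distance of its members to that centroid.
    centroids, scatters = [], []
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        scatter = sum(math.dist(m, centroid) for m in members) / len(members)
        centroids.append(centroid)
        scatters.append(scatter)

    # For each cluster, take its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    k = len(centroids)
    total = 0.0
    for i in range(k):
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy 2-D "embeddings": same cluster shapes, different separation.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
tight = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
loose = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3), (3, 4), (4, 3), (4, 4)]
db_tight = davies_bouldin(tight, labels)
db_loose = davies_bouldin(loose, labels)
print(db_tight < db_loose)  # the better-separated layout scores lower
```

Because the index is computed on the embedding coordinates, its value depends on how a given UMAP run happened to place the points, not only on the underlying data.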
The problem is that I realized that the UMAP coordinates vary greatly depending on (a) the input subcomposition, (b) the hyperparameter combination, and (c) the seed used, which changes the final clustering considerably. So I am wondering whether we are using UMAP correctly:
- Is it important to select the clustering with the best separation (i.e., the lowest DB index) for each microbial profile, or am I making a selection just as arbitrary as sticking to a specific hyperparameter combination from the beginning?
- Would you recommend other ways of selecting the best clustering for each profile, such as combining the DB index with the number of clusters generated, or would that be equally arbitrary?
- Are there more appropriate methods for what we want to do, or is this fine? We saw that PCA is not a good idea because of the huge number of features we have.
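One way to put a number on the seed-to-seed variability described above is to compare the cluster labels of two runs with the adjusted Rand index (ARI), which is 1 for identical partitions and close to 0 for unrelated ones. This is a swapped-in illustration, not something we have run; the function names and the toy label vectors are hypothetical, and HDBSCAN noise (label -1) is simply treated as one more cluster here.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs of samples grouped together in both runs, in run A, and in run B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both runs give a single cluster
    return (together_both - expected) / (maximum - expected)


def mean_pairwise_ari(runs):
    """Average ARI over all pairs of runs (e.g. one labeling per seed)."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Toy labelings of six samples from three hypothetical seeds.
seed_1 = [0, 0, 0, 1, 1, 1]
seed_2 = [1, 1, 1, 0, 0, 0]  # same grouping, cluster names swapped
seed_3 = [0, 1, 0, 1, 0, 1]  # a different grouping
ari_same = adjusted_rand_index(seed_1, seed_2)
ari_diff = adjusted_rand_index(seed_1, seed_3)
stability = mean_pairwise_ari([seed_1, seed_2, seed_3])
print(ari_same > ari_diff)
```

The ARI ignores how clusters are named, so two seeds that recover the same groups under different label numbers still count as agreeing.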
I have tried to explain the situation as well as I could, and I know that this question might be difficult to answer. I am just looking for some advice, since this methodology is fairly new to me. Although I have spent a lot of time researching this, including other related questions on this forum, I did not find a useful answer.