$\begingroup$

I have a soil microbiome study in which we obtained two profiles (200 samples x 100,000 features, and the same 200 samples x 5,000 features), along with a number of measured soil parameters. We cannot, in principle, subset the features any further, so we need to work with very high-dimensional tables.

One of the analyses we want to do is a comparison of the two profiles, to see whether one of them is better than the other at detecting differences between samples.
With other analyses we want to answer this question: if our microbiome abundance profiles separate the samples into different groups, do any of these groups contain samples that follow a specific pattern of environmental parameters? Are these samples, for instance, acid + arid + low-phosphate soils? Or are the microbiome groups unrelated to any of our measurements? So it is a kind of multivariate association analysis.

We have tested different methodologies, and we are now trying an approach based on UMAP + HDBSCAN. In summary, I computed UMAP embeddings for many hyperparameter combinations, each with 100 different seeds, and ran HDBSCAN on every resulting embedding. Then I calculated the Davies-Bouldin (DB) index for each resulting clustering and, for each microbial profile, picked the clustering with the lowest value, to maximize cluster separation and homogeneity.
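For readers unfamiliar with the selection criterion, here is a minimal sketch of how the Davies-Bouldin index can be computed with NumPy. The function name `davies_bouldin` and the demo data are hypothetical, and dropping HDBSCAN noise points (label `-1`) before scoring is an assumption of this sketch, not something stated in the question.

```python
import numpy as np


def davies_bouldin(points, labels):
    """Davies-Bouldin index: for each cluster, take the worst ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j) over all
    other clusters j, then average over clusters. Lower is better."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)

    # Assumption: HDBSCAN-style noise points (label -1) are excluded.
    keep = labels != -1
    points, labels = points[keep], labels[keep]

    cluster_ids = np.unique(labels)
    if len(cluster_ids) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids = np.array([points[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: mean Euclidean distance of a cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(points[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])

    # Pairwise centroid distances; an infinite diagonal makes the i == j
    # ratio zero, so a cluster is never compared with itself.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    ratios = (scatter[:, None] + scatter[None, :]) / dist
    return ratios.max(axis=1).mean()


# Two tight clusters plus one noise point that is ignored.
demo_points = [[0, 0], [0, 2], [10, 0], [10, 2], [100, 100]]
demo_labels = [0, 0, 1, 1, -1]
demo_db = davies_bouldin(demo_points, demo_labels)  # 0.2 = (1 + 1) / 10
```

Note that the index is evaluated in whatever coordinates are passed in. If those are UMAP coordinates, the score depends on the same hyperparameters and seed that produced the embedding.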

The problem is that I realized the UMAP coordinates vary a lot depending on (a) the input subcomposition, (b) the hyperparameter combination, and (c) the seed used, which in turn changes the final clustering considerably. So I am wondering whether we are using UMAP correctly:

  • Is it important to select the best cluster separation (i.e., the lowest DB index) for each microbial profile, or is that selection just as arbitrary as sticking to one hyperparameter combination from the beginning?
  • Would you recommend other ways of selecting the best clustering for each profile, such as combining the DB index with the number of clusters generated, or would that be equally arbitrary?
  • Are there more appropriate methods for what we want to do, or is this approach fine? We concluded that PCA is not a good idea because of the huge number of features we have.

I have tried to explain the situation as well as I can, and I know this question might be difficult to answer. I am just looking for some advice, since this methodology is fairly new to me. Although I have spent a lot of time researching this, including other related questions on this forum, I have not found a useful answer.

$\endgroup$
  • $\begingroup$ Welcome to Cross Validated! I wonder if you might be better off starting with the associations between features and the "phenotypic parameters." Presumably the sample phenotypes are what matter the most, and it seems that you are doing the clustering (with the potential difficulties you rightly note) as a step in that direction. Focus on the associations with phenotypes would also make comparison of the usefulness of the two profiles more straightforward. $\endgroup$ Commented Oct 8 at 17:11
  • $\begingroup$ Thank you! I need to explain better what we are trying to see at the phenotype level; I will edit the question to reflect this. The question we want to answer is this: if my microbiome abundance profile separates my samples into different groups, do any of these groups contain samples that follow a specific pattern of environmental parameters? Are these samples, for instance, acid + arid + low-phosphate soils? Or are the microbiome groups unrelated to any of our measurements? So the problem is that I do not see how to separate this from the clustering analysis. $\endgroup$ Commented Oct 8 at 17:55

1 Answer

$\begingroup$

One of the analyses that we want to do is a sort of comparison between both profiles, to see if one of them could be better at detecting differences between samples than the other.

You don't need to perform clustering for that. Clustering can be valuable for many purposes, but if your goal is to find features that distinguish samples then you should look for features that combine low measurement variance with high variance among samples. One problem with UMAP or t-SNE is that the visual distances between clusters don't represent the true distances between clusters that you would need to evaluate differences between (clustered) samples. See this similar question, its answer, and the links.
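As a rough illustration of ranking features by "low measurement variance with high variance among samples," here is a minimal NumPy sketch. It assumes that each sample was measured in at least two technical replicates, which the question does not state, and that the abundances have already been transformed appropriately. The function name `variance_ratio`, the array layout, and the demo data are all hypothetical.

```python
import numpy as np


def variance_ratio(replicated):
    """Per-feature ratio of among-sample variance to measurement variance.

    replicated: array of shape (n_samples, n_replicates, n_features).
    Assumption: replicate-to-replicate spread within a sample is a usable
    estimate of measurement variance.
    """
    data = np.asarray(replicated, dtype=float)
    if data.ndim != 3 or data.shape[1] < 2:
        raise ValueError("expected (n_samples, n_replicates >= 2, n_features)")

    # Measurement variance: replicate variance within each sample,
    # averaged over samples.
    within = data.var(axis=1, ddof=1).mean(axis=0)
    # Among-sample variance: variance of the per-sample means.
    among = data.mean(axis=1).var(axis=0, ddof=1)

    # A feature with zero measurement variance gives inf (or nan for 0/0).
    with np.errstate(divide="ignore", invalid="ignore"):
        return among / within


# 3 samples x 2 replicates x 2 features.
# Feature 0 differs strongly between samples and replicates agree;
# feature 1 has noisy replicates and identical sample means.
demo = np.array([
    [[1.0, 4.0], [1.2, 6.0]],
    [[5.0, 5.0], [5.2, 5.0]],
    [[9.0, 6.0], [9.2, 4.0]],
])
demo_ratio = variance_ratio(demo)
demo_ranking = np.argsort(-demo_ratio)  # feature indices, most informative first
```

One possible use is to compute this ratio for every feature in both profiles and compare the resulting distributions, which avoids any dependence on UMAP coordinates or seeds.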

... we are willing to answer this question: if our microbiome abundance profiles are separating the samples in different groups, does any of these groups contain samples that follow a specific pattern of environmental parameters?

There might be better ways to answer this question than by clustering on the features in the two profiles.

If you want to identify samples that follow a "specific pattern of environmental parameters," you might instead cluster samples based on the "environmental parameters." Maybe even better, you could take the approach used for decades in transcriptomic analysis: use the "environmental parameters" as predictors in a regression model of the (admittedly high-dimensional) microbiome features. That will identify features within each profile that are most strongly associated with those "environmental parameters." The classic limma package in Bioconductor could probably be repurposed to that end.
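To make the regression idea concrete, here is a minimal sketch that fits an ordinary least-squares model to every feature at once, with the environmental parameters as predictors. This is plain OLS written with NumPy, not limma: limma additionally moderates the per-feature variances with an empirical Bayes step, which this sketch omits. The function name `per_feature_ols` and the simulated data are hypothetical, and the features are assumed to be already transformed to a scale where a linear model is reasonable.

```python
import numpy as np


def per_feature_ols(env, features):
    """Regress each feature column on the environmental parameters.

    env:      (n_samples, n_params) predictor matrix.
    features: (n_samples, n_features) response matrix.
    Returns (coef, t), each of shape (n_params, n_features); the intercept
    row is dropped.
    """
    env = np.asarray(env, dtype=float)
    features = np.asarray(features, dtype=float)
    n = env.shape[0]
    design = np.column_stack([np.ones(n), env])  # add an intercept column
    p = design.shape[1]
    if n <= p:
        raise ValueError("need more samples than model coefficients")

    coef, _, rank, _ = np.linalg.lstsq(design, features, rcond=None)
    if rank < p:
        raise ValueError("design matrix is rank deficient")

    # Residual variance per feature, then standard errors per coefficient.
    resid = features - design @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - p)
    xtx_inv_diag = np.diag(np.linalg.inv(design.T @ design))
    se = np.sqrt(np.outer(xtx_inv_diag, sigma2))

    t = coef / se
    return coef[1:], t[1:]


# Simulated example: 50 samples, 2 environmental parameters, 3 features.
rng = np.random.default_rng(0)
demo_env = rng.normal(size=(50, 2))
demo_noise = rng.normal(scale=0.5, size=(50, 3))
demo_features = demo_noise.copy()
demo_features[:, 0] += 3.0 * demo_env[:, 0]   # feature 0 tracks parameter 0
demo_features[:, 2] += -2.0 * demo_env[:, 1]  # feature 2 tracks parameter 1
demo_coef, demo_t = per_feature_ols(demo_env, demo_features)
```

Features can then be ranked by the absolute t-statistic for each environmental parameter. With 100,000 features, some form of multiple-testing correction would be needed before interpreting individual associations.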

$\endgroup$
