$\begingroup$

I have a set of data associated with ~60 individuals. For each individual, I have sequence data for a number of different genes. I have performed clustering analysis (using affinity propagation) for each gene, based on the number of pairwise differences between sequences.

This means that for each gene, I have a number of clusters to which each individual is assigned. However, the cluster membership may be completely different for different genes.

My question is:

How do I assess how well conserved the clustering is between genes? That is, is there some metric or statistic that will give me a measure of whether the cluster grouping is conserved between different genes?

To put it slightly differently, suppose Alice and Bob belong to the same cluster when considering Genes 1, 4 and 5, but to different clusters when considering Genes 2 and 3. How can I determine whether this is what would be expected if all gene sequences were independent of each other? And if it is not, is there a metric that gives the "strength" of such a relationship (being in the same cluster across multiple genes)?

I'm imagining that I will need to assess the correlation between a set of matrices describing the clustering for each gene, but I am unsure if there is a standard approach for this type of problem.

Note: I am not necessarily looking for a complete solution, but rather some pointers in the right direction. I have struggled to turn up anything useful in the usual Google searches.

$\endgroup$
  • $\begingroup$ This might help: with this metric you can compare cluster labels: scikit-learn.org/stable/modules/generated/… $\endgroup$ Commented Jan 7, 2017 at 20:55
  • $\begingroup$ @Tom, can you put that response in an answer so I can award you the bounty? $\endgroup$ Commented Jan 10, 2017 at 22:14
  • $\begingroup$ Hi Andrew, I added the answer. Hopefully it will help you! $\endgroup$ Commented Jan 12, 2017 at 16:04

1 Answer

$\begingroup$

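The Adjusted Rand Index (ARI) measures the agreement between two cluster labelings of the same individuals, even if the label names don't match between the two labelings. It is corrected for chance: identical partitions score 1, and independent (random) labelings score close to 0. scikit-learn has a good implementation of this, `adjusted_rand_score`. The original paper describing this index is Hubert and Arabie, 1985 [1].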
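As a minimal sketch of how this could be applied across genes, the snippet below computes the ARI for every pair of genes and collects the scores in a matrix. The gene names, the toy labels and the helper `pairwise_ari` are hypothetical and only for illustration; the one library call used is `sklearn.metrics.adjusted_rand_score`.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments: one list per gene, one entry per individual
# (same individual order in every list). Label values are arbitrary; only the
# grouping matters.
labels_by_gene = {
    "gene1": [0, 0, 1, 1, 2, 2],
    "gene2": [1, 1, 0, 0, 2, 2],  # same partition as gene1, labels renamed
    "gene3": [0, 1, 0, 1, 0, 1],  # a different partition
}


def pairwise_ari(labels_by_gene):
    """Return the gene names and a symmetric matrix of pairwise ARI scores."""
    genes = list(labels_by_gene)
    n = len(genes)
    ari = np.eye(n)  # a labeling always agrees perfectly with itself
    for i in range(n):
        for j in range(i + 1, n):
            score = adjusted_rand_score(
                labels_by_gene[genes[i]], labels_by_gene[genes[j]]
            )
            ari[i, j] = ari[j, i] = score
    return genes, ari


genes, ari = pairwise_ari(labels_by_gene)
print(genes)
print(np.round(ari, 2))
```

Pairs of genes with a score near 1 group the individuals in almost the same way; scores near 0 (or slightly negative) are what unrelated clusterings tend to produce.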

This might be a good point to start your investigation:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
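For the "is this what independence would give?" part of the question, one possible approach (not part of the reference above, just a sketch) is a permutation test: shuffle which individual carries which label in one of the two clusterings, recompute the ARI many times, and see how often the shuffled score reaches the observed one. The function name `ari_permutation_test` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def ari_permutation_test(labels_a, labels_b, n_permutations=999, seed=0):
    """Observed ARI and a one-sided permutation p-value.

    Null hypothesis (an assumption of this sketch): the two clusterings are
    independent, so shuffling the individuals in one of them does not matter.
    """
    rng = np.random.default_rng(seed)
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    observed = adjusted_rand_score(labels_a, labels_b)

    # ARI values obtained after breaking any link between the two clusterings.
    null = np.array([
        adjusted_rand_score(labels_a, rng.permutation(labels_b))
        for _ in range(n_permutations)
    ])

    # The +1 terms count the observed value itself, so p is never exactly 0.
    p_value = (1 + np.sum(null >= observed)) / (n_permutations + 1)
    return observed, p_value


# Toy usage with 60 individuals in three clusters of 20 (made-up data).
gene_x = np.repeat([0, 1, 2], 20)
gene_y = (gene_x + 1) % 3  # same grouping, different label names
observed, p_value = ari_permutation_test(gene_x, gene_y)
print(observed, p_value)
```

With many genes this gives one p-value per pair of genes, so some correction for multiple comparisons would be needed before reading much into any single pair.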

[1] Hubert, Lawrence, and Phipps Arabie. 1985. “Comparing Partitions.” Journal of Classification 2 (1). Springer-Verlag: 193–218.

$\endgroup$
  • $\begingroup$ This is being automatically flagged as low quality, probably because it is so short. At present it is more of a comment than an answer by our standards. Can you expand on it? We can also turn it into a comment. $\endgroup$ Commented Jan 12, 2017 at 16:57
  • $\begingroup$ Sorry guys, that's probably my fault. I had a bounty up on this question, @Tom's original comment was very helpful, so I asked him to put it in an answer so I could give him the bounty (the bounty has expired in the meantime :( ). I've added edits to his answer to make it a bit more complete - I hope this is ok. $\endgroup$ Commented Jan 12, 2017 at 22:19
