
Suppose I have a set of symbols with expected (i.e. background) probabilities for each, and a set of n observed sequences of these symbols, each of length m. Simply looking over the array of observed symbol counts in these sequences, I notice that a certain symbol (or class of symbols) X appears to be overrepresented. I wish to quantify this in bits of information; in other words, I want something like "bits of enrichment of X" that expresses how much a hypothetical "sieve" selecting in favor of extra Xs (but that "sees" nothing but the number of Xs in each sequence) has biased these sequences away from random (background) sequences.

The ultimate goal is to be able to divide this number of bits by the total number of bits of information across all positions of the sequences, in order to determine what fraction of the selection process that gave rise to these sequences can be explained by a model considering only global selection for more Xs, and how much remains that must come from other effects (selection for/against other symbols, position-dependent selection for Xs or other symbols, etc.).

One way to do this would be simply to take the total count of Xs in all sequences, sum the binomial probabilities of observing count(X) through mn Xs out of mn observations given the expected p(X) (i.e. the upper-tail binomial probability), take the negative log2 of that, and divide by the number of sequences. But is this the generally accepted/standard way of doing this? Is there something mathematically simpler or more correct?
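
For concreteness, here is a minimal sketch of that tail-probability calculation in Python, using made-up values for the background p(X), the sequence count, the length, and the observed count of X (none of these numbers come from real data):

```python
# A minimal sketch of the upper-tail binomial approach described above.
# p_x, n_seq, m, and k_obs are invented values for illustration only.
import numpy as np
from scipy.stats import binom

p_x = 0.25     # background probability of symbol X (assumed)
n_seq = 50     # number of observed sequences (n)
m = 20         # length of each sequence
k_obs = 310    # total count of X across all sequences (assumed)

N = n_seq * m  # total number of symbol observations (mn)

# P(count >= k_obs) under the background model: the sum of binomial
# probabilities from k_obs through N, i.e. the upper tail.
tail_prob = binom.sf(k_obs - 1, N, p_x)

# "Surprise" of the observed enrichment in bits, and averaged per sequence.
bits_total = -np.log2(tail_prob)
bits_per_sequence = bits_total / n_seq
print(bits_total, bits_per_sequence)
```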

When reading about the entropy of observed counts vs. a theoretical distribution, I come across the Kullback-Leibler divergence--is that a better measure, or are the two even equivalent? I notice that K-L divergence is always non-negative--and while on one hand the enrichment of a symbol is signed, in that it can be more OR less common than expected, BOTH of these require selection, which corresponds to "doing positive information-theoretic work", so maybe the K-L divergence really is the proper measure here.
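
As a point of comparison, here is a small sketch of the K-L alternative on the binary X / not-X split; the background probability q and observed fraction phat below are assumptions for illustration only:

```python
# A sketch of the K-L divergence measure on the binary X / not-X split.
# q (background p(X)) and phat (observed fraction of X) are assumed values.
import numpy as np

def kl_binary_bits(phat, q):
    """D_KL(phat || q) in bits for a two-outcome (X vs. not-X) distribution."""
    total = 0.0
    for p_obs, p_bg in [(phat, q), (1.0 - phat, 1.0 - q)]:
        if p_obs > 0:
            total += p_obs * np.log2(p_obs / p_bg)
    return total

q = 0.25                  # background p(X) (assumed)
phat = 310 / (50 * 20)    # observed fraction of X across all positions

bits_per_position = kl_binary_bits(phat, q)   # divergence per symbol position
bits_per_sequence = 20 * bits_per_position    # times sequence length m
print(bits_per_position, bits_per_sequence)
```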

Going along with this, is the total count of Xs across all sequences a "sufficient statistic" for computing this, or do the individual per-sequence frequencies (no X, 1 X, 2 Xs, etc.) matter?

  • Chi-square test? Commented Oct 6 at 6:02
  • @user2974951 What I'm looking for is NOT a statistical test but a means of decomposing a complex, position-dependent set of features (i.e. effectively a "sequence logo") into a global compositional component plus residual position-dependent features, in order to aid insight/interpretability. I did some algebra in the meantime that seems to show that the Kullback-Leibler divergence possesses the necessary additivity properties to enable this. Commented Oct 6 at 23:17
  • Not an expert, but I agree that Kullback-Leibler divergence should work here. Commented Oct 7 at 20:52
  • I am not fully sure about your setting and your concrete aim here. Before looking deeper into this I have some questions: are all symbols supposed to have the same background probability (but do not, based on your observation)? Commented Oct 8 at 7:32
  • I think you should use Kullback–Leibler divergence; it gives the information-theoretic "bits of enrichment". The divergence is calculated on the binary (X/not-X) distribution. The total information $\sum_j D_{KL}(\hat{P}_j||Q)$ decomposes neatly as $m\,D_{KL}(\bar{P}||Q)+\sum_j D_{KL}(\hat{P}_j||\bar{P})$, where $Q$ is the background, $\hat{P}_j$ is the observed distribution at position $j$, and $\bar{P}$ is the global average. The first term $m\,D_{KL}(\bar{P}||Q)$ is the global compositional bias (your enrichment) and the second term is the remaining position-specific information. This is the standard log-likelihood-ratio measure. Commented Oct 8 at 15:07
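
As an illustration of the decomposition in the last comment, the identity can be checked numerically. The sketch below uses an invented 4-symbol background Q and random per-position frequencies P_hat; all numbers are illustrative only:

```python
# Numeric check of the decomposition quoted in the comment above:
#   sum_j D(P_j || Q) = m * D(Pbar || Q) + sum_j D(P_j || Pbar)
# Q and P_hat are invented for illustration; the identity holds regardless
# of alphabet size (the binary X / not-X case is a special case).
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits; zero-probability terms contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)
Q = np.array([0.25, 0.25, 0.25, 0.25])           # background over 4 symbols
P_hat = rng.dirichlet([3, 1, 1, 1], size=8)      # 8 positions x 4 symbols

P_bar = P_hat.mean(axis=0)                       # global average composition
m = P_hat.shape[0]

total = sum(kl_bits(P_hat[j], Q) for j in range(m))
compositional = m * kl_bits(P_bar, Q)            # global enrichment term
positional = sum(kl_bits(P_hat[j], P_bar) for j in range(m))

print(total, compositional + positional)         # the two agree
```

The agreement is exact (up to floating point) because the cross terms collapse: summing $\hat{P}_j$ over the m positions gives $m\bar{P}$, so the background-vs-average part factors out as a single global term.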

