Suppose I have a set of symbols with expected (i.e. background) probabilities for each, and a set of n observed sequences of these symbols, each of length m. Simply looking over the array of observed counts of symbols in each of these sequences, I notice that a certain symbol (or class of symbols) X appears to be overrepresented. I wish to quantify this in bits of information--in other words, I want something like "bits of enrichment of X" expressing how much some hypothetical "sieve" that selects in favor of extra Xs (but that "sees" nothing but the number of Xs in each sequence) has biased these sequences away from random (background) sequences.
The ultimate goal is to divide this number of bits by the total number of bits of information across all positions of the sequences, in order to determine what fraction of the selection process that gave rise to these sequences can be explained by a model considering only global selection for more Xs, and how much remains that must come from other effects (selection for/against other symbols, position-dependent selection for Xs or other symbols, etc.).
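For concreteness, I'm tentatively reading "total bits of information across all positions" as the summed per-position relative entropy of the observed column frequencies against the background--a rough Python sketch, where the function name and this reading are my own assumptions:

    import numpy as np

    def total_bits_all_positions(column_freqs, background):
        """Sum over positions of the relative entropy (in bits) between the
        observed symbol frequencies at each position and the background
        distribution -- one reading of 'total bits across all positions'."""
        col = np.asarray(column_freqs, dtype=float)  # shape (m, alphabet_size)
        bg = np.asarray(background, dtype=float)     # shape (alphabet_size,)
        safe = np.where(col > 0, col, 1.0)           # avoid log2(0); those terms are 0
        terms = np.where(col > 0, col * np.log2(safe / bg), 0.0)
        return float(terms.sum())

    # the fraction explained by the global-X model would then be something like
    # bits_of_X_enrichment / total_bits_all_positions(column_freqs, background)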
One way to do this would be simply to take the total count of Xs in all sequences, sum the binomial probabilities of seeing count(X) through mn Xs out of mn observations given the expected p(X), take the negative log2 of that tail probability, and divide by the number of sequences. But is this the generally accepted/standard way of doing this? Is there something mathematically simpler or more correct?
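Concretely, that calculation would look something like this in Python (using scipy; count_X, n, m, and p_X stand for the total observed count of X, the number of sequences, the sequence length, and the background probability of X):

    import numpy as np
    from scipy.stats import binom

    def enrichment_bits_binomial(count_X, n, m, p_X):
        """Surprisal, in bits, of seeing at least count_X occurrences of X
        among n*m symbols under the background probability p_X, averaged
        over the n sequences."""
        N = n * m
        # P(K >= count_X) for K ~ Binomial(N, p_X); sf(k) gives P(K > k)
        tail_prob = binom.sf(count_X - 1, N, p_X)
        return -np.log2(tail_prob) / n

    # e.g. 50 sequences of length 20 with 180 Xs observed against p(X) = 0.1
    print(enrichment_bits_binomial(180, 50, 20, 0.1))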
When reading about the entropy of observed counts versus a theoretical distribution, I keep coming across the Kullback-Leibler divergence--is that a better measure, or are the two even equivalent? I notice that K-L is always non-negative--and while on one hand the enrichment of a symbol is signed, in that it can be more OR less common than expected, BOTH of these require selection, which corresponds to "doing positive information-theoretic work", so maybe the K-L divergence really is the proper measure here.
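For comparison, here is what I imagine the K-L version of the same quantity would look like, collapsing the alphabet to "X vs. not-X" since the sieve only sees counts of X (q_X is the pooled observed frequency of X across all n*m positions; the result is in bits per position, so multiplying by m gives bits per sequence):

    import numpy as np

    def kl_bits_X_vs_background(q_X, p_X):
        """D(q || p) in bits for the two-outcome distribution (X, not-X):
        observed frequency q_X against background p_X. Non-negative whether
        X is enriched (q_X > p_X) or depleted (q_X < p_X)."""
        q = np.array([q_X, 1.0 - q_X])
        p = np.array([p_X, 1.0 - p_X])
        nonzero = q > 0  # terms with q == 0 contribute nothing by convention
        return float(np.sum(q[nonzero] * np.log2(q[nonzero] / p[nonzero])))

    # same numbers as above: 180 Xs in 50*20 = 1000 positions, background 0.1
    q_X = 180 / 1000
    print(kl_bits_X_vs_background(q_X, 0.1))       # bits per position
    print(20 * kl_bits_X_vs_background(q_X, 0.1))  # bits per sequence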
Going along with this, is the total count of Xs across all sequences a "sufficient statistic" for computing this, or do the individual per-sequence frequencies (no X, 1 X, 2 Xs, etc.) matter?