Huffman encoding performs best when the distribution of the alphabet symbols used by the string to be encoded is dyadic, i.e. every symbol's probability is a negative power of two.
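For instance (a quick Python sketch, with a made-up dyadic distribution): for probabilities 1/2, 1/4, 1/8, 1/8 the Huffman construction assigns each symbol a code of exactly -log2(p) bits, so the expected code length matches the entropy and nothing is lost to rounding.

```python
import heapq
from math import log2

def huffman_code_lengths(weights):
    """Huffman code length per symbol, built with the usual two-smallest merge."""
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # every symbol under the merged node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths

# A dyadic distribution: every probability is a negative power of two.
probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
for s, L in huffman_code_lengths(probs).items():
    print(s, L, -log2(probs[s]))     # the two numbers agree for every symbol
```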
Given an arbitrary bit string S, how can we find the best alphabet for encoding it? Suppose S is an ASCII file. Given the regularity of the 1-byte characters such files exhibit, we would expect that an optimal, or at least a pretty good, alphabet would consist of, say, 8-bit or 16-bit words (for which we then build codes after constructing the Huffman tree).
Is there an algorithm for finding the optimal word width (assuming we use fixed-length words)?
I would guess that to evaluate an alphabet fairly, we should also count the cost of storing the encoding itself. This addresses the degenerate case where the alphabet is a single symbol: the entire original string. Technically the message would then be just one bit, but the stored encoding tree would have to record that this one bit is the code for the original string, so we've trivially increased our message by two bits!
(Constant-length header information, such as the field recording the word width or the size of the encoding table, need not be considered in the comparison, of course.)
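To make the comparison concrete, here is one brute-force way to evaluate candidate widths, as a sketch under my own assumptions (the candidate widths, the padding rule, and the table-cost estimate are all placeholders, not a known algorithm): chop S into fixed-width words for each candidate width, build the Huffman code, and charge the encoded length plus a rough cost for storing the code table; the best width is the one with the smallest total.

```python
import heapq
from collections import Counter

def huffman_code_lengths(weights):
    """Huffman code length per symbol (a one-symbol alphabet gets a 1-bit code)."""
    if len(weights) == 1:
        return {s: 1 for s in weights}
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths

def total_cost(bits, width):
    """Encoded length plus a rough charge for storing the code table."""
    padded = bits + '0' * (-len(bits) % width)          # pad so the words divide evenly
    words = [padded[i:i + width] for i in range(0, len(padded), width)]
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    encoded = sum(lengths[w] * n for w, n in freqs.items())
    # Table estimate: each entry stores the raw word plus its code.
    # (A canonical Huffman table could be stored more compactly; this is only an estimate.)
    table = sum(width + lengths[w] for w in freqs)
    return encoded + table

def best_width(bits, candidates=(1, 2, 4, 8, 16)):
    return min(candidates, key=lambda w: total_cost(bits, w))

# Example: an ASCII string viewed as a raw bit string.
S = ''.join(format(b, '08b') for b in b'abracadabra abracadabra abracadabra')
print({w: total_cost(S, w) for w in (1, 2, 4, 8, 16)})
print('best width:', best_width(S))
```

Each candidate width needs one frequency pass over S, so trying a handful of widths is linear in |S| times the number of candidates; whether something smarter than this exhaustive comparison exists is exactly what I'm asking.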