
If a distribution belongs to a certain class, then the distribution with the largest entropy in that class is typically referred to as the least-informative distribution. To me, this is highly confusing. We have the following definition of self-information (or information content):

$$I(X) = -\log(P(X))$$

More often than not, this is referred to as surprise, but I prefer the term information (after all, it's called information theory, not surprise theory).

Entropy, then, is the expected surprise, or expected information, of a random variable:

$$H(X) = \mathbb{E}[I(X)] = \mathbb{E}[-\log(P(X))]$$

So, the distribution which has the maximum entropy in a given class is the distribution that is, on average, "the most surprising", but that should mean, given our setup, that this is the distribution which has the "most information" on average by the definition that we've given of self-information.
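To make these definitions concrete, here is a minimal Python sketch computing self-information and entropy for two coins (the probabilities 0.5/0.5 and 0.9/0.1 are illustrative choices, not from any source above):

```python
import math

def self_information(p):
    """Self-information I(x) = -log2 p(x), in bits."""
    return -math.log2(p)

def entropy(dist):
    """Entropy H = E[I(X)] = -sum p log2 p, skipping zero-probability outcomes."""
    return sum(p * self_information(p) for p in dist if p > 0)

fair   = [0.5, 0.5]  # maximum-entropy coin: most "surprising" on average
biased = [0.9, 0.1]  # less surprising on average

print(entropy(fair))    # 1.0 bit
print(entropy(biased))  # ≈ 0.469 bits
```

The fair coin, being the maximum-entropy distribution over two outcomes, has the largest expected self-information, which is exactly the tension the question is pointing at.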

EDIT: The Wikipedia article on this topic contains the following passage:

Consider a discrete probability distribution among $m$ mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero.

My understanding is that, in information theory, a sure event has zero information content. So, when we say that the maximum entropy distribution is the least informative are we using a definition of information that is the exact opposite of the definition set out in information theory?
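The Wikipedia passage can be checked directly: over $m$ mutually exclusive propositions, a degenerate ("sure") distribution has zero entropy, while the uniform distribution attains the maximum, $\log m$. A minimal Python sketch ($m = 4$ is chosen purely for illustration):

```python
import math

def entropy(dist):
    """H = -sum p log2 p, in bits; terms with p = 0 are taken as 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

m = 4
sure    = [1.0, 0.0, 0.0, 0.0]  # one proposition known to be true
uniform = [1.0 / m] * m         # maximum-entropy distribution over m outcomes

print(entropy(sure))     # 0.0 -- "most informative" in Wikipedia's sense
print(entropy(uniform))  # 2.0 == log2(4), the maximum
```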

All this boils down to the following question: when we say that "a maximum entropy distribution is the least informative distribution in a given class", are we using the opposite of the definition of information set out in information theory? Or is this usage actually compatible with that definition, and I am missing something?

  • The distribution is "least-informative" in the sense that it provides the least information on a random variable, which equivalently means that an observation provides the "most information". (Apr 27, 2023 at 18:42)
  • Can you please elaborate on this equivalence? Also, is there some other definition of "information" that you are using when you say "least information on a random variable"? Or do you mean the definition of information given in the first equation of the question? (Apr 27, 2023 at 18:46)
  • Is this a question or a rant? If a question, could you clarify what the question is? (Apr 27, 2023 at 19:02)
  • Updated to make the question clearer. (Apr 27, 2023 at 19:06)

1 Answer

The "information" referred to in the definition of entropy is not the information contained in the distribution, but, informally speaking, the information contained in an observation from the distribution, relative to the distribution. Consider the maximum entropy distribution on $(0,1)$, which is the Uniform distribution, and compare it to, say, the $\text{Beta}(100,100)$ distribution. With the latter distribution, we know the random variable will be very close to $0.5$ with considerable confidence! So observing $x$ tells us very little, because we already know pretty much where in $(0,1)$ $x$ will be. But we have no idea where in $(0,1)$ the random variable will be if we only have the Uniform distribution to hand, so observing $x$ updates our information quite a lot.
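A quick numerical illustration of this comparison, as a Python sketch using only the standard library (the midpoint-rule integrator is my own illustrative choice, not a claim about how entropy is usually computed):

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, via log-gamma to stay stable for large a, b."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B)

def diff_entropy(pdf, n=20000):
    """Differential entropy -∫ p log p dx on (0,1), midpoint rule, in nats."""
    dx = 1.0 / n
    h = 0.0
    for i in range(n):
        p = pdf(i * dx + dx / 2)
        if p > 0:
            h -= p * math.log(p) * dx
    return h

h_uniform = diff_entropy(lambda x: 1.0)                     # 0 analytically
h_beta    = diff_entropy(lambda x: beta_pdf(x, 100, 100))   # ≈ -1.93 nats

print(h_uniform, h_beta)
```

The Beta(100,100) distribution, being sharply concentrated near $0.5$, has much lower (here, negative) differential entropy than the Uniform, matching the intuition above: the distribution already tells us a lot, so an observation adds little.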

Entropy is related to "surprise" in that the expected increase in information due to observing something is maximized for the maximum entropy distribution. In an anthropomorphic sense, the more we've learned from a single observation, the more likely we are to have been surprised by it.

  • So to say that a maximum entropy distribution in a given class is "least informative" is to say that, on average, the information gained from a random draw from such a distribution is greater than the information gained from a random draw from any other distribution in that class? (Apr 27, 2023 at 21:29)
  • Exactly so! The distribution is least informative = the resulting observation is most informative. Porting concepts from one field into another can produce unclear terminology, because the receiving field may already use the same words differently. (Apr 27, 2023 at 21:51)
