
Recently, I have learned about the Principle of Maximum Entropy with regard to probability distributions: when certain "information" (i.e. constraints) is available about some class of probability distributions (e.g. the domain over which the density is defined, its expectation, etc.), we can use the principle of Maximum Entropy to determine the "most informative" distribution from this class, namely the one of maximum entropy satisfying those constraints.

Apparently, in many real-world situations (e.g. when the data is continuous, can take any value between negative and positive infinity, and has a fixed mean and variance), the Normal Distribution ends up being the distribution with maximum entropy, thus often being the "most informative" choice of distribution compared to any other candidate.

My Question: Can this fact about the "Maximum Entropy" of the Normal Distribution corresponding to the "most informative" probability distribution be used to explain its prevalence and popularity in statistics? Perhaps this "most informativeness" property of the normal distribution "naturally" resulted in "more successful applications" (e.g. real-world statistical models with higher consistency, higher accuracy, and lower variance) and in turn made it more "popular"?

Thanks!

  • The answer is imho yes. (Just as the uniform distribution on a set with two elements is very easily seen to maximize entropy.) The statement of the class of distributions among which the normal one maximizes entropy is correctly written here, and there is a reference. – Commented May 11, 2022 at 5:37
  • Not going to say no, but the central limit theorem might be more important (when the observed quantity is supposedly the sum of many small independent, not necessarily i.i.d., random variables). – Commented May 11, 2022 at 6:05
  • The prevalence of Normal distributions has different explanations depending on where it occurs. For example, my answer here explains why errors are often Normal, summarizing several points due to Jaynes. My discussion therein of his Sec. 7.11 mentions the entropy motivation, as one of several Chapter 7 arguments. – Commented May 11, 2022 at 22:11
  • Any developments on your doubts? Did you check my answer? – Commented May 18, 2022 at 2:19

1 Answer


So, to avoid a merely opinion-based answer, let us first examine the entropy of a variable $x \sim \mathcal{N}(\mu, \sigma^2)$.

$$H(x)=-\int p(x) \log p(x)\,dx \tag{1}$$ $$= -\mathbb{E} \big[\log \mathcal{N}(\mu, \sigma^2) \big] \tag{2}$$ $$= - \mathbb{E}\Bigg[\log \Big[(2\pi\sigma^2)^{-\frac{1}{2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \Big] \Bigg] \tag{3}$$ $$=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2\sigma^2} \mathbb{E}\big[(x-\mu)^2 \big]\tag{4}$$ $$=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2} \tag{5}$$

(To understand the transition from $(4)$ to $(5)$, simply recall the definition of variance: $\mathbb{E}\big[(x-\mu)^2 \big]=\sigma^2$.)

As you can see, the entropy of a normal distribution is a function of its variance alone; the mean does not enter at all. That is intuitive, because higher variance implies higher entropy and vice versa. But what is this telling us?
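As a quick numerical sanity check (a minimal sketch assuming NumPy and SciPy are available; the parameter grid is an arbitrary choice), the closed form in $(5)$ agrees with SciPy's built-in differential entropy, and changing $\mu$ leaves the value untouched while changing $\sigma$ moves it:

```python
import numpy as np
from scipy.stats import norm

def gaussian_entropy(sigma):
    """Closed form from equation (5): H = (1/2) log(2*pi*sigma^2) + 1/2, in nats."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + 0.5

# Entropy depends on sigma but not on mu: shifting the mean moves the
# distribution without changing how spread out (uncertain) it is.
for mu in (0.0, 5.0):
    for sigma in (0.5, 1.0, 2.0):
        print(f"mu={mu:4.1f} sigma={sigma:3.1f}  "
              f"closed form={gaussian_entropy(sigma):.4f}  "
              f"scipy={norm(loc=mu, scale=sigma).entropy():.4f}")
```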

First, consider how information is defined:

$$H(X)=\mathbb{E}_{x\sim p}[I(x)] = -\mathbb{E}_{x\sim p}[\log p(x)] = -\int_{-\infty}^\infty p(x)\log p(x)\,dx$$

This definition formalizes the notion that unlikely events convey more information than likely events. It does not always align with more common-sense notions of information (for example, the number of conclusions we can draw from given data); in fact, the two may even be contradictory.
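To make that formal notion concrete, here is a tiny sketch (the probabilities are invented for illustration) of the self-information $I(x) = -\log p(x)$ that the definition averages over:

```python
import math

# Self-information I(x) = -log p(x), in nats: the rarer the event, the
# more information its occurrence carries under this definition.
for label, p in (("likely event,   p = 0.99", 0.99),
                 ("unlikely event, p = 0.01", 0.01)):
    print(f"{label}:  I = {-math.log(p):.3f} nats")
# Entropy H(X) is then simply the expected self-information under p.
```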

For example, imagine two variables $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y\sim \mathcal{N}(\mu_2, \sigma_2^2)$ with $H(X)$ very high and $H(Y)$ very low. $X$ conveys more information, in the formal sense of the definition. However, this is not without problems. The most elementary example is that $\mu_1$ will be far less representative of $X$ than $\mu_2$ is of $Y$. At the same time, if you model data with a Gaussian distribution and your model has very high entropy, this may be an indicator of overfitting. In fact, Gaussian models are quite prone to overfitting in general.
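To put a number on the "representativeness" point (a minimal sketch; the $\pm 1$ window around the mean and the two variances are arbitrary choices), the probability mass near the mean shrinks as the entropy grows:

```python
from scipy.stats import norm

# Probability of landing within +/-1 of the mean for a low-entropy (Y)
# and a high-entropy (X) Gaussian: the mean of X "represents" it less well.
for name, mu, sigma in (("Y (low entropy) ", 0.0, 0.5),
                        ("X (high entropy)", 0.0, 5.0)):
    dist = norm(loc=mu, scale=sigma)
    mass_near_mean = dist.cdf(mu + 1) - dist.cdf(mu - 1)
    print(f"{name}: H = {dist.entropy():.3f},  P(|x - mu| <= 1) = {mass_near_mean:.3f}")
```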

As you can see, high entropy does not automatically translate into statistical usefulness. So $1)$ why is entropy valued, and $2)$ why is the normal distribution so important?

  1. Maximum entropy is valuable mainly because, when we do not know the real function $f$ that generates some data $X$, we want to find a model $\hat{f}$ that makes the fewest possible assumptions. As explained in one of your past questions, high entropy does guarantee this (it is entailed by the definition). (Again, see Occam's razor.)

  2. The normal distribution is a maximum entropy distribution. That has value with regard to point $1)$, but it is not enough to answer our question. If we couldn't draw important conclusions from a Gaussian distribution, the fact that it is safe to assume such a distribution would not be very useful. The normal distribution has properties unrelated to its entropy that make it extremely valuable, but I will not enumerate them because they are basic and well known. (Simply consider the value of the central limit theorem; see the simulation sketch just below.) Because of these properties, a normal distribution is very nice to work with: one can generally draw stronger conclusions with less effort from a normal distribution than from a non-normal one (hence the value of normalizing non-normal variables).
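Since the central limit theorem carries much of the weight here, a small simulation sketch (the sample counts and the choice of uniform summands are arbitrary): standardized sums of decidedly non-Gaussian variables quickly develop Gaussian tail behaviour:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardized sums of n i.i.d. Uniform(0, 1) variables; by the CLT these
# approach N(0, 1) even though each summand is far from Gaussian.
for n in (1, 2, 30):
    sums = rng.uniform(0.0, 1.0, size=(100_000, n)).sum(axis=1)
    z = (sums - n * 0.5) / np.sqrt(n / 12.0)  # Uniform(0,1): mean 1/2, variance 1/12
    # For a standard normal, P(Z > 1.96) is about 0.025.
    print(f"n = {n:2d}:  P(Z > 1.96) ≈ {np.mean(z > 1.96):.4f}")
```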

So my final answer is no: maximum entropy is not enough to explain the prevalence and popularity of the normal distribution in statistics. Arguably, the central limit theorem plays a greater role in that.
