AndreaL
  • 541
  • 2
  • 6

You might want to have a look at the Wallis derivation.

https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#The_Wallis_derivation

It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept.

The Wikipedia page is excellent, but let me add a simple example to illustrate the idea.

Suppose you have a die. If the die is fair, the average value it shows will be 3.5. Now imagine a die for which the average value shown is 5.

How can it achieve that? Well, it could do it in a zillion ways! It could, for example, show 5 every time. Or it could show 4, 5, and 6 with equal probability.

But here is an interesting way of doing it. Imagine you have a fair die. You roll it many times (say 100) and you get a bunch of numbers. If the average of these numbers is 5, you accept the sample. Otherwise you reject it and try again.

After many attempts, you finally get a sample with the average you want. Which numbers are in that sample? Well, you expect 1, for example, to be present a little, but probably less than 1/6 of the time, because a 1 lowers the average of the sample and so increases the probability that the sample is rejected.
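This acceptance-rejection procedure is easy to simulate. The sketch below is my own illustration, not something from the linked page; it uses samples of 10 rolls instead of 100, only so that samples with average exactly 5 are not astronomically rare. The qualitative effect is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rolls = 10          # 10 rolls per sample instead of 100, so that an
                      # average of exactly 5 is rare but still reachable
n_attempts = 2_000_000

# Roll a fair die n_rolls times per attempt; accept a sample only if
# its numbers sum to 5 * n_rolls, i.e. its average is exactly 5.
attempts = rng.integers(1, 7, size=(n_attempts, n_rolls))
accepted = attempts[attempts.sum(axis=1) == 5 * n_rolls]

# Pool the faces of all accepted samples and look at their frequencies.
faces = accepted.ravel()
freqs = np.bincount(faces, minlength=7)[1:] / faces.size
for k, f in enumerate(freqs, start=1):
    print(f"face {k}: {f:.3f}")

# A fair die shows each face 1/6 ~ 0.167 of the time.  Conditioning on
# an average of 5 suppresses the low faces and boosts the high ones.
```

In runs of this sketch the frequency of 1 indeed comes out well below 1/6, and the frequencies increase monotonically from face 1 to face 6, which is exactly the skew the argument predicts.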

In the limit of a very big sample, the numbers will be distributed according to this:

https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution#Discrete_distributions_with_specified_mean

which is the distribution with maximum entropy among the ones with the specified mean. Aha!
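For completeness, here is the standard Lagrange-multiplier calculation behind that linked result (the notation λ₀, λ₁, r is mine, not the page's):

```latex
Maximize $H = -\sum_{k=1}^{6} p_k \ln p_k$ subject to
$\sum_k p_k = 1$ and $\sum_k k\, p_k = 5$, via the Lagrangian
\[
  \mathcal{L} = -\sum_k p_k \ln p_k
    + \lambda_0 \Big( \sum_k p_k - 1 \Big)
    + \lambda_1 \Big( \sum_k k\, p_k - 5 \Big).
\]
Setting $\partial \mathcal{L} / \partial p_k = -\ln p_k - 1 + \lambda_0 + \lambda_1 k = 0$ gives
\[
  p_k = \frac{r^k}{Z}, \qquad r = e^{\lambda_1}, \qquad Z = \sum_{k=1}^{6} r^k,
\]
with $\lambda_1$ fixed by the mean constraint. A mean above 3.5 requires the
high faces to be more likely, so $r > 1$ here.
```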
