You might want to have a look at the Wallis derivation.
https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#The_Wallis_derivation
It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept.
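For reference, the combinatorial core of the argument can be sketched like this (notation mine, not from the linked page): count the number of ways $N$ trials can land on $m$ outcomes with counts $n_i = N p_i$, then take the large-$N$ limit.

```latex
% Multiplicity of a count vector (n_1, ..., n_m) with \sum_i n_i = N:
W = \frac{N!}{n_1! \, n_2! \cdots n_m!}

% With n_i = N p_i and Stirling's approximation \ln n! \approx n \ln n - n:
\frac{1}{N} \ln W \;\to\; -\sum_{i=1}^{m} p_i \ln p_i
\qquad (N \to \infty)
```

So maximizing the multiplicity $W$ subject to the constraints is, asymptotically, the same as maximizing the entropy: no appeal to "uncertainty" needed.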
The Wikipedia page is excellent, but let me add a simple example to illustrate the idea.
Suppose you have a die. If the die is fair, the average value it shows will be 3.5. Now imagine a die for which the average value shown is 5.
How can it achieve that? Well, it could do it in a zillion ways! It could, for example, show 5 every time. Or it could show 4, 5, and 6 with equal probability.
But here is an interesting way to do it. Imagine you have a fair die. You roll it many times (say 100) and get a bunch of numbers. If the average of those numbers is 5, you accept the sample; otherwise you reject it and try again.
After many attempts, you finally get a sample with the average you want. Which numbers are in that sample? You would expect 1, for example, to show up occasionally, but probably less than 1/6 of the time, because a 1 lowers the average of the sample and so increases the probability that the sample is rejected.
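You can actually run this rejection experiment. A caveat: with 100 rolls, a sample averaging exactly 5 is astronomically rare, so the sketch below (all names and parameters mine) uses 10 rolls per sample to keep the acceptance rate practical, and tallies how often each face appears in the accepted samples:

```python
import random
from collections import Counter

random.seed(0)     # reproducible run

N_ROLLS = 10       # rolls per sample (small, so acceptance is feasible)
TARGET_SUM = 50    # sum 50 over 10 rolls = average of exactly 5
N_ACCEPTED = 200   # how many accepted samples to collect

counts = Counter()
accepted = 0
while accepted < N_ACCEPTED:
    sample = [random.randint(1, 6) for _ in range(N_ROLLS)]
    if sum(sample) == TARGET_SUM:   # accept only if the average is exactly 5
        counts.update(sample)
        accepted += 1

# Empirical face frequencies within the accepted samples
total = N_ACCEPTED * N_ROLLS
for face in range(1, 7):
    print(f"face {face}: {counts[face] / total:.3f}")
```

The printed frequencies increase with the face value: high faces are over-represented and low faces suppressed, just as the argument above predicts.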
In the limit of a very big sample, the numbers will be distributed according to
$$p(k) = \frac{e^{\lambda k}}{\sum_{j=1}^{6} e^{\lambda j}},$$
with $\lambda$ chosen so that the mean is 5. And this is exactly the distribution with maximum entropy among the ones with the specified mean. Aha!
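To make that concrete, here is a short sketch (all names mine) that computes the maximum-entropy distribution on {1, ..., 6} with mean 5. Since the mean of $p(k) \propto e^{\lambda k}$ is increasing in $\lambda$, a simple bisection finds the right $\lambda$:

```python
import math

FACES = range(1, 7)
TARGET_MEAN = 5.0

def mean_for(lam):
    """Mean of p(k) proportional to exp(lam * k) on faces 1..6."""
    weights = [math.exp(lam * k) for k in FACES]
    z = sum(weights)
    return sum(k * w for k, w in zip(FACES, weights)) / z

# mean_for(0) = 3.5 and mean_for(10) is close to 6,
# so the target mean 5 is bracketed; bisect on lam.
lo, hi = 0.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) < TARGET_MEAN:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

# Normalized maximum-entropy probabilities
weights = [math.exp(lam * k) for k in FACES]
z = sum(weights)
p = {k: w / z for k, w in zip(FACES, weights)}
for k in FACES:
    print(f"p({k}) = {p[k]:.3f}")
```

The resulting probabilities increase geometrically with the face value, matching the intuition from the rejection experiment: the die "prefers" high faces just enough to hit the required mean.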