If we take $n=1$ and $k=K$ and use the probabilities given by the softmax function, then the two definitions match.

Consider $n=1$ and $K=2$. Then the outcome is $y=0$ or $y=1$. Let $p_0$ denote $\mathbb{P}(y=0)$ and $p_1$ denote $\mathbb{P}(y=1)$. By the Kolmogorov axioms, we know $p_0 + p_1 = 1$ and $p_0 \ge 0$ and $p_1 \ge 0$. When the $p_i$ are fixed and the trials are independent, this is a Bernoulli distribution.
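
For concreteness, here is a minimal sketch (using SciPy; the value of $p_1$ is an arbitrary placeholder) showing that a Bernoulli distribution with a fixed probability satisfies this constraint:

```python
from scipy.stats import bernoulli

p1 = 0.3                            # arbitrary fixed probability of y = 1
p0 = bernoulli.pmf(0, p1)           # P(y = 0) = 1 - p1 = 0.7
print(p0 + bernoulli.pmf(1, p1))    # 1.0 -- the constraint p_0 + p_1 = 1
```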

In terms of your notation, we know that the $p_i$ are not fixed, but instead vary with the features $x$ and the parameters $\theta$. So we can write $$ p_i = \frac{\exp(f_i(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} $$ where the $f_i$ are the outputs of the neural network.
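
As a quick sanity check, here is a minimal Python sketch (the logit values are made-up placeholders standing in for the network outputs $f_i(x,\theta)$) that computes these softmax probabilities and confirms they satisfy the constraints above:

```python
import numpy as np

# Placeholder logits standing in for f_0(x, theta) and f_1(x, theta).
f = np.array([0.3, -1.2])

# Softmax: p_i = exp(f_i) / sum_j exp(f_j)
p = np.exp(f) / np.exp(f).sum()

print(p)         # approximately [0.8176 0.1824], both non-negative
print(p.sum())   # 1.0 -- the probabilities sum to one
```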

It's somewhat cumbersome, but we can write this in the form of a multinomial distribution. Most of the factors in the multinomial PMF simplify because $n=1$, so each count is either $0$ or $1$.

$$\begin{align} \mathbb P(y=0) &=\frac{1!}{(1-0)!0!} \left(\frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^1 \left(\frac{\exp(f_1(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^0 \\ &= \frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} \end{align}$$

and similarly for the case $\mathbb P(y=1)$. This example can be generalized to the multinomial case.
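
To see that the two views coincide numerically, here is a small sketch (again with placeholder logits) comparing the multinomial PMF with $n=1$ against the softmax probabilities directly; both calculations give the same numbers:

```python
import numpy as np
from scipy.stats import multinomial

f = np.array([0.3, -1.2])           # placeholder f_0(x, theta), f_1(x, theta)
p = np.exp(f) / np.exp(f).sum()     # softmax probabilities

# Multinomial with n = 1 draw over K = 2 classes: the outcome y = 0
# corresponds to the count vector (1, 0), and y = 1 to (0, 1).
print(multinomial.pmf([1, 0], n=1, p=p), p[0])   # both equal P(y = 0)
print(multinomial.pmf([0, 1], n=1, p=p), p[1])   # both equal P(y = 1)
```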

In all of these circumstances, the $p_i$ are the only quantities you are estimating (as functions of the features and parameters). The model assumes that the number of draws and the number of classes are known.
