If we take $n=1$ and $k=K$ and use the probabilities given by the softmax function, then the two definitions match.

Consider $n=1$ and $K=2$. Then the outcome is $y=0$ or $y=1$. Let $p_0$ denote $\mathbb{P}(y=0)$ and $p_1$ denote $\mathbb{P}(y=1)$. By the Kolmogorov axioms, we know $p_0 + p_1 = 1$ and $p_0 \ge 0$ and $p_1 \ge 0$. When the $p_i$ are fixed and the trials are independent, this is a Bernoulli distribution.
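
For concreteness, here is a minimal sketch (using SciPy; the value of $p_1$ is an arbitrary placeholder) showing that a Bernoulli distribution with a fixed probability satisfies this constraint:

```python
from scipy.stats import bernoulli

p1 = 0.3                            # arbitrary fixed probability of y = 1
p0 = bernoulli.pmf(0, p1)           # P(y = 0) = 1 - p1 = 0.7
print(p0 + bernoulli.pmf(1, p1))    # 1.0 -- the constraint p_0 + p_1 = 1
```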

In terms of your notation, we know that the $p_i$ are not fixed, but instead vary with the features $x$ and the parameters $\theta$. So we can write $$ p_i = \frac{\exp(f_i(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} $$ where the $f_i$ are the outputs of the neural network.
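
As a quick sanity check, here is a minimal Python sketch (the logit values are made-up placeholders standing in for the network outputs $f_i(x,\theta)$) that computes these softmax probabilities and confirms they satisfy the constraints above:

```python
import numpy as np

# Placeholder logits standing in for f_0(x, theta) and f_1(x, theta).
f = np.array([0.3, -1.2])

# Softmax: p_i = exp(f_i) / sum_j exp(f_j)
p = np.exp(f) / np.exp(f).sum()

print(p)         # approximately [0.8176 0.1824], both non-negative
print(p.sum())   # 1.0 -- the probabilities sum to one
```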

It's somewhat cumbersome, but we can write this in the form of a multinomial distribution. Most of the factors in the multinomial PMF simplify because $n=1$, so each count is either $0$ or $1$.

$$\begin{align} \mathbb P(y=0) &=\frac{1!}{(1-0)!0!} \left(\frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^1 \left(\frac{\exp(f_1(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^0 \\ &= \frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} \end{align}$$

and similarly for the case $\mathbb P(y=1)$. This example can be generalized to the multinomial case.
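
To see that the two views coincide numerically, here is a small sketch (again with placeholder logits) comparing the multinomial PMF with $n=1$ against the softmax probabilities directly; both calculations give the same numbers:

```python
import numpy as np
from scipy.stats import multinomial

f = np.array([0.3, -1.2])           # placeholder f_0(x, theta), f_1(x, theta)
p = np.exp(f) / np.exp(f).sum()     # softmax probabilities

# Multinomial with n = 1 draw over K = 2 classes: the outcome y = 0
# corresponds to the count vector (1, 0), and y = 1 to (0, 1).
print(multinomial.pmf([1, 0], n=1, p=p), p[0])   # both equal P(y = 0)
print(multinomial.pmf([0, 1], n=1, p=p), p[1])   # both equal P(y = 1)
```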

In all of these circumstances, the $p_i$ are the only quantities you are estimating (as functions of the features and parameters). The model assumes that the number of draws and the number of classes are known.
