If we take $n=1$ and $k=K$ and use the probabilities given by the softmax function, then the two definitions match.

As an example, consider $n=1$ and $K=2$. Then the outcome is $y=0$ or $y=1$. Let $p_0$ denote $\mathbb{P}(y=0)$ and $p_1$ denote $\mathbb{P}(y=1)$. By the Kolmogorov axioms, we know $p_0 + p_1 = 1$ and $p_0 \ge 0$ and $p_1 \ge 0$. When the $p_i$ are fixed and the trials are independent, this is a Bernoulli distribution.
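Written out explicitly (an added step for reference, in the same notation), the Bernoulli PMF for a single outcome $y \in \{0, 1\}$ is

$$\mathbb P(y) = p_1^{\,y}\,(1 - p_1)^{1-y} = p_1^{\,y}\,p_0^{\,1-y},$$

which is exactly the $K=2$, $n=1$ special case of the multinomial PMF written below.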

In terms of your notation, we know that the $p_i$ are not fixed, but instead vary with the features $x$ and the parameters $\theta$. So we can write $$ p_i = \frac{\exp(f_i(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} $$ where the $f_i$ are the outputs of the neural network.
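As a concrete illustration (a minimal sketch, not part of the original answer), the $p_i$ above can be computed from the network outputs with a numerically stable softmax; the logits below are made up for the $K=2$ example.

```python
import numpy as np

def softmax(f):
    """Map raw network outputs f_i to probabilities p_i that are
    nonnegative and sum to 1, as the Kolmogorov axioms require."""
    f = np.asarray(f, dtype=float)
    e = np.exp(f - f.max())   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical network outputs f_0(x, theta) and f_1(x, theta) for K = 2
f = [1.2, -0.3]
p = softmax(f)
print(p, p.sum())  # p is nonnegative and sums to 1
```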

It's somewhat cumbersome, but we can write this in the form of a multinomial distribution. Because $n=1$ and each count is either $0$ or $1$, most of the factors in the multinomial PMF simplify away.

$$\begin{align} \mathbb P(y=0) &=\frac{1!}{(1-0)!0!} \left(\frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^1 \left(\frac{\exp(f_1(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))}\right)^0 \\ &= \frac{\exp(f_0(x,\theta))}{\exp(f_0(x,\theta))+\exp(f_1(x,\theta))} \end{align}$$

and similarly for the case $\mathbb P(y=1)$. This example can be generalized to the multinomial case.
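To spell the generalization out (an added step, not in the original answer): for $n=1$ draw from $K$ classes, encode the observed class $j$ as a one-hot count vector $(y_0, \dots, y_{K-1})$ with $y_j = 1$ and all other entries $0$. The multinomial PMF then collapses to the softmax probability of that class,

$$\mathbb P(y = j) = \frac{1!}{y_0! \cdots y_{K-1}!} \prod_{i=0}^{K-1} p_i^{\,y_i} = p_j = \frac{\exp(f_j(x,\theta))}{\sum_{i=0}^{K-1} \exp(f_i(x,\theta))},$$

because every factorial is $0! = 1$ or $1! = 1$ and every factor with $y_i = 0$ equals $1$.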

In all of these circumstances, the $p_i$ are the only thing that you are estimating (as functions of features and parameters). This model assumes that you know the number of draws $n$ and the number of classes $K$.
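If it helps to see the agreement numerically, here is a small sketch (an illustration under the assumptions above, not from the original answer) comparing the multinomial PMF with $n=1$ against the softmax probabilities for an arbitrary $K$; the logits are arbitrary made-up values.

```python
import numpy as np
from scipy.stats import multinomial

def softmax(f):
    f = np.asarray(f, dtype=float)
    e = np.exp(f - f.max())   # stabilized softmax
    return e / e.sum()

# Made-up network outputs for K = 3 classes, n = 1 draw
f = np.array([0.5, -1.0, 2.0])
p = softmax(f)

for j in range(len(p)):
    y = np.zeros(len(p), dtype=int)
    y[j] = 1                  # one-hot count vector for outcome y = j
    pmf = multinomial.pmf(y, n=1, p=p)
    print(j, pmf, p[j])       # the two values agree for every class j
```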
