
In GDA we assume that the class-conditional density for each of $K$ possible classes is Gaussian with shared covariance $\mathbf \Sigma$ and different means $\boldsymbol \mu_k$, i.e. $$p(\mathbf x|C_k)=\mathcal N(\mathbf x \mid \boldsymbol \mu_k,\mathbf \Sigma), \ k=1,\dots,K.$$ By Bayes' theorem we can write $$p(C_k|\mathbf x) = \frac{p(\mathbf x|C_k) \ p(C_k)}{\sum_{j=1}^K p(\mathbf x|C_j)\ p(C_j)} = \frac{\exp(a_k)}{\sum_{j=1}^K \exp(a_j)} $$ where $a_k$ is defined as $$a_k = \ln \bigl( p(\mathbf x|C_k)\, p(C_k) \bigr). $$ What confuses me is that, according to Bishop's book, we can represent $a_k$ as a linear model $$a_k=\mathbf w_k^T \,\mathbf x + w_{k0} $$ with \begin{equation} \mathbf w_k = \mathbf \Sigma^{-1} \boldsymbol \mu_k, \quad w_{k0}=-\frac{1}{2} \boldsymbol \mu_k^T \mathbf \Sigma^{-1} \boldsymbol \mu_k + \ln p(C_k) \end{equation} because (according to C. Bishop)

"We see that the $a_k$ are again linear functions of $\mathbf x$ as a consequence of the cancellation of the quadratic terms due to the shared covariances."

I don't see how the cancellation of the quadratic terms due to the shared covariances happens.

In my derivation:

\begin{align} a_k &= \ln \bigl( p(\mathbf x|C_k)\, p(C_k) \bigr)\\ &= \ln p(\mathbf x|C_k) + \ln p(C_k)\\ &=\ln\Bigl[(2\pi)^{-\frac{M}{2}}|\mathbf \Sigma|^{- \frac{1}{2}}\Bigr] -\frac{1}{2} (\mathbf x - \boldsymbol \mu_k)^T \mathbf \Sigma^{-1} (\mathbf x - \boldsymbol \mu_k)+ \ln p(C_k)\\ &= -\frac{1}{2} \color{red}{\mathbf x^T \mathbf \Sigma^{-1} \mathbf x }+\mathbf x^T\mathbf \Sigma^{-1} \boldsymbol \mu_k \! -\!\frac{1}{2}\boldsymbol \mu_k^T\mathbf \Sigma^{-1} \boldsymbol \mu_k +\ln\Bigl[(2\pi)^{-\frac{M}{2}}|\mathbf \Sigma|^{- \frac{1}{2}}\Bigr] + \ln p(C_k) \end{align} I still have the quadratic term $\color{red}{\mathbf x^T \mathbf \Sigma^{-1} \mathbf x} $!

I also get the same linear term, but in the bias term $w_{k0}$ I have the additional constant $\ln\bigl[(2\pi)^{-\frac{M}{2}}|\mathbf \Sigma|^{- \frac{1}{2}}\bigr]$.

How can I cancel the quadratic term and obtain a linear model in the multiclass case?

Additional note: for the binary case ($K=2$) I already know the answer: we represent the posterior probability with a sigmoid function, and in that case the quadratic term cancels.


1 Answer


My feeling is that the $a_k$ in (4.68) is not the same as the $a_k$ in (4.63); it could be called $b_k$. What matters is that the classification is made according to the highest value of all the $a_k$'s (4.68). Since the quadratic term doesn't depend on $k$, we can use (4.63) instead of (4.68), i.e. we can drop the quadratic term in $\mathbf x$. The text is ambiguous (very unlike Bishop!), but the fact that the classifier can be built from a linear form in $\mathbf x$ in the end is correct.
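Concretely, the softmax is invariant under adding any term that is the same for all classes. Writing $c(\mathbf x) = -\frac{1}{2}\mathbf x^T \mathbf \Sigma^{-1} \mathbf x + \ln\bigl[(2\pi)^{-\frac{M}{2}}|\mathbf \Sigma|^{-\frac{1}{2}}\bigr]$ for the $k$-independent part (the quadratic term plus the normalization constant), we have $$\frac{\exp\bigl(b_k + c(\mathbf x)\bigr)}{\sum_{j=1}^K \exp\bigl(b_j + c(\mathbf x)\bigr)} = \frac{e^{c(\mathbf x)}\exp(b_k)}{e^{c(\mathbf x)}\sum_{j=1}^K \exp(b_j)} = \frac{\exp(b_k)}{\sum_{j=1}^K \exp(b_j)}, \qquad b_k = \mathbf w_k^T \mathbf x + w_{k0},$$ so both the quadratic term and the extra constant cancel between numerator and denominator.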

After a few more thoughts: in fact, the new $a_k$ in (4.68) (call it $b_k$) is different from the old $a_k$ in (4.63), but softmax($a_k$) = softmax($b_k$) = $p(C_k \mid \mathbf x)$! This is because the factor containing the quadratic form in $\mathbf x$ can be factored out of both the numerator and the denominator of the softmax. We can thus use $p(C_k \mid \mathbf x)$ = softmax($b_k$) = softmax(linear form in $\mathbf x$) for classification.
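This is easy to check numerically. The sketch below (made-up dimensions, random parameters; not code from the book) compares the full log joint $a_k$, including the quadratic term and the Gaussian normalization constant, with the linear score $b_k = \mathbf w_k^T \mathbf x + w_{k0}$, and verifies that both give identical posteriors:

```python
import numpy as np

def softmax(a):
    # shift by the max for numerical stability; softmax is
    # invariant to adding the same constant to every entry
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
M, K = 3, 4                        # feature dimension and number of classes (made up)
A = rng.standard_normal((M, M))
Sigma = A @ A.T + M * np.eye(M)    # shared covariance (symmetric positive definite)
Sigma_inv = np.linalg.inv(Sigma)
mus = rng.standard_normal((K, M))  # class means mu_k
priors = np.full(K, 1.0 / K)       # p(C_k)
x = rng.standard_normal(M)

# k-independent part: quadratic term plus Gaussian normalization constant
const = (-0.5 * x @ Sigma_inv @ x
         + np.log((2 * np.pi) ** (-M / 2) * np.linalg.det(Sigma) ** -0.5))

# a_k with all terms (quadratic in x) vs. b_k = w_k^T x + w_k0 (linear in x)
a_full = np.array([const + x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
                   for mu, p in zip(mus, priors)])
b_lin = np.array([x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
                  for mu, p in zip(mus, priors)])

print(np.allclose(softmax(a_full), softmax(b_lin)))  # → True: the posteriors agree
```

Since `a_full = b_lin + const` with `const` the same for every class, the two softmaxes coincide exactly, not just their argmax.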

I forgot to mention that softmax($a_k$) is not a function of $a_k$ alone: it is a function of all the $a_k$ together, $k = 1,\dots,K$. Still, the reasoning above holds, I think.

  • Please don't post multiple replies--just keep improving this one through edits. Commented Nov 22, 2016 at 17:43
  • I agree with you: the $a_k$ in (4.68) is not the same as the $a_k$ in (4.63), which caused the confusion in the book. Since we use the max of all $a_k$, we can drop the quadratic term. And you make a good point: even if we do not compute softmax($a_k$), we can still compute the probability as softmax($b_k$), because they are the same. Commented Nov 23, 2016 at 10:13
