$\newcommand{\I}{\mathbb{I}}\newcommand{\E}{\mathbb{E}}$Given our observed variables $x = \{x_i\}$, hidden variables $z = \{z_i\}$ and distribution parameters $\theta = \{(\mu_j, \sigma_j)\}$, let's define the complete-data log-likelihood as
$$
\begin{align*}
\ell_c(\theta) &= \sum_{i=1}^n \log \Pr[x_i, z_i|\theta] = \sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log (\Pr[x_i|\theta_j] \Pr[z_i=j|\theta])
\end{align*}
$$
where $\I[\cdot]$ is the indicator function, $\Pr[x_i|\theta_j]$ is the probability of observing $x_i$ under the $j$th Gaussian, and $\Pr[z_i=j|\theta]$ is the mixing weight of the $j$th component (i.e., the probability that a point is generated by the $j$th Gaussian). This is the model we would have if we somehow knew which Gaussian generated each data point. Since we don't, we instead compute the expected value of this complete-data log-likelihood, taking the expectation over the hidden variables given the data and the previous parameter estimate $\theta^{t-1}$. To do that, define the function $Q(\theta^t|\theta^{t-1})$ as
$$
\begin{align*}
Q(\theta^t|\theta^{t-1}) &= \E\left[\sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log \Pr[x_i, z_i = j| \theta]\right]\\
&= \sum_{i=1}^n \sum_{j=1}^k \E[\I[z_i = j] \mid x_i, \theta^{t-1}] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \Pr[z_i = j \mid x_i, \theta^{t-1}] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \underbrace{\Pr[z_i = j \mid x_i, \theta^{t-1}]}_{r_{ij}} \log (\Pr[x_i|\theta_j]\underbrace{\Pr[z_i = j| \theta]}_{\pi_j})\\
&= \sum_{i=1}^n \sum_{j=1}^k r_{ij} \log \Pr[x_i|\theta_j] + \sum_{i=1}^n \sum_{j=1}^k r_{ij} \log \pi_j.
\end{align*}
$$
Computing $r_{ij}$ for the E-step is a direct application of Bayes' theorem:
$$ r_{ij} = \frac{\pi_j \Pr[x_i|\theta_j^{t-1}]}{\sum_{j'=1}^k \pi_{j'} \Pr[x_i|\theta_{j'}^{t-1}]}.$$

For the M-step we want the mixing weights $\pi$ and parameters $\theta$ that maximize $Q$. Luckily for us, the final formulation above shows that we can optimize each independently! Taking the derivative of $Q$ with respect to $\pi$ under the constraint $\sum_j \pi_j = 1$ (via a Lagrange multiplier) gives
$$ \pi_j = \frac{\sum_{i=1}^n r_{ij}}{n}. $$
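Concretely, the E-step boils down to a few lines of NumPy. Here is a minimal sketch (the function and argument names are my own choices, and `var` holds the component variances $\sigma_j$):

```python
import numpy as np

def e_step(x, pi, mu, var):
    """E-step: responsibilities r[i, j] = Pr[z_i = j | x_i, theta^{t-1}]."""
    x = np.asarray(x, dtype=float)
    # Gaussian densities N(x_i | mu_j, var_j), one column per component
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    joint = pi * dens  # numerator: pi_j * Pr[x_i | theta_j]
    # Normalizing each row is exactly the Bayes' theorem denominator above
    return joint / joint.sum(axis=1, keepdims=True)
```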
To optimize $Q$ with respect to $(\mu_j, \sigma_j)$, let's look at the part that depends only on them (dropping additive constants), where $\sigma_j$ denotes the variance of the $j$th Gaussian:
$$ \sum_{i=1}^n r_{ij} \log \Pr[x_i|\theta_j] = -\frac{1}{2}\sum_{i=1}^n r_{ij} \left(\log \sigma_j + \frac{(x_i - \mu_j)^2}{\sigma_j}\right). $$
Setting the derivative of this expression with respect to each parameter to zero gives
$$ \begin{align*} \mu_j &= \frac{\sum_{i=1}^n r_{ij} x_i}{\sum_{i=1}^n r_{ij}}, &\sigma_j &= \frac{\sum_{i=1}^n r_{ij} x_i^2}{\sum_{i=1}^n r_{ij}} - \mu_j^2. \end{align*} $$
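The corresponding M-step, again as a sketch under the same naming assumptions, implements the $\pi_j$, $\mu_j$ and $\sigma_j$ updates derived above:

```python
import numpy as np

def m_step(x, r):
    """M-step: re-estimate mixing weights, means, and variances from r[i, j]."""
    x = np.asarray(x, dtype=float)
    n_j = r.sum(axis=0)                        # effective number of points per component
    pi = n_j / len(x)                          # pi_j = sum_i r_ij / n
    mu = (r * x[:, None]).sum(axis=0) / n_j    # weighted means
    var = (r * x[:, None] ** 2).sum(axis=0) / n_j - mu ** 2  # weighted variances
    return pi, mu, var
```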
In your case, $x = (1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23)$, $z = (z_1,\dotsc, z_{13})$, $k=2$, and $\theta = ((\mu_1, \sigma_1),(\mu_2, \sigma_2))$. EM begins by randomly initializing $\pi = (\pi_1, \pi_2)$ and $\theta$, and then alternates between the E-step and the M-step. Once the parameters haven't changed much (or, better, once the observed-data log-likelihood hasn't changed much), you can stop and output the current values of $\theta^t$ and $\pi$.
One benefit of using $|\ell(\theta^t) - \ell(\theta^{t-1})|$ rather than $\|\theta^t - \theta^{t-1}\|$ as a stopping condition is that it doubles as a bug check. EM guarantees that the observed-data log-likelihood increases monotonically, so you can simply check at each iteration that the likelihood is greater than or equal to the previous iteration's. If it isn't, something is wrong.
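In code, that check might look something like the sketch below (again, `pi`, `mu`, `var` are just my names for the current parameter arrays):

```python
import numpy as np

def log_likelihood(x, pi, mu, var):
    """Observed-data log-likelihood: sum_i log sum_j pi_j * N(x_i | mu_j, var_j)."""
    x = np.asarray(x, dtype=float)
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log((pi * dens).sum(axis=1)).sum()

# Inside the EM loop one might then do something like:
#   ll = log_likelihood(x, pi, mu, var)
#   assert ll >= prev_ll - 1e-9, "log-likelihood decreased: there is a bug"
#   if abs(ll - prev_ll) < 1e-6:
#       break  # converged
#   prev_ll = ll
```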
Finally, one important thing to note is that the MLE for a mixture of Gaussians tends to overfit; in fact the likelihood is unbounded, since a single component can collapse onto one data point and shrink its variance toward zero. One way to overcome this is to incorporate priors on $\pi$ and $\theta$ and compute the MAP estimate instead. This only changes the M-step, but the exact form of the update depends on which prior you choose.
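As one illustration (the choice of prior here is mine, not something implied above): with a symmetric Dirichlet prior $\mathrm{Dir}(\alpha, \dotsc, \alpha)$ on the mixing weights, the only change is that the M-step update for $\pi$ becomes
$$ \pi_j = \frac{\sum_{i=1}^n r_{ij} + \alpha - 1}{n + k(\alpha - 1)}, $$
which recovers the MLE update when $\alpha = 1$.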
EDIT: Since it is clear you are more interested in a working example than in really understanding any of this, here is Python code that will run EM for a basic GMM.
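(The `em` module below is a minimal sketch of what such code might look like, using only NumPy; the `max_iter`, `tol` and `seed` arguments are my own conventions, and since EM starts from a random initialization, the numbers you get can differ from the session shown.)

```python
import numpy as np

def em(x, k, max_iter=1000, tol=1e-6, seed=None):
    """Fit a k-component 1-D Gaussian mixture to x with EM.

    Returns (mixing weights, means, standard deviations) as arrays.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)

    # Random initialization: uniform weights, means drawn from the data,
    # variances set to the overall sample variance.
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[i, j] = Pr[z_i = j | x_i, theta]
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        joint = pi * dens
        r = joint / joint.sum(axis=1, keepdims=True)

        # Observed-data log-likelihood; EM guarantees it never decreases.
        ll = np.log(joint.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: re-estimate weights, means, and variances.
        n_j = r.sum(axis=0)
        pi = n_j / n
        mu = (r * x[:, None]).sum(axis=0) / n_j
        # Weighted variance around the new means (equivalent to the E[x^2] - mu^2 form above)
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_j

    return pi, mu, np.sqrt(var)
```

Running this in ipython gives: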
```
In [1]: import em

In [2]: x = (1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23)

In [3]: em.em(x, 2)
Out[3]:
(array([ 0.49645528,  0.50354472]),
 array([ 15.89753149,   5.10207907]),
 array([ 3.8611819 ,  2.65861623]))
```
where the final output consists of the mixing weights, the mean of each Gaussian, and the standard deviation of each Gaussian.