Maximum Likelihood Estimation of Dirichlet Mean

Question

Consider the problem of computing a Maximum-Likelihood estimate of the parameters to a finite Dirichlet distribution, given a set of multinomial observations (probability vectors) assumed to have been sampled from a Dirichlet. The following paper provides an iterative fixed-point algorithm to estimate the mean and precision of the Dirichlet separately:

Minka, Thomas. Estimating a Dirichlet distribution. (2000): 3.

The algorithm for estimating the mean $\mathbf{m}$, given fixed precision $s$, is summarized as follows:

Estimate the full concentration parameters $\pmb{\alpha}=s\mathbf{m}$ by inverting the digamma function.
$\forall\, k$, set $m_k^{new} = \frac{\alpha_k}{\sum_j \alpha_j}$.
Repeat until convergence.

Why must we resort to an iterative algorithm to find the mean? Is the average of our observed data vectors not an accurate estimate of the mean? Further, is it also not true that the expected value of the mean of a set of samples from a Dirichlet is the mean of the Dirichlet itself?

Any insight here would be appreciated!

That's a strange question: if you are looking for the MLE, the answer does not have to be the moment estimator. — Xi'an
– Xi'an, Commented Dec 26, 2014 at 8:28

guy · Accepted Answer · 2014-12-26 00:29:56Z

Suppose $\mathbf p_1, \ldots, \mathbf p_n$ are iid $\operatorname{Dirichlet}(s \mathbf m)$. If I'm understanding you correctly, your question is "why use an iterative scheme when $\hat {\mathbf m} = \frac 1 N \sum_{i = 1} ^ N \mathbf p_i$ works?" You are correct that this is a reasonable estimator. But it isn't the maximum likelihood estimator, which is what we care about! The Dirichlet likelihood is $$ L_i(\pmb \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k\Gamma(\alpha_k)} \prod_k p_{ik}^{\alpha_k - 1} $$ so our goal is to maximize $\prod_i L_i (\pmb \alpha)$ in $\pmb \alpha$; once we do this, we can get the maximum likelihood estimate of $\mathbf m$ by normalizing. But it is easy to see that the likelihood is a function of $\frac 1 N \sum_i \log \mathbf p_i$ rather than $\frac 1 N \sum_i \mathbf p_i$ (I'm using $\log$ elementwise here). In some sense, we might think of $\log \mathbf p_i$ as the "appropriate scale" of the data - at least, for the Dirichlet distribution - rather than the untransformed $\mathbf p_i$.

So, we believe that the MLE is not $\frac 1 N \sum_i \mathbf p_i$ but rather is some complicated function of $\frac 1 N \sum_i \log \mathbf p_i$. The question now becomes "why use the MLE rather than the easy estimator?" Well, we have some theorems which say the MLE has certain optimality properties. So, we get a more efficient estimator with the MLE, although $\frac 1 N \sum_i \mathbf p_i$ may still be useful as a starting point for the iterative algorithm. Now, I'm not sure how good the MLE really is here, considering that the data must actually be Dirichlet distributed for it to work, whereas $\frac 1 N \sum \mathbf p_i$ is consistent no matter what. But that is another story.

So, (iirc) we know from "some theorems" that the MLE is consistent, and so will asymptotically converge in probability to the "true" parameter. Can we say the same about the simple estimator $\hat{\mathbf{m}}$ as well, i.e. will the two estimators converge to the same value? Will MLE converge to sample mean? Or neither? — Benjamin Bray
– Benjamin Bray, Commented Dec 26, 2014 at 1:33
If the data truly are Dirichlet distributed, then the both $\hat {\mathbf m}$ and the MLE $\hat {\mathbf m}_{MLE}$ will converge to the correct answer. If the data are not Dirichlet distributed then $\hat {\mathbf m}$ will converge to the correct answer, while the MLE of $\pmb \alpha$ (say $\hat {\pmb \alpha}$) will converge to the value $\pmb \alpha_0$ which corresponds to the Dirichlet distribution closest (in the sense of KL-Divergence) to the true distribution. But the best Dirichlet approximation need not have the same mean as the true distribution, so it might be inconsistent. — guy
– guy, Commented Dec 26, 2014 at 1:40
(...) Also it is worth mentioning that when the data are Dirichlet distributed the MLE will be more efficient. The MLE is the asymptotically "optimal" estimate of the parameter of interest (for a suitable definition of "optimal"). — guy
– guy, Commented Dec 26, 2014 at 1:44

Stack Exchange Network

Maximum Likelihood Estimation of Dirichlet Mean

1 Answer 1

Hot Network Questions

Maximum Likelihood Estimation of Dirichlet Mean

1 Answer 1

Related

Hot Network Questions