1
$\begingroup$

I am learning about Assumed Density Filtering (ADF) and Expectation Propagation in the context of Bayesian deep neural networks. I have seen in some textbooks and papers that ADF is also called moment matching. Is it called moment matching because, when we minimize the KL divergence, the solution turns out to make the moments of the two distributions match? Is that the reason behind the name?

$\endgroup$

3 Answers

2
$\begingroup$

Hope it's not too late...
I also wondered about this when working on my thesis.
In a nutshell, this is true for any approximating distribution that belongs to the exponential family of distributions.

Recall that in EP we are repeatedly performing the assumed density filtering algorithm (ADF) until convergence. The moment matching happens at each ADF iteration, i.e. when solving:

$$ \min_{\xi} D_{KL}\left( \hat{p}_{i}(\theta|\xi_{i-1})||q(\theta|\xi) \right)$$

where $\hat{p}_{i}$ is what Minka calls the exact posterior, and $q$ is the approximating posterior, which is in general assumed to belong to the (non-curved) exponential family of distributions. Hence: $$ q(\theta|\xi) = \exp\left( \xi^{T}T(\theta)+S(\theta)-A(\xi) \right)$$
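To make this concrete (a standard example, not specific to Minka's setup): a univariate Gaussian $\mathcal{N}(\mu,\sigma^{2})$ fits this form with sufficient statistic $T(\theta)=(\theta,\theta^{2})^{T}$ and natural parameter $\xi=\left(\mu/\sigma^{2},\,-1/(2\sigma^{2})\right)^{T}$, so matching $E\left[T(\theta)\right]$ means matching $E[\theta]$ and $E[\theta^{2}]$, i.e. the mean and the variance.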

We can easily show that $D_{KL}$ in the optimization problem is convex w.r.t. $\xi$, so the first-order condition (i.e. equating the derivative to zero) is sufficient for optimality. If you do the math (we also need to assume we can interchange differentiation and integration), the first-order condition is:

$$ {E}_{\hat{p}_i}\left[ T(\theta) \right] = \frac{\partial}{\partial \xi}A(\xi)$$
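(Sketch of that math, assuming differentiation under the integral sign is allowed: plugging the exponential-family form of $q$ into the KL divergence gives
$$ D_{KL}\left(\hat{p}_{i}\,\|\,q\right) = E_{\hat{p}_{i}}\left[\log\hat{p}_{i}(\theta)\right] - \xi^{T}E_{\hat{p}_{i}}\left[T(\theta)\right] - E_{\hat{p}_{i}}\left[S(\theta)\right] + A(\xi), $$
and only the terms $-\xi^{T}E_{\hat{p}_{i}}\left[T(\theta)\right] + A(\xi)$ depend on $\xi$; setting the gradient with respect to $\xi$ to zero yields exactly the condition above.)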

To see why the last equation is interpreted as moment matching, notice that for a distribution from the non-curved exponential family, the first derivative of the log-partition function $A(\xi)$ is the expectation of the sufficient statistic $T(\theta)$. So the first-order condition can be written as:

$$ {E}_{\hat{p}_i}\left[ T(\theta) \right] = E_{q}\left[T(\theta) \right]$$

This is interpreted as matching the expectation of the sufficient statistic (i.e. the moments) between the exact posterior and the approximating posterior.
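If you want to see this numerically, here is a small sketch (my own illustration; the probit factor and the NumPy/SciPy quadrature are just convenient assumptions, not part of the original argument). It builds a non-Gaussian one-step posterior, computes its first two moments, and checks that the moment-matched Gaussian has a smaller $D_{KL}(\hat{p}\,\|\,q)$ than nearby Gaussians whose moments do not match:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Unnormalised "exact" one-step posterior: Gaussian prior N(0, 1) times a probit
# likelihood factor Phi(2 * theta).  This is the kind of non-Gaussian tilted
# distribution a single ADF/EP step has to approximate.
def unnorm_p(theta):
    return norm.pdf(theta, 0.0, 1.0) * norm.cdf(2.0 * theta)

# Normalising constant and first two moments of the tilted distribution (by quadrature).
Z, _ = integrate.quad(unnorm_p, -10, 10)
mean, _ = integrate.quad(lambda t: t * unnorm_p(t) / Z, -10, 10)
second, _ = integrate.quad(lambda t: t ** 2 * unnorm_p(t) / Z, -10, 10)
var = second - mean ** 2

def kl_p_q(mu, sigma):
    """KL( p || N(mu, sigma^2) ), computed numerically."""
    def integrand(t):
        p = unnorm_p(t) / Z
        return p * (np.log(p) - norm.logpdf(t, mu, sigma))
    val, _ = integrate.quad(integrand, -10, 10)
    return val

# The moment-matched Gaussian ...
print("matched  :", (mean, np.sqrt(var)), "KL =", kl_p_q(mean, np.sqrt(var)))

# ... versus nearby Gaussians whose moments do not match.
for dmu, dsig in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    mu_c, sig_c = mean + dmu, np.sqrt(var) + dsig
    print("perturbed:", (mu_c, sig_c), "KL =", kl_p_q(mu_c, sig_c))
```

The matched Gaussian should come out with the smallest KL of the candidates printed, which is what the first-order condition above predicts.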

$\endgroup$
0
$\begingroup$

Assumed density filtering approximates the Bayesian posterior by a distribution of a form we choose in advance. If we assume the posterior to be Gaussian, the method is called Gaussian density filtering.

So to approximate it, we only need to compute the mean and variance of the resulting exact one-step posterior and use them as the parameters of the Gaussian posterior; hence the name moment matching.
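As a rough sketch of what that looks like in code (an illustration under assumed names and an assumed probit likelihood, not the procedure from any particular paper): process observations one at a time, and after each observation replace the non-Gaussian one-step posterior by the Gaussian with the same mean and variance.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def gaussian_adf_step(mu, var, loglik):
    """One Gaussian ADF step: form the tilted posterior
    p(theta) proportional to N(theta; mu, var) * exp(loglik(theta)),
    then return its mean and variance as the new Gaussian parameters."""
    sd = np.sqrt(var)

    def tilted(theta):
        return norm.pdf(theta, mu, sd) * np.exp(loglik(theta))

    lo, hi = mu - 10 * sd, mu + 10 * sd
    Z, _ = integrate.quad(tilted, lo, hi)
    m1, _ = integrate.quad(lambda t: t * tilted(t) / Z, lo, hi)
    m2, _ = integrate.quad(lambda t: t ** 2 * tilted(t) / Z, lo, hi)
    return m1, m2 - m1 ** 2   # moment matching: new mean and variance

# Example: absorb Bernoulli observations one by one with a probit likelihood
# p(y | theta) = Phi(y * theta), y in {-1, +1}  (an illustrative choice).
mu, var = 0.0, 4.0   # prior N(0, 4)
for y in (+1, +1, -1, +1):
    mu, var = gaussian_adf_step(mu, var, lambda t, y=y: norm.logcdf(y * t))
    print(f"after y={y:+d}:  mu={mu:.3f}  var={var:.3f}")
```

Each call performs one such Gaussian projection, so after the loop `mu` and `var` parametrize the assumed Gaussian posterior given all observations seen so far (approximately, since each projection discards the non-Gaussian detail).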


Gaussian density filtering computes the exact one-step posterior $p(\theta)$ and then approximates it by a Gaussian distribution $q^{new}(\theta)$. So essentially we want the $q^{new}(\theta)$ that is closest to $p(\theta)$ in KL divergence, which for a Gaussian amounts to matching its mean and variance to those of $p(\theta)$.

A more detailed explanation of this can be found in T. Minka's thesis: https://tminka.github.io/papers/ep/minka-thesis.pdf

To see one application of assumed density filtering, you can read the article Understanding the Working of Multi-Armed Bandit in VWO.

$\endgroup$
2
  • $\begingroup$ Thank you for your answer. So you mean that the matching of the moments only occurs when the assumed density is Gaussian? What if the assumed density is a Laplace, for example? In that case, after minimizing the KL divergence, will we again get a matching of the moments? $\endgroup$ Commented Jul 2, 2021 at 6:37
  • 1
    $\begingroup$ So the broad idea is: because the true posterior is intractable, we assume a posterior distribution and try to estimate its parameters such that our assumed density is close to the true posterior. In the Gaussian case, this is done simply by matching the mean and variance. I believe the same idea should work for other distributions such as the Laplace, provided we can recover their parameters from those matched moments. $\endgroup$ Commented Jul 3, 2021 at 6:12
0
$\begingroup$

The reason this is called moment matching is that the prior and approximate posterior are generally taken to be in the exponential family, the parameters of which can be estimated by matching the appropriate moments. See Zhou, E., Fu, M. C., & Marcus, S. I. (2010). Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control, 55(5), 1101–1116. There are many names for this approach, which has been rediscovered in numerous fields.

$\endgroup$
