
Consider a Jeffreys prior where $p(\theta) \propto \sqrt{|i(\theta)|}$, where $i$ is the Fisher information.

I keep seeing this prior mentioned as an uninformative prior, but I have never seen an argument for why it is uninformative. After all, it is not a constant prior, so there must be some other argument.

I understand that it is invariant under reparametrization, which brings me to my next question: is it that the determinant of the Fisher information does not depend on the reparametrization? Because the Fisher information certainly depends on the parametrization of the problem.

Thanks.

Comments:

  • Have you read the Wikipedia article? en.wikipedia.org/wiki/Jeffreys_prior (Commented Feb 22, 2011 at 23:12)
  • Yes, I had looked there. Perhaps I am missing something, but I do not feel that the Wikipedia article gives an adequate answer to my questions. (Commented Feb 22, 2011 at 23:23)
  • See also stats.stackexchange.com/questions/38962/… (Commented Oct 10, 2012 at 11:34)
  • Note that the Jeffreys prior is not invariant with respect to equivalent models. For example, inference about a parameter $p$ differs between binomial and negative binomial sampling distributions, even though the likelihood functions are proportional and the parameter has the same meaning in both models. (Commented Oct 11, 2012 at 6:41)
  • In an important sense, there is no such thing as a "constant" prior unless it is a discrete uniform distribution. There is such a thing as a density that is constant with respect to a measure, but the same density is not constant with respect to other measures. (Commented Oct 31 at 13:57)

6 Answers

Answer (score 18)

It's considered noninformative because of the parameterization invariance. You seem to have the impression that a uniform (constant) prior is noninformative. Sometimes it is, sometimes it isn't.

What happens with Jeffreys' prior under a transformation is that the Jacobian from the transformation gets sucked into the original Fisher information, which ends up giving you the Fisher information under the new parameterization. No magic (in the mechanics at least), just a little calculus and linear algebra.
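To make that bookkeeping concrete, here is a small sketch in Python for the Bernoulli model (all function names here are mine, not from any library): it checks that pushing the $\theta$-scale Jeffreys prior through the Jacobian to the log-odds scale reproduces the square root of the Fisher information on that scale.

```python
import math

# Bernoulli model: Fisher information in theta is 1/(theta*(1-theta)),
# so the Jeffreys prior is p(theta) ∝ theta^(-1/2) * (1-theta)^(-1/2).
def jeffreys_theta(theta):
    return 1.0 / math.sqrt(theta * (1.0 - theta))

# Reparameterize to log-odds phi = log(theta/(1-theta)); then
# dtheta/dphi = theta*(1-theta), and the Fisher information in phi
# is theta*(1-theta), so sqrt(I(phi)) = sqrt(theta*(1-theta)).
def jeffreys_phi_direct(theta):
    return math.sqrt(theta * (1.0 - theta))

# Alternatively, transform the theta-scale prior density by the Jacobian:
# p(phi) = p(theta) * |dtheta/dphi|.
def jeffreys_phi_via_jacobian(theta):
    return jeffreys_theta(theta) * theta * (1.0 - theta)

# The two routes agree: the Jacobian gets absorbed into the Fisher information.
for theta in (0.1, 0.3, 0.5, 0.9):
    assert math.isclose(jeffreys_phi_direct(theta),
                        jeffreys_phi_via_jacobian(theta))
```

In other words, transforming the density and recomputing the Fisher information on the new scale give the same prior, which is exactly the invariance claimed above.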

Comments:

  • I disagree with this answer. Using a subjective prior is also a parametrization-invariant procedure! (Commented Oct 10, 2012 at 11:26)
  • Stephane, would you mind confirming my understanding based on your comment? I think what you mean is that, having defined a prior subjectively, you can still transform back and forth between parameterizations using Jacobians. So the "invariant" thing about Jeffreys priors is that, starting from any parameterization, you obtain a prior that is automatically related to the others by the "correct" rule (i.e. Jacobians), because of the way Fisher information matrices transform. Since no parameterization is "special" during prior construction, the prior is considered uninformative. Correct? (Commented Feb 5, 2024 at 21:29)
Answer (score 44)

The Jeffreys prior coincides with the Bernardo reference prior for a one-dimensional parameter space (and "regular" models). Roughly speaking, this is the prior for which the expected Kullback-Leibler divergence between the prior and the posterior is maximal. This quantity represents the amount of information brought by the data, which is why the prior is considered to be uninformative: it is the one under which the data bring the maximal amount of information.

By the way, I don't know whether Jeffreys was aware of this characterization of his prior.

Comments:

  • "Roughly speaking, this is the prior for which the Kullback-Leibler divergence between the prior and the posterior is maximal." Interesting, I did not know that. (Commented Oct 10, 2012 at 14:57)
  • (+1) Good answer. It would be nice to see some references for some of your points (e.g. 1, 2). (Commented Oct 10, 2012 at 18:10)
  • @Procrastinator I am currently writing a new post about noninformative priors ;) Please wait, perhaps a few days. (Commented Oct 10, 2012 at 18:14)
  • @StéphaneLaurent did you ever write that post? (Commented Dec 30, 2019 at 17:44)
  • @StéphaneLaurent I know it's been a decade but that post would still be very helpful :-) (Commented Sep 5, 2022 at 15:48)
Answer (score 8)

This is an old but interesting topic. I recently thought about this and developed a take that I would like to share.

First off, the problem with flat priors as uninformative priors is that this idea is rooted in the way we would guess a number, not in the way the data guess a number in likelihood-based inference.

We can understand this by comparing two binomial random variables: \begin{eqnarray} X &\sim& Bi(x\mid n=10,\theta=.5)\\ Y &\sim& Bi(y\mid n=10,\theta=.9) \end{eqnarray} Clearly, $E[X]=5$ and $E[Y]=9.$

The likelihood of finding $E[Y]=9$ under the distribution of $X$ is $\approx 0.01$, while the likelihood of finding $E[X]=5$ under the distribution of $Y$ is $\approx 0.0015$.
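These two probabilities are easy to reproduce; here is a quick sketch in Python (the helper `binom_pmf` is defined here, not assumed from any library):

```python
from math import comb

def binom_pmf(x, n, theta):
    # P(X = x) for X ~ Binomial(n, theta)
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

# likelihood of observing E[Y] = 9 under X ~ Bi(n=10, theta=.5)
p_x_eq_9 = binom_pmf(9, 10, 0.5)   # ≈ 0.0098
# likelihood of observing E[X] = 5 under Y ~ Bi(n=10, theta=.9)
p_y_eq_5 = binom_pmf(5, 10, 0.9)   # ≈ 0.0015
```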

This fact is independent of the parametrization (odds, log odds). For example, if $\phi$ denotes the odds, sample odds of $.9/.1=9$ do not give as much evidence against $H_0: \phi=1$ as sample odds of $1$ give against $H_0: \phi=9$.

Hence, finding $x=5$ is much better at excluding $H_0:\theta=.9$ than finding $x=9$ is at excluding $H_0:\theta=.5$ (considering just one random variable $X \sim Bi(x\mid n,\theta)$ from now on). More generally, intermediate values of $x$ exclude extreme values of $\theta$ very well, but extreme values of $x$ do not exclude intermediate values of $\theta$ so well. This notion is formalized by Fisher's information, which is minus the expected curvature of the log likelihood at a given $\theta$. The expected curvature of the binomial log likelihood equals \begin{equation} \frac{-n}{\theta(1-\theta)}. \end{equation} Referring back to the example above, it is readily verified that the curvature equals $-4n$ at $\theta=.5$, but $\approx-11n$ at $\theta=.9$. More curvature means that fewer values of $X$ are compatible with that value of $\theta$, so it is easier to find evidence against that value of $\theta$ (implying lower posterior density in the Bayesian setting).
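A quick numerical check of those curvature values (a sketch; `expected_curvature` is just the closed form above):

```python
def expected_curvature(n, theta):
    # expected second derivative of the binomial log likelihood: -n / (theta*(1-theta))
    return -n / (theta * (1 - theta))

# at theta = .5 the log likelihood is comparatively flat ...
assert expected_curvature(1, 0.5) == -4.0
# ... while at theta = .9 it is far more sharply curved (about -11.1 * n)
assert abs(expected_curvature(1, 0.9) - (-100 / 9)) < 1e-9
```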

Fisher's information is different for different parameterizations, but that's because it is on a different scale: the curvature may be different, but so is the distance between points. The net result is invariance under a transformation.

The key point of using Jeffreys' prior, then, seems to be this: if we do not want to help the data make its decision, we should give less weight to points that are hard to find evidence against, and more weight to points that are easy to find evidence against (e.g., it would be unjust to give a lot of weight to $\theta=0.5$, because it is hard to exclude this point from the posterior anyway). We do so by taking the prior density proportional to the square root of the Fisher information, which measures the size of the expected curvature of the log likelihood on the $\theta$ scale.

In the Binomial case, this gives a Beta distribution with parameters $.5$ and $.5.$ This distribution gives less weight to intermediate values of $\theta$ (values close to $0.5,$ which are hard to throw out of the posterior anyway) and more weight to extreme values of $\theta$ (values close to $0$ or $1,$ which are easy to throw out of the posterior).
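The Beta$(.5,.5)$ density has a simple closed form (its normalizing constant is $\pi$), which makes the U-shape easy to verify; a small sketch:

```python
from math import pi, sqrt

def beta_half_half(theta):
    # Beta(1/2, 1/2) density: 1 / (pi * sqrt(theta * (1 - theta)))
    return 1.0 / (pi * sqrt(theta * (1.0 - theta)))

# more prior mass near the easy-to-exclude extremes than near theta = .5
assert beta_half_half(0.05) > beta_half_half(0.5)
```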

From here, I see two ways forward. The first is to reject the notion of uninformative priors altogether, because the Bayesian posterior is still different from the frequentist likelihood. The second is to say that by using the Jeffreys prior we finally have a method under which all values of $\theta$ are equally likely before we have seen the data (under frequentist likelihood-based inference, they are not). Reading Jeffreys' 1946 paper, it seems to be all about invariance under transformations. I can see how that is a necessary condition for a prior to be uninformative, but I'm not sure about its sufficiency. I'm not aware of Jeffreys wishing to correct a deficiency of likelihood-based frequentist inference (granted, I haven't looked very hard), but that does seem to be the corollary. Take your pick.

Comments:

  • Great answer! I was trying to go down that same line of thought using the Gaussian distribution prior (en.wikipedia.org/wiki/…) and your post helped me out immensely. (Commented Oct 8, 2020 at 23:12)
  • Great, @thc! I corrected some minor mistakes and added a line about necessity versus sufficiency. (Commented Oct 21, 2020 at 18:36)
  • Very interesting post. But how do you make precise the claim "by using the Jeffreys prior we finally have a method under which all values of $\theta$ are equally likely before we have seen the data"? Thanks. (Commented Oct 9, 2021 at 6:20)
Answer (score 7)

I'd say it isn't absolutely non-informative, but minimally informative. It encodes the (rather weak) prior knowledge that your prior state of knowledge doesn't depend on the parameterisation (e.g. the units of measurement). If your prior state of knowledge were precisely zero, you wouldn't even know that your prior should be invariant to such transformations.

Comments:

  • I am confused. In what sort of case would you know that your prior should depend on the model parameterization? (Commented Apr 30, 2013 at 19:42)
  • If we want to predict longevity as a function of body weight, using a GLM, we know that the conclusion should not be affected by whether we weigh the subjects in kg or lb; if you use a simple uniform prior over the weights, you might get a different outcome depending on the units of measurement. (Commented May 1, 2013 at 10:08)
  • That's a case where you know that it shouldn't be affected. What is a case where it should? (Commented May 3, 2013 at 17:21)
  • I think you are missing my point. Say we don't know anything about the attributes, not even that they have units of measurement to which the analysis should be invariant. In that case your prior would encode less information about the problem than the Jeffreys prior; hence the Jeffreys prior is not completely uninformative. There may or may not be situations where the analysis should not be invariant to some transformation, but that is beside the point. (Commented May 3, 2013 at 17:51)
  • N.B. According to the BUGS book (p. 83), Jeffreys himself referred to such transformation-invariant priors as "minimally informative", which implies that he saw them as encoding some information about the problem. (Commented May 3, 2013 at 18:06)
Answer (score 1)

Instead of spending time thinking about transformations and non-informative priors, I prefer to think of having "no prior". This stems from the fact that in Bayesian posterior inference the log posterior is the sum of the log likelihood and the log prior. If there is no prior (e.g., you are using Stan and simply omit any prior statement), the log posterior comes solely from the log likelihood. With Bayes, information from the data and information from the prior are completely interchangeable, and the posterior distribution can be derived from a data-augmentation argument. Not having prior data is equivalent to not having a prior.

Answer (score 0)

Ref: "Weighing the odds" by Williams.

Consider a random variable ${ [X \vert \Theta = \theta] \sim f(x \vert \Theta = \theta) }$ and a parameter ${ \Theta \sim \pi _{\Theta} (\theta) . }$

Suppose the densities ${ f(x \vert \Theta = \theta) }$ are fixed.

Suppose we are free to choose ${ \pi _{\Theta} (\theta) . }$

Note that the Fisher information of ${ \Theta }$ is

$${ I _{\Theta} (\theta) = \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \theta} \ln f(x \vert \Theta = \theta) \right) ^2 f ( x \vert \Theta = \theta) \, dx . }$$

Note that ${ I _{\Theta} (\theta) }$ might vary with ${ \theta . }$

The goal is to find a reparameterisation ${ \alpha }$ such that

$${ \text{Want:} \quad I _{\alpha(\Theta)} (\alpha(\theta)) \, \, \text{ doesn't vary with } \alpha(\theta) . }$$

Let ${ \alpha }$ be such a reparameterisation.

Note that in general

$${ {\begin{aligned} &\, I _{\alpha(\Theta)}(\alpha(\theta)) \\ = &\, \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \alpha(\theta)} \ln f(x \vert \alpha(\Theta) = \alpha(\theta)) \right) ^2 f ( x \vert \alpha(\Theta) = \alpha(\theta)) \, dx \\ = &\, \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^{-2} \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \theta} \ln f(x \vert \Theta = \theta) \right) ^2 f ( x \vert \Theta = \theta) \, dx \\ = &\, \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^{-2} I _{\Theta} (\theta) . \end{aligned}} }$$

Hence in general

$${ \boxed{ I _{\alpha(\Theta)}(\alpha(\theta)) \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^2 = I _{\Theta} (\theta) } . }$$

Hence

$${ \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) \propto \sqrt{I _{\Theta} (\theta) } . }$$

How is the new parameter ${ \alpha (\Theta) }$ distributed?

Note that in general

$${ \boxed{{\begin{aligned} &\, \pi _{\alpha(\Theta)} (\alpha(\theta)) \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) = \pi _{\Theta} (\theta) \end{aligned}}} . }$$

Suppose we further want

$${ \text{Want:} \quad \pi _{\alpha(\Theta)} (\alpha(\theta)) \, \, \text{ doesn't change with } \alpha(\theta) . }$$

Hence setting

$${ \boxed{\pi _{\Theta} (\theta) \propto \sqrt{I _{\Theta} (\theta)}} }$$

and

$${ \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) \propto \sqrt{I _{\Theta} (\theta) } }$$

will do.

We call such a ${ \pi _{\Theta} (\theta) }$ a Jeffreys prior.

Hence, intuitively, the Jeffreys prior is the only distribution of ${ \Theta }$ for which there exists a reparameterisation ${ \alpha(\Theta) }$ that has both a uniform distribution and constant Fisher information.
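For the binomial model this uniformising reparameterisation can be written down explicitly: with ${ I _{\Theta} (\theta) = n / (\theta(1-\theta)) }$, taking ${ \alpha(\theta) = \arcsin \sqrt{\theta} }$ gives ${ \partial \alpha(\theta) / \partial \theta \propto \sqrt{I _{\Theta} (\theta)} }$, so the identity above makes the Fisher information in ${ \alpha }$ constant. A quick numerical sketch in Python (the function names are mine):

```python
from math import sqrt

n = 10

def fisher_theta(theta):
    # binomial Fisher information in theta
    return n / (theta * (1 - theta))

def alpha_prime(theta):
    # derivative of alpha(theta) = arcsin(sqrt(theta))
    return 1.0 / (2.0 * sqrt(theta * (1.0 - theta)))

# I_alpha = I_theta / (d alpha / d theta)^2 should equal the constant 4n
for theta in (0.1, 0.25, 0.5, 0.8):
    assert abs(fisher_theta(theta) / alpha_prime(theta) ** 2 - 4 * n) < 1e-9
```

This is the classical arcsine (variance-stabilising) transformation for the binomial parameter.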

Comments:

  • That doesn't mean that it is completely uninformative, though. Jeffreys himself considered his prior minimally informative. (Commented Oct 31 at 14:01)