
Consider a Jeffreys prior where $p(\theta) \propto \sqrt{|i(\theta)|}$, where $i$ is the Fisher information.

I keep seeing this prior mentioned as an uninformative prior, but I have never seen an argument for why it is uninformative. After all, it is not a constant prior, so there must be some other argument.

I understand that it is invariant under reparametrization, which brings me to my next question: is it that the determinant of the Fisher information does not depend on the reparametrization? Because the Fisher information certainly depends on the parametrization of the problem.

Thanks.

Comments:

  • Have you read the Wikipedia article? en.wikipedia.org/wiki/Jeffreys_prior (Commented Feb 22, 2011 at 23:12)
  • Yes, I had looked there. Perhaps I am missing something, but I do not feel that the Wikipedia article gives an adequate answer to my questions. (Commented Feb 22, 2011 at 23:23)
  • See also stats.stackexchange.com/questions/38962/… (Commented Oct 10, 2012 at 11:34)
  • Note that the Jeffreys prior is not invariant with respect to equivalent models. For example, inference about a parameter $p$ differs between binomial and negative binomial sampling distributions, even though the likelihood functions are proportional and the parameter has the same meaning in both models. (Commented Oct 11, 2012 at 6:41)
  • In an important sense, there is no such thing as a "constant" prior unless it is a discrete uniform distribution. There is such a thing as a density that is constant with respect to a measure, but the same density is not constant with respect to other measures. (Commented Oct 31 at 13:57)

6 Answers

Answer (score 18)

It's considered noninformative because of the parameterization invariance. You seem to have the impression that a uniform (constant) prior is noninformative. Sometimes it is, sometimes it isn't.

What happens with Jeffreys' prior under a transformation is that the Jacobian from the transformation gets sucked into the original Fisher information, which ends up giving you the Fisher information under the new parameterization. No magic (in the mechanics at least), just a little calculus and linear algebra.
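To make that bookkeeping concrete, here is a small sketch in Python for the Bernoulli model (all function names here are mine, not from any library): it checks that pushing the $\theta$-scale Jeffreys prior through the Jacobian to the log-odds scale reproduces the square root of the Fisher information on that scale.

```python
import math

# Bernoulli model: Fisher information in theta is 1/(theta*(1-theta)),
# so the Jeffreys prior is p(theta) ∝ theta^(-1/2) * (1-theta)^(-1/2).
def jeffreys_theta(theta):
    return 1.0 / math.sqrt(theta * (1.0 - theta))

# Reparameterize to log-odds phi = log(theta/(1-theta)); then
# dtheta/dphi = theta*(1-theta), and the Fisher information in phi
# is theta*(1-theta), so sqrt(I(phi)) = sqrt(theta*(1-theta)).
def jeffreys_phi_direct(theta):
    return math.sqrt(theta * (1.0 - theta))

# Alternatively, transform the theta-scale prior density by the Jacobian:
# p(phi) = p(theta) * |dtheta/dphi|.
def jeffreys_phi_via_jacobian(theta):
    return jeffreys_theta(theta) * theta * (1.0 - theta)

# The two routes agree: the Jacobian gets absorbed into the Fisher information.
for theta in (0.1, 0.3, 0.5, 0.9):
    assert math.isclose(jeffreys_phi_direct(theta),
                        jeffreys_phi_via_jacobian(theta))
```

In other words, transforming the density and recomputing the Fisher information on the new scale give the same prior, which is exactly the invariance claimed above.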

Comments:

  • I disagree with this answer. Using a subjective prior is also a parametrization-invariant procedure! (Commented Oct 10, 2012 at 11:26)
  • Stephane, would you mind confirming my understanding based on your comment? I think what you mean is that, having defined a prior subjectively, you can still transform back and forth between parameterizations using Jacobians. So the "invariant" thing about Jeffreys priors is that, starting from any parameterization, you obtain a prior that is automatically related to the others by the "correct" rule (i.e. Jacobians), because of the way Fisher information matrices transform. Since no parameterization is "special" during prior construction, the prior is considered uninformative. Correct? (Commented Feb 5, 2024 at 21:29)
Answer (score 44)

The Jeffreys prior coincides with the Bernardo reference prior for a one-dimensional parameter space (and "regular" models). Roughly speaking, this is the prior for which the expected Kullback-Leibler divergence between the prior and the posterior is maximal. This quantity represents the amount of information brought by the data, which is why the prior is considered to be uninformative: it is the one under which the data bring the maximal amount of information.

By the way, I don't know whether Jeffreys was aware of this characterization of his prior.

Comments:

  • "Roughly speaking, this is the prior for which the Kullback-Leibler divergence between the prior and the posterior is maximal." Interesting, I did not know that. (Commented Oct 10, 2012 at 14:57)
  • (+1) Good answer. It would be nice to see some references for some of your points (e.g. 1, 2). (Commented Oct 10, 2012 at 18:10)
  • @Procrastinator I am currently writing a new post about noninformative priors ;) Please wait, perhaps a few days. (Commented Oct 10, 2012 at 18:14)
  • @StéphaneLaurent did you ever write that post? (Commented Dec 30, 2019 at 17:44)
  • @StéphaneLaurent I know it's been a decade but that post would still be very helpful :-) (Commented Sep 5, 2022 at 15:48)
Answer (score 8)

This is an old but interesting topic. I recently thought about this and developed a take that I would like to share.

First off, the problem with flat priors as uninformative priors is that this idea is rooted in the way we would guess a number, not in the way the data guess a number in likelihood-based inference.

We can understand this by comparing two binomial random variables: \begin{eqnarray} X &\sim& Bi(x\mid n=10,\theta=.5)\\ Y &\sim& Bi(y\mid n=10,\theta=.9) \end{eqnarray} Clearly, $E[X]=5$ and $E[Y]=9.$

The likelihood of finding $E[Y]=9$ under the distribution of $X$ is $\approx 0.01$, while the likelihood of finding $E[X]=5$ under the distribution of $Y$ is $\approx 0.0015$.
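These two probabilities are easy to reproduce; here is a quick sketch in Python (the helper `binom_pmf` is defined here, not assumed from any library):

```python
from math import comb

def binom_pmf(x, n, theta):
    # P(X = x) for X ~ Binomial(n, theta)
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

# likelihood of observing E[Y] = 9 under X ~ Bi(n=10, theta=.5)
p_x_eq_9 = binom_pmf(9, 10, 0.5)   # ≈ 0.0098
# likelihood of observing E[X] = 5 under Y ~ Bi(n=10, theta=.9)
p_y_eq_5 = binom_pmf(5, 10, 0.9)   # ≈ 0.0015
```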

This fact is independent of the parametrization (odds, log odds). For example, if $\phi$ denotes the odds, sample odds of $.9/.1=9$ do not give as much evidence against $H_0: \phi=1$ as sample odds of $1$ give against $H_0: \phi=9$.

Hence, finding $x=5$ is much better at excluding $H_0:\theta=.9$ than finding $x=9$ is at excluding $H_0:\theta=.5$ (considering just one random variable $X \sim Bi(x\mid n,\theta)$ from now on). More generally, intermediate values of $x$ exclude extreme values of $\theta$ very well, but extreme values of $x$ do not exclude intermediate values of $\theta$ so well. This notion is formalized by Fisher's information, which is minus the expected curvature of the log likelihood at a given $\theta$. The expected curvature of the binomial log likelihood equals \begin{equation} \frac{-n}{\theta(1-\theta)}. \end{equation} Referring back to the example above, it is readily verified that the curvature equals $-4n$ at $\theta=.5$, but $\approx-11n$ at $\theta=.9$. More curvature means that fewer values of $X$ are compatible with that value of $\theta$, so it is easier to find evidence against that value of $\theta$ (implying lower posterior density in the Bayesian setting).
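A quick numerical check of those curvature values (a sketch; `expected_curvature` is just the closed form above):

```python
def expected_curvature(n, theta):
    # expected second derivative of the binomial log likelihood: -n / (theta*(1-theta))
    return -n / (theta * (1 - theta))

# at theta = .5 the log likelihood is comparatively flat ...
assert expected_curvature(1, 0.5) == -4.0
# ... while at theta = .9 it is far more sharply curved (about -11.1 * n)
assert abs(expected_curvature(1, 0.9) - (-100 / 9)) < 1e-9
```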

Fisher's information is different for different parameterizations, but that's because it is on a different scale: the curvature may be different, but so is the distance between points. The net result is invariance under a transformation.

The key point of using Jeffreys' prior, then, seems to be this: if we do not want to help the data make its decision, we should give less weight to points that are hard to find evidence against, and more weight to points that are easy to find evidence against (e.g., it would be unjust to give a lot of weight to $\theta=0.5$, because it is hard to exclude this point from the posterior anyway). We do so by taking the prior density proportional to the square root of the Fisher information, which measures the size of the expected curvature of the log likelihood on the $\theta$ scale.

In the Binomial case, this gives a Beta distribution with parameters $.5$ and $.5.$ This distribution gives less weight to intermediate values of $\theta$ (values close to $0.5,$ which are hard to throw out of the posterior anyway) and more weight to extreme values of $\theta$ (values close to $0$ or $1,$ which are easy to throw out of the posterior).
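The Beta$(.5,.5)$ density has a simple closed form (its normalizing constant is $\pi$), which makes the U-shape easy to verify; a small sketch:

```python
from math import pi, sqrt

def beta_half_half(theta):
    # Beta(1/2, 1/2) density: 1 / (pi * sqrt(theta * (1 - theta)))
    return 1.0 / (pi * sqrt(theta * (1.0 - theta)))

# more prior mass near the easy-to-exclude extremes than near theta = .5
assert beta_half_half(0.05) > beta_half_half(0.5)
```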

From here, I see two ways forward. The first is to reject the notion of uninformative priors altogether, because the Bayesian posterior is still different from the frequentist likelihood. The second is to say that by using the Jeffreys prior we finally have a method under which all values of $\theta$ are equally likely before we have seen the data (under frequentist likelihood-based inference, they are not). Reading Jeffreys' 1946 paper, it seems to be all about invariance under transformations. I can see how that is a necessary condition for a prior to be uninformative, but I'm not sure about its sufficiency. I'm not aware of Jeffreys wishing to correct a deficiency of likelihood-based frequentist inference (granted, I haven't looked very hard), but that does seem to be the corollary. Take your pick.

Comments:

  • Great answer! I was trying to go down that same line of thought using the Gaussian distribution prior (en.wikipedia.org/wiki/…) and your post helped me out immensely. (Commented Oct 8, 2020 at 23:12)
  • Great, @thc! I corrected some minor mistakes and added a line about necessity versus sufficiency. (Commented Oct 21, 2020 at 18:36)
  • Very interesting post. But how do you make precise the claim "by using the Jeffreys prior we finally have a method under which all values of $\theta$ are equally likely before we have seen the data"? Thanks. (Commented Oct 9, 2021 at 6:20)
Answer (score 7)

I'd say it isn't absolutely non-informative, but minimally informative. It encodes the (rather weak) prior knowledge that your prior state of knowledge doesn't depend on the parameterisation (e.g. the units of measurement). If your prior state of knowledge were precisely zero, you wouldn't even know that your prior should be invariant to such transformations.

Comments:

  • I am confused. In what sort of case would you know that your prior should depend on the model parameterization? (Commented Apr 30, 2013 at 19:42)
  • If we want to predict longevity as a function of body weight, using a GLM, we know that the conclusion should not be affected by whether we weigh the subjects in kg or lb; if you use a simple uniform prior over the weights, you might get a different outcome depending on the units of measurement. (Commented May 1, 2013 at 10:08)
  • That's a case where you know that it shouldn't be affected. What is a case where it should? (Commented May 3, 2013 at 17:21)
  • I think you are missing my point. Say we don't know anything about the attributes, not even that they have units of measurement to which the analysis should be invariant. In that case your prior would encode less information about the problem than the Jeffreys prior; hence the Jeffreys prior is not completely uninformative. There may or may not be situations where the analysis should not be invariant to some transformation, but that is beside the point. (Commented May 3, 2013 at 17:51)
  • N.B. According to the BUGS book (p. 83), Jeffreys himself referred to such transformation-invariant priors as "minimally informative", which implies that he saw them as encoding some information about the problem. (Commented May 3, 2013 at 18:06)
Answer (score 1)

Instead of spending time thinking about transformations and non-informative priors, I prefer to think of having "no prior". This stems from the fact that in Bayesian posterior inference the log posterior is the sum of the log likelihood and the log prior. If there is no prior (e.g., you are using Stan and simply omit any prior statement), the log posterior comes solely from the log likelihood. With Bayes, information from the data and information from the prior are completely interchangeable, and the posterior distribution can be derived from a data-augmentation argument. Not having prior data is equivalent to not having a prior.

Answer (score 0)

Ref: "Weighing the odds" by Williams.

Consider a random variable ${ [X \vert \Theta = \theta] \sim f(x \vert \Theta = \theta) }$ and a parameter ${ \Theta \sim \pi _{\Theta} (\theta) . }$

Suppose the densities ${ f(x \vert \Theta = \theta) }$ are fixed.

Suppose we are free to choose ${ \pi _{\Theta} (\theta) . }$

Note that the Fisher information of ${ \Theta }$ is

$${ I _{\Theta} (\theta) = \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \theta} \ln f(x \vert \Theta = \theta) \right) ^2 f ( x \vert \Theta = \theta) \, dx . }$$

Note that ${ I _{\Theta} (\theta) }$ might vary with ${ \theta . }$

The goal is to find a reparameterisation ${ \alpha }$ such that

$${ \text{Want:} \quad I _{\alpha(\Theta)} (\alpha(\theta)) \, \, \text{ doesn't vary with } \alpha(\theta) . }$$

Let ${ \alpha }$ be such a reparameterisation.

Note that in general

$${ {\begin{aligned} &\, I _{\alpha(\Theta)}(\alpha(\theta)) \\ = &\, \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \alpha(\theta)} \ln f(x \vert \alpha(\Theta) = \alpha(\theta)) \right) ^2 f ( x \vert \alpha(\Theta) = \alpha(\theta)) \, dx \\ = &\, \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^{-2} \int _{- \infty} ^{+ \infty} \left( \frac{\partial}{\partial \theta} \ln f(x \vert \Theta = \theta) \right) ^2 f ( x \vert \Theta = \theta) \, dx \\ = &\, \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^{-2} I _{\Theta} (\theta) . \end{aligned}} }$$

Hence in general

$${ \boxed{ I _{\alpha(\Theta)}(\alpha(\theta)) \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) ^2 = I _{\Theta} (\theta) } . }$$

Hence

$${ \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) \propto \sqrt{I _{\Theta} (\theta) } . }$$

How is the new parameter ${ \alpha (\Theta) }$ distributed?

Note that in general

$${ \boxed{{\begin{aligned} &\, \pi _{\alpha(\Theta)} (\alpha(\theta)) \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) = \pi _{\Theta} (\theta) \end{aligned}}} . }$$

Suppose we further want

$${ \text{Want:} \quad \pi _{\alpha(\Theta)} (\alpha(\theta)) \, \, \text{ doesn't change with } \alpha(\theta) . }$$

Hence setting

$${ \boxed{\pi _{\Theta} (\theta) \propto \sqrt{I _{\Theta} (\theta)}} }$$

and

$${ \left( \frac{\partial \alpha(\theta)}{\partial \theta} \right) \propto \sqrt{I _{\Theta} (\theta) } }$$

will do.

We call such a ${ \pi _{\Theta} (\theta) }$ a Jeffreys prior.

Hence, intuitively, the Jeffreys prior is the only distribution of ${ \Theta }$ for which there exists a reparameterisation ${ \alpha(\Theta) }$ that has both a uniform distribution and constant Fisher information.
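For the binomial model this uniformising reparameterisation can be written down explicitly: with ${ I _{\Theta} (\theta) = n / (\theta(1-\theta)) }$, taking ${ \alpha(\theta) = \arcsin \sqrt{\theta} }$ gives ${ \partial \alpha(\theta) / \partial \theta \propto \sqrt{I _{\Theta} (\theta)} }$, so the identity above makes the Fisher information in ${ \alpha }$ constant. A quick numerical sketch in Python (the function names are mine):

```python
from math import sqrt

n = 10

def fisher_theta(theta):
    # binomial Fisher information in theta
    return n / (theta * (1 - theta))

def alpha_prime(theta):
    # derivative of alpha(theta) = arcsin(sqrt(theta))
    return 1.0 / (2.0 * sqrt(theta * (1.0 - theta)))

# I_alpha = I_theta / (d alpha / d theta)^2 should equal the constant 4n
for theta in (0.1, 0.25, 0.5, 0.8):
    assert abs(fisher_theta(theta) / alpha_prime(theta) ** 2 - 4 * n) < 1e-9
```

This is the classical arcsine (variance-stabilising) transformation for the binomial parameter.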

Comments:

  • That doesn't mean that it is completely uninformative, though. Jeffreys himself considered his prior minimally informative. (Commented Oct 31 at 14:01)