I’ve recently encountered a specific problem and I am ashamed to admit that I am quite stuck. Suppose I have a random sample of data for which I want to calculate, e.g., the 95th quantile. Suppose it is reasonable to fit a normal distribution $N(\mu, \sigma^2)$ to this sample. I need to estimate the parameters $\mu$ and $\sigma^2$ first in order to calculate the quantile (suppose they are unknown in practice). The straightforward way is then to take the quantile of the normal distribution with the estimated parameters. But I’ve encountered the question of whether the quantile could instead be calculated using the Student’s $t$ quantile, as in $$ \overline{x} + t_{0.95}(n-1)\cdot s, $$ where $\overline{x}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the sample size. I am failing to justify why I can’t use the Student’s quantile (apart from the argument that it is a different distribution), since it reminds me of the construction of a confidence interval for the expected value $\mu$ (where the upper bound can be viewed as a quantile), even though I know that is not the goal here. In that situation I also use an estimate of the unknown variance parameter, which intuitively leads me towards the Student’s distribution. Could somebody provide me with a different insight? Thank you!
1 Answer
You can do this, and it should give you a better estimator (lower mean squared error) than simply taking the empirical $95\%$ quantile of your data, assuming your sample is large enough to allow that. Simulate it and see, and in particular note how much uncertainty there is with small to medium sized samples. Whether it is the best estimator is a different question: for very small $n$, I suspect $\overline{x} +\sqrt{\frac{n^2-1}{n^2}}\, t_{0.95}(n)\cdot s$ may be better, and there may be others which are better still.
As an illustration, here is a simulation with $\mu=77$ and $\sigma=14$, so the $95$th percentile of the normal distribution should be close to $100$. With a sample size of $n=100$, your estimator is distributed as the black empirical density below and mine in red (they almost exactly overlap), while the blue line is an estimator that directly takes the quantile from the data, ignoring the fact that we know it was sampled from a normal distribution.
That used the following R code:

est <- function(sampsize, p, simmean = 0, simsd = 1) {
  x <- rnorm(sampsize, simmean, simsd)
  mx <- mean(x)
  sx <- sd(x)
  return(c(mx, sx,
           quantile(x, p),
           mx + qt(p, sampsize - 1) * sx,
           mx + qt(p, sampsize) * sx * sqrt((sampsize^2 - 1) / sampsize^2)))
}

set.seed(2025)
p <- 0.95
sampsize <- 100
simmean <- 77
simsd <- 14
cases <- 10^5

sims <- replicate(cases, est(sampsize, p, simmean, simsd))

plot(density(sims[5, ]), col = "red")
lines(density(sims[4, ]), col = "black")
lines(density(sims[3, ]), col = "blue")
abline(v = qnorm(p, simmean, simsd))

If instead this had used a much smaller sample size, say $n=4$, the densities of the estimators would be much more widely spread, and mine in red would tend to perform better than yours in black.
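To quantify "better", one can also compare the root-mean-squared errors of the three estimators against the true quantile. A minimal self-contained sketch (the seed, sample size and number of replications here are arbitrary choices, not from the answer above):

```r
# Compare RMSE of three estimators of the 95th percentile of N(77, 14^2)
set.seed(1)
p <- 0.95; n <- 100; mu <- 77; sigma <- 14
target <- qnorm(p, mu, sigma)  # the true quantile being estimated

one_run <- function() {
  x <- rnorm(n, mu, sigma)
  mx <- mean(x); sx <- sd(x)
  c(empirical = unname(quantile(x, p)),                    # blue: sample quantile
    student   = mx + qt(p, n - 1) * sx,                    # black: question's estimator
    adjusted  = mx + qt(p, n) * sx * sqrt((n^2 - 1) / n^2) # red: answer's estimator
  )
}

sims <- replicate(10^4, one_run())           # 3 x 10^4 matrix of estimates
rmse <- sqrt(rowMeans((sims - target)^2))    # RMSE of each estimator
print(round(rmse, 3))
```

With $n=100$ the empirical-quantile row typically shows a visibly larger RMSE than either parametric estimator, consistent with the plot described above.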
- $\begingroup$ Thank you for the detailed answer, this really helps! Out of curiosity, where does your estimator (the one that corresponds to the red line) come from? I’ve never seen a formula like this. It rather resembles the formula for a prediction interval for the next observation from a normal distribution with unknown parameters (en.m.wikipedia.org/wiki/Prediction_interval). Does it have some connection to it? Previously my search led me to the prediction interval formula, but I was not sure if it could be used (since it is for the next new observation). $\endgroup$ – thepotato, Feb 18 at 7:33
- $\begingroup$ @thepotato I actually took it from playing with the predictive distribution based on a conjugate prior for a normal distribution with unknown mean and variance/precision, with all the initial prior parameters set to $0$. Note that while a prediction interval apparently uses $\sqrt{\frac{n+1}{n}}$, my suggestion uses $\sqrt{\frac{n+1}{n}\frac{n-1}{n}}$. I do not know which, if either, is best. $\endgroup$ – Henry, Feb 18 at 11:44
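For reference, the two scale factors mentioned in this comment are related by the simple identity
$$\sqrt{\frac{n+1}{n}\cdot\frac{n-1}{n}}=\sqrt{\frac{n^2-1}{n^2}},$$
so the answer's estimator is the prediction-interval scaling $\sqrt{\frac{n+1}{n}}$ shrunk by an extra factor of $\sqrt{\frac{n-1}{n}}$.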
- $\begingroup$ Thinking further and aiming for a less biased estimator, I came up with yet another alternative using the quantile function of the normal distribution rather than of the $t$-distribution: $\overline{x} +\sqrt{\dfrac{n-1}{2}}\dfrac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \Phi^{-1}(0.95)\cdot s$. $\endgroup$ – Henry, Feb 19 at 0:37
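The Gamma-function factor in this last comment is the standard unbiasing constant for $s$: since $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, one has
$$\mathrm{E}[s]=\sigma\sqrt{\frac{2}{n-1}}\,\frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)},$$
so multiplying $s$ by $\sqrt{\frac{n-1}{2}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}$ gives an unbiased estimator of $\sigma$, and hence the quoted estimator is unbiased for the true quantile $\mu+\Phi^{-1}(0.95)\,\sigma$.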
- $\begingroup$ ... but even that may not minimise the mean squared error. $\endgroup$ – Henry, Feb 19 at 0:54
- $\begingroup$ Thank you for the effort @Henry, I think the previous estimators are sufficient for my purposes :) but your ideas are certainly interesting! $\endgroup$ – thepotato, Feb 19 at 11:15

