How to re-mean a vector of probabilities, without having values beyond the [0, 1] bounds?

Question

Imagine we have a vector of probabilities, $x$. Its mean is $\bar{x}$. Now, let's say there is an arbitrary value between 0 and 1 that we would like to re-mean the vector to. That's some given proportion, $p$.

What I would like to do is go from $x$ to $x'$ where $\bar{x'} = p$. However, these represent probabilities, so $x' \in [0, 1]$ also needs to be satisfied. Additionally, $x'$ needs to retain the relative rank ordering of $x$.

One way of doing this would be to simply de-mean the variable and add back $p$:

$x' = x - \bar{x} + p$

However, this will force values to be below 0 or above 1, so not representing probabilities anymore.

Another way of doing this is taking both $x$ and $p$ into logit space, doing the re-meaning there, and then bringing it back out. For shorthand, I'm using $logit$ as the logistic function and $invlogit$ as its inverse:

$x' = invlogit(logit(x) - \overline{logit(x)} + logit(p))$

However, the mean in probability space will no longer be $p$.

How can I re-mean a vector of probabilities, without having values beyond the [0, 1] bounds?

Below is a demonstration in R.

remean <- function(x, p) x - mean(x) + p remean_logit <- function(x, p) { x <- qlogis(x) x <- x - mean(x) + qlogis(p) return(plogis(x)) } set.seed(1839) p <- .56 x <- runif(1000) summary(x) # mean is .497 summary(remean(x, p)) # mean is correct, but max > 1 summary(remean_logit(x, p)) # values within 0 and 1, but mean is incorrect

Setting all $x=p$ has the mean $p$, but presumably you also want to preserve some other qualities of the data, such as their relative ordering. So a completely general solution is $p = \sigma(ax+b)$ where $a >0$ and $b$ is any real number and $\sigma(y) = \frac{1}{1 + \exp(-y)}$. You can adjust $a,b$ to get any mean $ 0 < p <1$ you desire. — Sycorax
– Sycorax ♦, Commented Nov 19, 2024 at 20:16
"However, this will force values to be below 0 or above 1, so not representing probabilities anymore" - I don't see any particular reason why the end result should represent probabilities. The simple fact that the transformation puts everything in the range [0, 1] says nothing about whether those values have any probabilistic interpretation whatsoever. Say you have a weather forecast with the probability of rain for the next week, and you want to adjust it by some value p. The "re-meaned" values aren't calibrated probabilities and say little about the actual chance of rain. — Nuclear Hoagie
– Nuclear Hoagie, Commented Nov 19, 2024 at 21:10
The constraint on the mean is a 1D constraint on a countably infinite space of functions, so those requesting additional desiderata are putting it mildly! There's a huge array of solutions. — whuber
– whuber ♦, Commented Nov 19, 2024 at 21:21
I get the feeling this is an XY problem. It might help to explain the underlying problem this "remeaning" is attempting to solve, so that we can help identify good ways to solve the actual problem rather than solving the "problem" with your attempted solution to your original problem. — Glen_b
– Glen_b, Commented Nov 19, 2024 at 22:11

Ben · Accepted Answer · 2024-11-19 21:02:41Z

As Sycorax mentioned, there are probably other desiderata you have in mind that you haven't made explicit, but here's something to get us started: find the offset in logit space, but minimize the mean in proportion space.

Here's what I mean. Let $x' = \sigma(\sigma^{-1}(x)+\eta)$, and solve numerically for $\eta$ such that $\bar{x}' = p$.

Quick example in R:

x <- rbeta(10,1,1) # Input data print(mean(x)) # Existing mean p <- 0.69 # Desired mean cost <- function(eta) (mean(plogis(qlogis(x) + eta))-p)^2 eta_opt <- optimize(cost, interval = c(-10,10))$minimum xprime <- plogis(qlogis(x) + eta_opt) print(mean(xprime)) # Mean as desired.

Makes sense with what I'm looking for, thank you. For future reference, I edited the original post to also include that I'm wanting to retain the relative rank ordering of values. — Mark White
– Mark White, Commented Nov 20, 2024 at 2:30

jblood94 · Accepted Answer · 2024-11-29 21:50:12Z

To summarize the goal as presented in the question, starting with a vector $x\in[0,1]$, perform a transformation $f(x)=x'$ such that:

$x'\in[0,1]$
$\bar{x'}=p$, where $p\in[0,1]$
If $x_i>x_j$, then $x'_i>x'_j$ ($f$ is an increasing function)

There are myriad transformations that can accomplish this. One general strategy is to transform $x$ to the positive reals via some function $g$, multiply the transformed values by a positive constant, $a$, and then invert the transformation:

$$x'=g^{-1}(a\cdot g(x))$$

Alternatively, transform $x$ to $(-\infty,\infty)$, add or subtract $a$, and then invert the transformation.

Notice that $g$ can be the quantile function of any distribution with support on $[0,\infty)$ (or $(-\infty,\infty)$). A few examples:

Exponential distribution: $g(x)=-\ln(x)$ (which results in $x'=x^a$)
Log-logistic distribution: $g(x)=x/(1-x)$
Half-Cauchy distribution: $g(x)=\text{tan}(\pi x/2)$

Each will behave a little differently, so choose a transformation that has desirable properties for what you are doing.

Demonstrating in R:

set.seed(1839) x <- runif(1000) p <- 0.56 np <- p*length(x) f <- vector("list", 6) # log-logistic f[[1]] <- function(a) (y <- a^2/(1/x - 1))/(y + 1) # exponential f[[2]] <- function(a) pexp(qexp(x, exp(a))) # gamma f[[3]] <- function(a) pgamma(qgamma(x, 1, 1), 1, exp(a)) # chi-squared f[[4]] <- function(a) pchisq(exp(a)*qchisq(x, 1), 1) # log-normal f[[5]] <- function(a) plnorm(exp(a)*qlnorm(x)) # normal f[[6]] <- function(a) pnorm(qnorm(x), a) a <- sapply(f, \(f) uniroot(\(a) sum(f(a)) - np, 0:1, extendInt = "yes", tol = 1e-9)$root) y <- lapply(1:length(f), \(i) f[[i]](a[i])) max(abs(sapply(y, mean) - p)) #> [1] 4.24305e-11 r <- rank(x) all(sapply(y, \(y) all.equal(rank(y), r))) #> [1] TRUE

Another consideration is efficiency when implemented in code. Here are a few candidate functions in R to benchmark.

remean1 <- function(x, p, np = p*length(x)) { x^uniroot(\(r) sum(x^r) - np, log(p)/log(range(x)), tol = 1e-9)$root } remean2 <- function(x, p, np = p*length(x)) { a <- uniroot(\(a) sum((y <- a/(1/x - 1))/(y + 1)) - np, (1/p - 1)/(1/range(x) - 1), tol = 1e-9)$root (y <- a/(1/x - 1))/(y + 1) } remean3 <- function(x, p, np = p*length(x)) { a <- uniroot(\(a) 2*sum(atan(tan(x*pi/2)/a))/pi - np, atanh(range(x))/atanh(p), tol = 1e-9)$root 2*atan(tan(x*pi/2)/a)/pi } remean4 <- function(x, p, np = p*length(x)) { a <- uniroot(\(a) sum(tanh(atanh(x)/a)) - np, atanh(range(x))/atanh(p), tol = 1e-9)$root tanh(atanh(x)/a) } mean(remean1(x, p)) - p #> [1] -2.251033e-11 mean(remean2(x, p)) - p #> [1] 0 mean(remean3(x, p)) - p #> [1] -9.473866e-12 mean(remean4(x, p)) - p #> [1] -4.674039e-14

Benchmark

microbenchmark::microbenchmark( remean1(x, p), remean2(x, p), remean3(x, p), remean4(x, p) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> remean1(x, p) 2117.7 2176.00 2504.393 2236.10 2475.75 15133.2 100 #> remean2(x, p) 237.0 336.30 391.087 351.55 400.60 1573.0 100 #> remean3(x, p) 513.0 569.80 816.528 593.15 660.60 13371.0 100 #> remean4(x, p) 702.0 759.95 988.719 781.40 950.65 10681.8 100

Why would one prefer this method? What problem does it solve that isn’t solved by the other answer? Are there any disadvantages to using it? OP asks for transformations that look like adding a constant, which is not what this answer proposes, so the (subjective) claim that it’s “simpler” doesn’t seem like a relevant differentiation. — Sycorax
– Sycorax ♦, Commented Nov 27, 2024 at 20:30
I get you. See the updated answer. The motivation behind the original answer was to point out that there is more than one way to accomplish the task. Also, I don't read the question as the OP asking for additive transformations--that was just one approach that the OP had (unsuccessfully) explored. — jblood94
– jblood94, Commented Nov 28, 2024 at 4:32

Stack Exchange Network

How to re-mean a vector of probabilities, without having values beyond the [0, 1] bounds?

2 Answers 2

Hot Network Questions

How to re-mean a vector of probabilities, without having values beyond the [0, 1] bounds?

2 Answers 2

Related

Hot Network Questions