
I would like to compute the ratio of proportions coming from a Dirichlet distribution. My understanding is that each proportion should be treated as a random variable, and that I should therefore use a Taylor expansion to compute the mean and variance of the ratio. However, when I follow this procedure I obtain very different results from the first- and second-order Taylor expansions, and more importantly the second-order results are difficult to interpret (i.e. they don't seem to make sense). As I am not very familiar with the Dirichlet distribution, I wonder whether I have missed something.

Here is an illustration of the problem on simulated data (using R and the DirichletReg package for the modelling). In this example, we simulate 6 proportions for 10,000 individuals using a Dirichlet distribution.

# load R package
library(DirichletReg)

# simulate data (6 proportions, 10000 individuals)
pct0 = (1:6)/sum(1:6)
set.seed(12345)
data = rdirichlet(10000, pct0)
data = as.data.frame(data)
colnames(data) = paste0('pct', 1:6)

# model estimation
data$Y = DR_data(data[, paste0('pct', 1:6)])
fit = DirichReg(Y ~ 1, data = data, model = 'common')

# retrieve model estimates
mle = as.numeric(fit$coefficients)
names(mle) = paste0('pct', 1:6)

# compute average
a0 = sum(exp(mle))
pct1 = exp(mle)/a0

# compare estimated vs true proportions
cbind(pct0, pct1)

# compute mean (mu), variance (va) and covariance (cov) of the Dirichlet
# distribution for proportion_1 and proportion_2
# source: https://en.wikipedia.org/wiki/Dirichlet_distribution
mu1 = pct1[1]
mu2 = pct1[2]
va1 = (mu1*(1 - mu1))/(1 + a0)
va2 = (mu2*(1 - mu2))/(1 + a0)
cov12 = -(mu1*mu2)/(1 + a0)

# compute average of the ratio proportion_1/proportion_2
# source: https://www.stat.cmu.edu/~hseltman/files/ratio.pdf
# with first-order Taylor expansion
muratio1 = mu1/mu2

# with second-order Taylor expansion
muratio2 = (mu1/mu2) - (cov12/(mu2^2)) + (mu1*va1/(mu2^3))

# compare results
c(muratio1, muratio2)

I have two questions:

(1) Is it OK to obtain average proportions from the Dirichlet regression (DR) that are quite different from the true values? For example, the true value of proportion_1 was 0.04761905, and the mean over the 10,000 individuals is 0.04866258, but the corresponding average from the DR model is 0.09224526 (almost twice the true value!).
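For reference, the empirical mean over individuals quoted here is simply the column average of the simulated data:

mean(data$pct1)
# 0.04866258, the value quoted above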

(2) The average ratio obtained with the first-order Taylor expansion is 0.7919314, which I would say is broadly in line with the true values (the ratio of the true values is 0.5). However, the second-order Taylor expansion gives a very different result (3.239524). If I understand this value correctly, it says that the average of the proportion_1/proportion_2 ratio is 3.24, which would suggest that proportion_1 is on average larger than proportion_2; but we know this is not true.
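In fact, even a direct Monte Carlo look at the simulated ratios (with no Taylor expansion at all) already hints at a problem: the sample mean appears to be dominated by a handful of draws where pct2 is nearly 0, and it moves wildly when the seed is changed.

# direct Monte Carlo inspection of the ratio on the simulated data
ratio = data$pct1/data$pct2
mean(ratio)     # unstable: a few draws with pct2 near 0 dominate the sum
summary(ratio)  # extremely heavy right tail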

  • Close-voters: I believe this question is primarily about statistics, not programming, and that it should therefore stay open. See Closing "software questions". Commented Mar 1, 2023 at 16:17
  • I agree that this is primarily about statistics, but there is a programming issue. Given the particular Dirichlet parameters used, many of the observed values are very near 0 or 1. DR_data modifies those values, moving them a bit towards 0.5 based on a threshold. That "messes up" the smaller and larger proportions. If you use data$Y = DR_data(data[,paste0('pct',1:6)], trafo = 10^(-100)), you'll get more reasonable results for this set of Dirichlet parameters. Commented Mar 6, 2023 at 4:53
  • Thanks JimB! Indeed this seems to make a difference. Based on the example in my post, I compared the results of the DR_data transformation using 3 different values: (1) the default (trafo=sqrt(.Machine$double.eps)), (2) a bit smaller (trafo=10^(-10)), and (3) very small, as suggested in your answer (trafo=10^(-100)). The results don't differ between (1) and (2), but they do with (3), and they are more accurate. Commented Mar 7, 2023 at 7:11

2 Answers


For Dirichlet random variates $x_1,\dots,x_K$ with concentration parameters $\alpha_1,\dots,\alpha_K$,

$$y_{i,j}=\frac{x_i}{x_i+x_j}\sim\text{Beta}(\alpha_i,\alpha_j)$$ and $$\frac{x_i}{x_j}=\frac{y_{i,j}}{1-y_{i,j}}\sim\beta'(\alpha_i,\alpha_j),$$ which is the beta prime distribution. The mean exists for $\alpha_j>1$, and the variance exists for $\alpha_j>2$.
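As a quick simulation check of the Beta result (a sketch with arbitrary example parameters, here a 3-category Dirichlet with $\alpha=(2,3,4)$):

set.seed(1)
x <- DirichletReg::rdirichlet(1e5, c(2, 3, 4))
y <- x[,1]/(x[,1] + x[,2])
ks.test(y, pbeta, 2, 3)  # should show no evidence against Beta(2, 3)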

$$P(x_i<x_j)=P\bigg(\frac{x_i}{x_j}<1\bigg)=I_{1/2}(\alpha_i,\alpha_j)$$

where $I$ is the regularized incomplete beta function.

pbeta(0.5, 1/sum(1:6), 2/sum(1:6))
#> [1] 0.6677137
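The same probability can be cross-checked by simulation with the question's concentration parameters:

set.seed(1)
x <- DirichletReg::rdirichlet(1e5, (1:6)/sum(1:6))
mean(x[,1] < x[,2])  # should be close to 0.6677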

Estimates of $\mathbf{\alpha}$

I didn't look into the specifics of DirichReg, but for estimates of $\mathbf{\alpha}$ the method-of-moments estimates do well here, and we can use optim to refine them into the MLE:

# load R package
library(DirichletReg)

# simulate data (6 proportions, 10000 individuals)
pct0 = (1:6)/sum(1:6)
set.seed(12345)
n <- 1e4L
data = rdirichlet(n, pct0)
data = as.data.frame(data)
colnames(data) = paste0('pct', 1:6)

# method-of-moments estimate: alpha_i = mean_i*alpha_sum, with alpha_sum
# recovered from the variance of one column
alpha0 <- colMeans(data)
alpha0 <- alpha0*(alpha0[3]*(1 - alpha0[3])/var(data[,3]) - 1)

# MLE: minimize the negative log-likelihood, optimizing on the log scale
# so the alphas stay positive
tLogData <- t(log(data))
alpha <- exp(
  optim(
    log(alpha0),
    function(alpha) {
      alpha <- exp(alpha)
      sum((1 - alpha)*tLogData) + n*(sum(lgamma(alpha)) - lgamma(sum(alpha)))
    },
    method = "L-BFGS-B"
  )$par
)

matrix(c(pct0, alpha0, alpha), 3, 6, byrow = TRUE,
       dimnames = list(c("pct0", "MoM", "MLE"), colnames(data)))
#>            pct1       pct2      pct3      pct4      pct5      pct6
#> pct0 0.04761905 0.09523810 0.1428571 0.1904762 0.2380952 0.2857143
#> MoM  0.04847265 0.09570196 0.1431152 0.1910795 0.2377477 0.2799800
#> MLE  0.04797821 0.09448738 0.1427506 0.1881108 0.2355976 0.2822832

Computing $P(\alpha_i>\alpha_j)$

From the comments:

...ultimately I would like to be able to test for differences between ratios. Say the % correspond to budget shares spent on different types of products (20% on A; 30% on B; 10% on C and 40% on D), we could say that people spent 1.5 more on B than A and 2 times more on D than A - But is 1.5 significantly different from 2?

Comparing the ratio of the proportion spent on B vs. A to the ratio of the proportion spent on D vs. A is equivalent to comparing the proportion spent on B to the proportion spent on D, since both ratios share the same positive denominator:

$$P\bigg(\frac{\alpha_D}{\alpha_A}>\frac{\alpha_B}{\alpha_A}\bigg)=P(\alpha_D>\alpha_B)=P(E[x_D]>E[x_B])$$

In what follows, I will use the notation $x_{(i)k}$ to refer to the proportion of class $i$ in the $k^{\text{th}}$ observation out of $n$ total observations.

Recall:

$$y_{(i,j)k}=\frac{x_{(i)k}}{x_{(i)k}+x_{(j)k}}\sim\text{Beta}(\alpha_i,\alpha_j)$$

We can obtain $P(\alpha_i>\alpha_j)$ by numerically integrating the joint posterior:

$$P(\alpha_i>\alpha_j)=\frac{\int_0^{\infty}\int_0^{\alpha_i}\pi(\alpha_i,\alpha_j)\prod_{k=1}^n\frac{y_{(i,j)k}^{\alpha_i-1}(1-y_{(i,j)k})^{\alpha_j-1}}{B(\alpha_i,\alpha_j)}d\alpha_jd\alpha_i}{\int_0^{\infty}\int_0^{\infty}\pi(\alpha_i,\alpha_j)\prod_{k=1}^n\frac{y_{(i,j)k}^{\alpha_i-1}(1-y_{(i,j)k})^{\alpha_j-1}}{B(\alpha_i,\alpha_j)}d\alpha_jd\alpha_i}$$

where $\pi(\alpha_i,\alpha_j)$ is the prior distribution of $\alpha_i$ and $\alpha_j$.

The following R function performs this calculation:

library(cubature)

alpha_compare <- function(x, plogab) {
  # Performs a pairwise comparison of Dirichlet parameters
  # INPUT:
  #   x: Dirichlet-distributed data, with observations along the rows and
  #      categories along the columns
  #   plogab: function returning the natural log of the prior distribution
  # OUTPUT: an ncol(x)-by-ncol(x) matrix, p, where p[i, j] = P(alpha_i > alpha_j)
  n <- nrow(x)
  m <- ncol(x)
  # MoM parameter estimation
  alpha0 <- colMeans(x)
  alpha0 <- alpha0*(alpha0[m%/%2]*(1 - alpha0[m%/%2])/var(x[,m%/%2]) - 1)
  # precalculate sum(log(y)) and sum(log(1 - y))
  sum_ln_y <- outer(1:m, 1:m, Vectorize(function(i, j) sum(log(x[,i]/(x[,i] + x[,j])))))
  p <- diag(NA, m)
  # posterior likelihood
  f <- function(a) function(b) {
    exp(n*(lgamma(a + b) - lgamma(a) - lgamma(b)) + (a - 1)*sum_ln_y[i, j] +
          (b - 1)*sum_ln_y[j, i] + plogab(a, b) - C)
  }
  g <- function(u = NA) {
    Vectorize(function(a) cubintegrate(f(a), 0, if (is.na(u)) a else u)$integral)
  }
  for (i in 1:(m - 1)) {
    for (j in (i + 1):m) {
      # use a scaling constant for numeric stability
      C <- n*(lgamma(alpha0[i] + alpha0[j]) - lgamma(alpha0[i]) - lgamma(alpha0[j])) +
        (alpha0[i] - 1)*sum_ln_y[i, j] + (alpha0[j] - 1)*sum_ln_y[j, i] +
        plogab(alpha0[i], alpha0[j])
      p[i, j] <- cubintegrate(g(), 0, Inf)$integral
      if (p[i, j] != 0) p[i, j] <- p[i, j]/cubintegrate(g(Inf), 0, Inf)$integral
      p[j, i] <- 1 - p[i, j]
    }
  }
  p
}

Demonstrated using $\pi(\alpha_i,\alpha_j)=e^{-\alpha_i-\alpha_j}$, first with the values of $\mathbf{\alpha}$ from the OP's simulated data:

set.seed(69916044)
n <- 1e4L
m <- 6L
x <- matrix(rgamma(m*n, rep((1:m)/sum(1:m), each = n)), n, m)
x <- x/rowSums(x)
colMeans(x)
#> [1] 0.05181357 0.09296486 0.14159702 0.18714575 0.23150131 0.29497748

logprior <- function(a, b) -a - b
alpha_compare(x, logprior)
#>      [,1]      [,2]         [,3]          [,4]        [,5]          [,6]
#> [1,]   NA 0.0000000 0.000000e+00  0.000000e+00 0.000000000  0.000000e+00
#> [2,]    1        NA 3.088801e-06  0.000000e+00 0.000000000  0.000000e+00
#> [3,]    1 0.9999969           NA 9.619879e-104 0.000000000  0.000000e+00
#> [4,]    1 1.0000000 1.000000e+00            NA 0.000132252 6.671738e-189
#> [5,]    1 1.0000000 1.000000e+00  9.998677e-01          NA  1.758110e-58
#> [6,]    1 1.0000000 1.000000e+00  1.000000e+00 1.000000000            NA

With a large number of samples and a large separation between the values of $\mathbf{\alpha}$, the posterior probabilities are unsurprisingly close to 0 and 1.

A second example with a smaller number of samples and less separation between the values of $\mathbf{\alpha}$:

n <- 100L
x <- matrix(rgamma(m*n, rep(m + (1:m)/sum(1:m), each = n)), n, m)
x <- x/rowSums(x)
colMeans(x)
#> [1] 0.1557781 0.1542671 0.1676780 0.1816739 0.1702702 0.1703327

alpha_compare(x, logprior)
#>           [,1]      [,2]      [,3]        [,4]       [,5]       [,6]
#> [1,]        NA 0.6131259 0.1542035 0.010199254 0.03471897 0.06467440
#> [2,] 0.3868741        NA 0.1090327 0.005366352 0.01768526 0.04666039
#> [3,] 0.8457965 0.8909673        NA 0.105368051 0.22317977 0.32570371
#> [4,] 0.9898007 0.9946336 0.8946319          NA 0.77107643 0.82310036
#> [5,] 0.9652810 0.9823147 0.7768202 0.228923568         NA 0.60937322
#> [6,] 0.9353256 0.9533396 0.6742963 0.176899641 0.39062678         NA
  • Thanks jblood94! I did not know about this beta prime distribution :) Just to make sure I got it right: if we run a simulation study with pct0 = 100*(1:6)/sum(1:6), the alpha estimates will be {a1=4.777; a2=9.570}. Therefore both mean and variance would exist, and the mean would be 4.777/(9.570-1)=0.557, which seems to be in line with pct0[1]/pct0[2] = 0.5. In my example I am interested in computing several ratios (x2/x1; x3/x1; etc.); does that mean that I need to apply this procedure for each pairwise comparison? Then we would completely ignore the covariances with the left-out x. Is that OK? Commented Mar 7, 2023 at 8:00
  • What are you interested in for the ratios? Are MLEs good enough, or do you want to also measure uncertainty? Commented Mar 7, 2023 at 11:42
  • Because ultimately I would like to be able to test for differences between ratios. Say the percentages correspond to budget shares spent on different types of products (20% on A; 30% on B; 10% on C and 40% on D): we could say that people spent 1.5 times more on B than A and 2 times more on D than A. But is 1.5 significantly different from 2? I was thinking of deriving the ratios (with the formulae you provided) and then applying the delta method to get the standard errors around these new quantities. Commented Mar 7, 2023 at 16:19
  • I updated the answer with a Bayesian method for getting pairwise probability comparisons between the category means. Commented Mar 8, 2023 at 15:24
  • Very good! Sorry that I can only give 1 upvote per answer. Commented Mar 8, 2023 at 17:20

You can obtain the exact mean and variance of the ratio of any two elements of a Dirichlet distribution (assuming that the mean and variance exist, which depends on the values of the parameters).

I'm going to take the lazy way out and use Mathematica.

First define the distribution of the ratio of the first two elements:

dist = DirichletDistribution[{a1, a2, a3, a4, a5, a6}];
dist12 = TransformedDistribution[x1/x2, {x1, x2, x3, x4, x5} \[Distributed] dist];

Now determine the pdf of the ratio:

pdf = PDF[dist12, r] 

$$\frac{r^{a_1-1} (r+1)^{-a_1-a_2} \Gamma (a_1+a_2)}{\Gamma (a_1) \Gamma (a_2)}$$

for $r>0$ and 0 elsewhere.

The mean and variance, when they exist, are

Mean[dist12] 

$$a_1/(a_2-1)$$

Variance[dist12] 

$$\frac{a_1 (a_1+a_2-1)}{(a_2-2) (a_2-1)^2}$$

The mean exists only if $a_2>1$, and the variance exists only if $a_2>2$. Your example has both $a_1$ and $a_2$ less than 1, so neither the mean nor the variance exists in that case.
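For readers without Mathematica, these results are easy to check in R by simulation. A minimal sketch, using assumed example values $a_1=2$, $a_2=6$ (so both moments exist) and the fact that a ratio of Dirichlet elements equals a ratio of independent gamma variates (the common normalizing sum cancels):

a1 <- 2; a2 <- 6  # assumed example values with a2 > 2

# the pdf above integrates to 1 (note Gamma(a1+a2)/(Gamma(a1)*Gamma(a2)) = 1/B(a1, a2))
f <- function(r) r^(a1 - 1)*(r + 1)^(-a1 - a2)/beta(a1, a2)
integrate(f, 0, Inf)

# simulate x1/x2 as a ratio of independent gammas
set.seed(1)
r <- rgamma(1e6, a1)/rgamma(1e6, a2)
c(mean(r), a1/(a2 - 1))                            # both should be close to 0.4
c(var(r), a1*(a1 + a2 - 1)/((a2 - 2)*(a2 - 1)^2))  # both should be close to 0.14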

  • Thanks (again) JimB. Very instructive! However, I struggle to understand the issue with the scaling of the data. In my initial example the true proportions were defined on the [0;1] interval, causing the alpha/concentration parameters (a1; a2) to be < 1. But if I change the scale of the proportions by defining them on the [0;100] interval (such a transformation would still make sense), pct0 = 100*(1:6)/sum(1:6), then the resulting alpha parameters would this time all be > 1 (a1=4.777294; a2=9.569859). Commented Mar 7, 2023 at 7:43
  • You can have the parameters way over 1 and you'll always generate random values between 0 and 1 which sum to 1 with rdirichlet. If you need the resulting sample values to be percentages, then that is a Dirichlet random variable multiplied by 100, which by definition is not a Dirichlet random variable. But maybe I'm not understanding what you need. Commented Mar 8, 2023 at 17:04
