Bayesian Priors in the Real World : Regression Models vs Probability Distributions

Question

Recently, I posted the a question on "How Bayesian Priors are Decided in Real Life" (How are Bayesian Priors Decided in Real Life?). In this question, I created an example where researchers are collecting "height, weight and age" measurements on giraffes for the purpose of predicting the age of a giraffe using the height and weight (e.g. linear regression) . Furthermore, the researchers are interested in using well-established biological information about giraffes as "Bayesian Priors".

When posting this question, I realized a potential problem that might exist when incorporating previous known information into the Bayesian Priors: How can the "prior" distributions of the variables we are using (e.g. the distribution of the "height" and "weight" variables in the natural world) be incorporated into their corresponding model parameter distributions (e.g. in a regression model : beta_height and beta_weight)?

To illustrate my problem (and continue the giraffe example), suppose we collected (directly observed) the following data collected on giraffes (20 Measurements - Realistic Example with Possible Measurement Errors, e.g. a giraffe with height = 17.55129 feet, weight = 630.8886 lbs and age = 20):

 height weight age 1 13.14600 2882.7709 49 2 12.65080 3183.7991 48 3 13.84154 3138.2280 48 4 15.25780 2786.5297 49 5 15.01213 3006.9687 50 6 14.37567 3286.9644 50 7 12.99385 2881.7667 51 8 15.38893 2916.1883 50 9 14.80093 2791.7292 49 10 15.40423 2427.7706 50 11 17.55129 630.8886 20 12 18.34758 1076.6810 19 13 16.37789 1778.5550 20 14 14.98782 1401.4328 17 15 17.40527 361.3323 20 16 16.53979 869.5829 21 17 16.61986 1712.1686 19 18 17.78508 1961.6090 20 19 16.83144 1043.5052 19 20 18.66166 360.3037 20

When plotted, these measurements look like this (using the R programming language):

plot(density(my_data$age), main = "Density of Age Measurements (Years) ") plot(density(my_data$height), main = "Density of Height Measurements (Feet) ") plot(density(my_data$weight), main = "Density of Weight Measurements (Pounds) ")

However, when we consult the leading giraffe experts - they tell us the following information that we want to use as priors:

In the wild, the heights of giraffes roughly follow a normal distribution with an average height of 17ft and a variance of 1ft
In the wild, the weight of giraffes roughly follow a normal distribution with an average height of 3000 lbs and a variance of 200 lbs
In the wild, the age of giraffes roughly follow a normal distribution with an average height of 50 years and a variance of 5 years.

The distributions of these priors would look something like this:

plot(density(rnorm(100000, 50,5)), main = "Prior Distribution of Age") plot(density(rnorm(100000, 17,1)), main = "Prior Distribution of Height") plot(density(rnorm(100000, 3000, 200)), main = "Prior Distribution of Weight")

We can also plot the distributions of the measurements and the distributions of the priors on the same graphs for comparison purposes:

My Question:

1) Frequentist Linear Regression and Bayesian Linear Regression: If we were to fit a frequentist linear regression model (i.e. the basic linear regression model you learn in any intro stats class) to the data we have measured:

model_1 <- lm(age ~ weight + height, data = my_data_1) > summary(model_1) Call: lm(formula = age ~ weight + height, data = my_data_1) Residuals: Min 1Q Median 3Q Max -11.257 -2.957 1.096 4.621 10.526 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 32.501065 30.952848 1.050 0.308408 weight 0.011480 0.002867 4.005 0.000918 *** height -1.356598 1.649824 -0.822 0.422308

This equation has the following form : Age = 32.501 + 0.01 * beta_weight - 1.35 * beta_height

If we wanted to fit a Bayesian Linear Regression model to this data, we would have to specify priors on "beta_weight" and "beta_height" (not on "weight" and "height"). I tried to write out the estimation equations for the Bayesian Linear Regression model below:

As seen above, in the Bayesian Linear Regression model - the priors are specified on the regression parameters directly. However, it remains unclear : how can we use the real world prior information we have about the variables to specify prior distributions for the regression parameters?

library(rstanarm) #specify priors on beta regression parameters my_prior <- normal(location = c(???, ???), scale = c(???, ???)) #run bayesian regression model model_2 <- stan_glm(age~., data=my_data, prior = my_prior, seed=111)

This leads me to a potential solution I thought of for this problem - perhaps we could model this data directly with a joint probability distribution function? This would allow us to directly use the prior information available to us from the giraffe experts.

2) Probability Distributions : I thought of the following idea - if we were to directly fit a joint probability distribution to the data instead of a regression model, we would then be able to better make sure of the real world prior information available on the variables. Below, I wrote out the estimation equations as to how a Multivariate Normal Distribution (MVN) could be fit to the giraffe data using Bayesian Priors:

Using R, we can can easily fit a joint probability distribution to the giraffe data using the Bayesian Priors specified to us by the giraffe experts:

library(mclust) #define covariance matrix sigma1.pre <- c(1, 0 , 0 , 0, 200, 0, 0, 0, 5) sigma1 <- matrix(sigma1.pre, nrow=3, ncol= 3, byrow = TRUE) #fit model mvn <- Mclust(my_data, G =1, prior = priorControl(mean = c(17, 3000, 50), scale = sigma1)) #mvn_1 <- Mclust(my_data, G =1, modelNames = "EEI", prior = priorControl(mean = c(17, 3000, 50), scale = sigma1)) #view results summary(mvn, parameters = TRUE) log-likelihood n df BIC ICL -262.3276 20 9 -551.6168 -551.6168 Means: [,1] height 15.69963 weight 2025.42602 age 34.45777 Variances: [,,1] height weight age height 2.043062 -965.0237 -13.80214 weight -965.023737 665810.2178 8954.51107 age -13.802138 8954.5111 149.87889

As a result, we have fit a Multivariate Normal Distribution to the giraffe data and were able to directly make use of the prior knowledge given to us by the giraffe experts.

Can someone please tell me if my idea is correct (I don't think it is because the variance for "weight" was predicted as 665810)? By fitting probability distribution functions to the data instead of regression models, does this allow us to directly use as well as make better use of the real world prior information we have on the variables?

Thanks!

Extra: Predicting New Observations Using Probability Distributions via MCMC Sampling (in R)

We all know how linear regression models (whether Frequentist or Bayesian) can be used to predict new observations. However, predicting new observations using probability distributions is not as straightforward and usually requires using Monte Carlo Sampling Algorithms (e.g. Metropolis-Hastings) to randomly sample from the conditional probability distribution. For the probability distribution I fit in this question - suppose we want to predict the age of a giraffe that weighs 2900 lbs and is 16.1 ft tall. I demonstrate how this can be done in R:

 sigma1 <- c(2.04 , -965.02 , - 13.80 , -965.02 , 665810.2 , 8954.51 , -13.80 , 8954.511 , 149.88) sigma <- matrix(sigma1, nrow=3, ncol= 3, byrow = TRUE) sigma_inv <- solve(sigma) sigma_det <- det(sigma) denom = sqrt( (2*pi)^3 * sigma_det) target <- function(x) { x_one = 16.1 - 15.69 x_two = 2900 - 2025.42 x_three = x - 1.4620007 x_t = c(x_one, x_two, x_three) x_t_one <- matrix(x_t, nrow=3, ncol= 1, byrow = TRUE) x_t_two = matrix(x_t, nrow=1, ncol= 3, byrow = TRUE) # In this part, as it's (x-mu)^T * SIGMA * (x-mu) num = exp(-0.5 * x_t_two %*% sigma_inv %*% x_t_one) answer_1 = num/denom return(answer_1) } library(mvtnorm) x = rep(0,10000) x[1] = 15 #initialize for(i in 2:10000){ current_x = x[i-1] new = rmvnorm(n=1, mean=c(current_x), sigma=diag(1), method="chol") # generate bivariate random numbers proposed_x = new[1] A = target(proposed_x)/target(current_x) if(runif(1)<A){ x[i] = proposed_x # accept move with probabily min(1,A) } else { x[i] = current_x # otherwise "reject" move, and stay where we are } }

Now, we can see the posterior distribution for "Age":

hist(x)

This histogram indicates the average agfe of a giraffe that weighs 2900 lbs and is 16.1 ft tall will be 9 years old:

mean(x) 9.755

Sergio · Accepted Answer · 2021-10-26 07:59:26Z

I can think of two ways in which you can establish prior distributions on regression parameters.

The first one is standardizing the data (subtracting the mean and dividing by the standard deviation), then you know that the effect of any single covariate on your variable of interest is very unlikely to create a variation of more than two or three standard deviations on the variable of interest. In that case, you can use a standard normal as a prior for all parameters. This is called a weakly informative prior.

Alternatively, you can use prior predictive checks. You already have the knowledge of how your variables are approximately distributed, so you can simply generate data according to the data generation process you are assuming until a distribution matches what you know. Using the example you have above: you want to predict age from weight and height. You know the approximate distribution of all of them. Simply generate a bunch of samples from weight and height from the distributions above, and then sample some parameters parameters from a simple distribution, say a standard normal. Then check whether with such parameters, and the fake data the age is reasonable with respect to what you already know. If that is not the case, you will have to adjust the prior on the parameters. Notice that there are not changes in the distribution of your covariates, as you assume you have information of these. And also notice that you are not violating any Bayesian principle because you are only using prior information in order to establish your priors. You are not using the data itself.

For practical Bayesian inference, I highly recommend the paper "Bayesian workflow" by some of my favorite authors in Bayesian inference. Here goes the link: https://arxiv.org/abs/2011.01808

PS. I found a bit strange one of your prior bullet points: "For simplicity sake, we assume that there are no correlations and covariances between these variables." Why then, would you run a linear regression on age given the other variables? can you elaborate, please?

: thank you for your answer! I look forward to reading the paper on bayesian workflow! I also removed that sentence - i also realized it doesn't make sense. — stats_noob
– stats_noob, Commented Oct 26, 2021 at 8:15
Could you please provide some general comments if what I did in step 2) makes statistical sense (i.e. fitting a probability distribution to the data)? For predicting new values, does it also make sense to sample the conditional distribution using the metropolis Hasting algorithm as I have done? Thank you so much! — stats_noob
– stats_noob, Commented Oct 26, 2021 at 8:18
Yes, fitting the full joint distribution also makes sense. As, from your prior information, you know that all variables are gaussian. With respect to the conditional distribution, unfortunately I can't check right now the correctness of the code, but if I were to generate conditional distributions from a multivariate normal, I would simply use the formula (see "conditional distributions" here en.wikipedia.org/wiki/Multivariate_normal_distribution ). — Sergio
– Sergio, Commented Oct 26, 2021 at 8:23
Yes, in addition to speed consideration and the "interpretability" of regressions, you might not have information of the observation model (sampling function/likelihood of the parameters) of your data. The situation you described above is highly idealised. To that, I would also add that statistical inference should be driven by questions. That is, what answer do you want to give with the inference you are doing now? you might be able to answer more questions by modeling everything at the same time, but is it necessary? — Sergio
– Sergio, Commented Oct 27, 2021 at 7:28
Because you don't always have the sampling function of all of your variables. In general you know roughly the sampling function of your outcome variable conditioned on the rest of the variables, but knowing (from prior information) the marginal distribution of a particular variable is not common. Let's say you want to model years of education with respect to certain covariates that include income. Conditioned on all relevant variables, you expect that the years of education are normally distributed, but knowing the distribution of income is not easy. Is it half-cauchy? is it log-normal? — Sergio
– Sergio, Commented Oct 28, 2021 at 7:52

Stack Exchange Network

Bayesian Priors in the Real World : Regression Models vs Probability Distributions

1 Answer 1

Linked

Hot Network Questions

Bayesian Priors in the Real World : Regression Models vs Probability Distributions

1 Answer 1

Linked

Related

Hot Network Questions