Recently, I posted the a question on "How Bayesian Priors are Decided in Real Life" (How are Bayesian Priors Decided in Real Life?). In this question, I created an example where researchers are collecting "height, weight and age" measurements on giraffes for the purpose of predicting the age of a giraffe using the height and weight (e.g. linear regression) . Furthermore, the researchers are interested in using well-established biological information about giraffes as "Bayesian Priors".
When posting this question, I realized a potential problem that might exist when incorporating previous known information into the Bayesian Priors: How can the "prior" distributions of the variables we are using (e.g. the distribution of the "height" and "weight" variables in the natural world) be incorporated into their corresponding model parameter distributions (e.g. in a regression model : beta_height and beta_weight)?
To illustrate my problem (and continue the giraffe example), suppose we collected (directly observed) the following data collected on giraffes (20 Measurements - Realistic Example with Possible Measurement Errors, e.g. a giraffe with height = 17.55129 feet, weight = 630.8886 lbs and age = 20):
height weight age 1 13.14600 2882.7709 49 2 12.65080 3183.7991 48 3 13.84154 3138.2280 48 4 15.25780 2786.5297 49 5 15.01213 3006.9687 50 6 14.37567 3286.9644 50 7 12.99385 2881.7667 51 8 15.38893 2916.1883 50 9 14.80093 2791.7292 49 10 15.40423 2427.7706 50 11 17.55129 630.8886 20 12 18.34758 1076.6810 19 13 16.37789 1778.5550 20 14 14.98782 1401.4328 17 15 17.40527 361.3323 20 16 16.53979 869.5829 21 17 16.61986 1712.1686 19 18 17.78508 1961.6090 20 19 16.83144 1043.5052 19 20 18.66166 360.3037 20 When plotted, these measurements look like this (using the R programming language):
plot(density(my_data$age), main = "Density of Age Measurements (Years) ") plot(density(my_data$height), main = "Density of Height Measurements (Feet) ") plot(density(my_data$weight), main = "Density of Weight Measurements (Pounds) ") However, when we consult the leading giraffe experts - they tell us the following information that we want to use as priors:
In the wild, the heights of giraffes roughly follow a normal distribution with an average height of 17ft and a variance of 1ft
In the wild, the weight of giraffes roughly follow a normal distribution with an average height of 3000 lbs and a variance of 200 lbs
In the wild, the age of giraffes roughly follow a normal distribution with an average height of 50 years and a variance of 5 years.
The distributions of these priors would look something like this:
plot(density(rnorm(100000, 50,5)), main = "Prior Distribution of Age") plot(density(rnorm(100000, 17,1)), main = "Prior Distribution of Height") plot(density(rnorm(100000, 3000, 200)), main = "Prior Distribution of Weight") We can also plot the distributions of the measurements and the distributions of the priors on the same graphs for comparison purposes:
My Question:
1) Frequentist Linear Regression and Bayesian Linear Regression: If we were to fit a frequentist linear regression model (i.e. the basic linear regression model you learn in any intro stats class) to the data we have measured:
model_1 <- lm(age ~ weight + height, data = my_data_1) > summary(model_1) Call: lm(formula = age ~ weight + height, data = my_data_1) Residuals: Min 1Q Median 3Q Max -11.257 -2.957 1.096 4.621 10.526 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 32.501065 30.952848 1.050 0.308408 weight 0.011480 0.002867 4.005 0.000918 *** height -1.356598 1.649824 -0.822 0.422308 This equation has the following form : Age = 32.501 + 0.01 * beta_weight - 1.35 * beta_height
If we wanted to fit a Bayesian Linear Regression model to this data, we would have to specify priors on "beta_weight" and "beta_height" (not on "weight" and "height"). I tried to write out the estimation equations for the Bayesian Linear Regression model below:
As seen above, in the Bayesian Linear Regression model - the priors are specified on the regression parameters directly. However, it remains unclear : how can we use the real world prior information we have about the variables to specify prior distributions for the regression parameters?
library(rstanarm) #specify priors on beta regression parameters my_prior <- normal(location = c(???, ???), scale = c(???, ???)) #run bayesian regression model model_2 <- stan_glm(age~., data=my_data, prior = my_prior, seed=111) This leads me to a potential solution I thought of for this problem - perhaps we could model this data directly with a joint probability distribution function? This would allow us to directly use the prior information available to us from the giraffe experts.
2) Probability Distributions : I thought of the following idea - if we were to directly fit a joint probability distribution to the data instead of a regression model, we would then be able to better make sure of the real world prior information available on the variables. Below, I wrote out the estimation equations as to how a Multivariate Normal Distribution (MVN) could be fit to the giraffe data using Bayesian Priors:
Using R, we can can easily fit a joint probability distribution to the giraffe data using the Bayesian Priors specified to us by the giraffe experts:
library(mclust) #define covariance matrix sigma1.pre <- c(1, 0 , 0 , 0, 200, 0, 0, 0, 5) sigma1 <- matrix(sigma1.pre, nrow=3, ncol= 3, byrow = TRUE) #fit model mvn <- Mclust(my_data, G =1, prior = priorControl(mean = c(17, 3000, 50), scale = sigma1)) #mvn_1 <- Mclust(my_data, G =1, modelNames = "EEI", prior = priorControl(mean = c(17, 3000, 50), scale = sigma1)) #view results summary(mvn, parameters = TRUE) log-likelihood n df BIC ICL -262.3276 20 9 -551.6168 -551.6168 Means: [,1] height 15.69963 weight 2025.42602 age 34.45777 Variances: [,,1] height weight age height 2.043062 -965.0237 -13.80214 weight -965.023737 665810.2178 8954.51107 age -13.802138 8954.5111 149.87889 As a result, we have fit a Multivariate Normal Distribution to the giraffe data and were able to directly make use of the prior knowledge given to us by the giraffe experts.
Can someone please tell me if my idea is correct (I don't think it is because the variance for "weight" was predicted as 665810)? By fitting probability distribution functions to the data instead of regression models, does this allow us to directly use as well as make better use of the real world prior information we have on the variables?
Thanks!
Extra: Predicting New Observations Using Probability Distributions via MCMC Sampling (in R)
We all know how linear regression models (whether Frequentist or Bayesian) can be used to predict new observations. However, predicting new observations using probability distributions is not as straightforward and usually requires using Monte Carlo Sampling Algorithms (e.g. Metropolis-Hastings) to randomly sample from the conditional probability distribution. For the probability distribution I fit in this question - suppose we want to predict the age of a giraffe that weighs 2900 lbs and is 16.1 ft tall. I demonstrate how this can be done in R:
sigma1 <- c(2.04 , -965.02 , - 13.80 , -965.02 , 665810.2 , 8954.51 , -13.80 , 8954.511 , 149.88) sigma <- matrix(sigma1, nrow=3, ncol= 3, byrow = TRUE) sigma_inv <- solve(sigma) sigma_det <- det(sigma) denom = sqrt( (2*pi)^3 * sigma_det) target <- function(x) { x_one = 16.1 - 15.69 x_two = 2900 - 2025.42 x_three = x - 1.4620007 x_t = c(x_one, x_two, x_three) x_t_one <- matrix(x_t, nrow=3, ncol= 1, byrow = TRUE) x_t_two = matrix(x_t, nrow=1, ncol= 3, byrow = TRUE) # In this part, as it's (x-mu)^T * SIGMA * (x-mu) num = exp(-0.5 * x_t_two %*% sigma_inv %*% x_t_one) answer_1 = num/denom return(answer_1) } library(mvtnorm) x = rep(0,10000) x[1] = 15 #initialize for(i in 2:10000){ current_x = x[i-1] new = rmvnorm(n=1, mean=c(current_x), sigma=diag(1), method="chol") # generate bivariate random numbers proposed_x = new[1] A = target(proposed_x)/target(current_x) if(runif(1)<A){ x[i] = proposed_x # accept move with probabily min(1,A) } else { x[i] = current_x # otherwise "reject" move, and stay where we are } } Now, we can see the posterior distribution for "Age":
hist(x) This histogram indicates the average agfe of a giraffe that weighs 2900 lbs and is 16.1 ft tall will be 9 years old:
mean(x) 9.755 




