Simulating Observations from the Prior Distribution?

Question

Can you directly simulate data from Bayesian Priors and "add them as new rows" to your existing dataset - and then fit a statistical model to this new dataset?

In the context of statistical modelling, I could see this as having two advantages:

1) Simplify computations and avoid the need for MCMC sampling

2) Better exploit real world knowledge about your variables (Bayesian Regression requires you to specify prior distributions on the model parameters (i.e. regression beta coefficients) - when in real life, you are more likely to have knowledge about the prior distributions of the actual variables, and not the model parameters).

For example, suppose you have the following data about the height, weight and age of giraffes. Suppose you are interested in creating a regression model that predicts "age" as a function of "height and weight". Suppose I have the following measurements:

         weight   height age
    1  2998.958 15.26611  53
    2  3002.208 18.08711  52
    3  3008.171 16.70896  49
    4  3002.374 17.37032  55
    5  3000.658 18.04860  50
    6  3002.688 17.24797  45
    7  3004.923 16.45360  47
    8  2987.264 16.71712  47
    9  3011.332 17.76626  50
    10 2983.783 18.10337  42
    11 3007.167 18.18355  50
    12 3007.049 18.11375  53
    13 3002.656 15.49990  42
    14 2986.710 16.73089  47
    15 2998.286 17.12075  52

Now, suppose a giraffe expert tells me the following information that I think can be used as priors:

The average age of a giraffe is 50 years, with a "bell curve shape"
The average height of a giraffe is 16 feet, with a "bell curve shape"
The average weight of a giraffe is 2500 lbs, with a "bell curve shape"

Could I simulate data from this (e.g. suppose I assume that each of these variables has a normal distribution and, for simplicity's sake, is uncorrelated with the other variables), and add the simulated rows to the original data? For example (using the R programming language):

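    # Draw 10 independent values per variable from the assumed normal priors
    age = rnorm(10, 50, 1)
    height = rnorm(10, 16, 1)
    weight = rnorm(10, 2500, 100)
    bayesian_data = data.frame(age, height, weight)

    # Label each row by origin (old_data is the observed data frame shown above)
    bayesian_data$source = "simulated"
    old_data$source = "observed"

    # rbind() matches data frame columns by name, so column order does not matter
    combined_data = rbind(old_data, bayesian_data)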

The new data ("combined_data") would look as follows:

      weight   height      age    source
    2998.958 15.26611 53.00000  observed
    3002.208 18.08711 52.00000  observed
    3008.171 16.70896 49.00000  observed
    3002.374 17.37032 55.00000  observed
    3000.658 18.04860 50.00000  observed
    3002.688 17.24797 45.00000  observed
    3004.923 16.45360 47.00000  observed
    2987.264 16.71712 47.00000  observed
    3011.332 17.76626 50.00000  observed
    2983.783 18.10337 42.00000  observed
    3007.167 18.18355 50.00000  observed
    3007.049 18.11375 53.00000  observed
    3002.656 15.49990 42.00000  observed
    2986.710 16.73089 47.00000  observed
    2998.286 17.12075 52.00000  observed
    2563.850 15.93615 49.55841 simulated
    2646.080 15.85238 49.71639 simulated
    2336.807 16.24114 49.35653 simulated
    2467.514 15.67480 49.67832 simulated
    2501.552 16.55515 48.07730 simulated
    2582.426 16.33497 50.51050 simulated
    2549.077 16.20062 48.55126 simulated
    2595.276 16.80354 50.72527 simulated
    2429.517 16.90839 49.48397 simulated
    2497.516 14.25892 49.33416 simulated

Then, I could directly fit a regression model to this data that would be "Bayesian in spirit":

    model <- lm(age ~ height + weight, data = combined_data)
    summary(model)

    Call:
    lm(formula = age ~ height + weight, data = combined_data)

    Residuals:
       Min     1Q Median     3Q    Max
    -7.570 -1.602  0.416  1.091  5.879

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 45.512105  10.789639   4.218 0.000354 ***
    height       0.560421   0.742336   0.755 0.458290
    weight      -0.002040   0.003024  -0.675 0.506924
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.132 on 22 degrees of freedom
    Multiple R-squared:  0.02989,  Adjusted R-squared:  -0.0583
    F-statistic: 0.3389 on 2 and 22 DF,  p-value: 0.7162

My Question: Does the approach I used make sense?

I do not think what I have done is mathematically correct - I am not sure if you can just simulate data from your priors and add it to your original data.
The approach I listed would require you to assume priors on all your variables, even if you only believed that you had reasonable prior information on only some of your variables. In a traditional Bayesian setting, you can just place "uninformative priors" on variables you did not have reliable information about, and place "informative priors" on the variables that you did have reliable information about (I suppose for any given row, you could just randomly sample your existing data and re-use existing values for some variables and simulate values for the other variables).
The approach I listed partly defeats the purpose of Bayesian modelling: it does not provide the "much desired" posterior distributions and credible intervals of the parameters that are traditionally associated with Bayesian models. I suppose you could re-run the simulation 1000 times, creating 1000 new datasets in which the results of each simulation are appended to the original data, then fit a linear regression model to each of the 1000 datasets and plot a histogram of the estimates of each parameter (i.e. the regression beta coefficients) ... but this would probably become computationally expensive and defeat the original purpose of this approach (i.e. to reduce computation time by avoiding MCMC and directly simulating from the priors).
And finally, in the above example, I had 15 observations and simulated 10 observations from the priors. But why 10? Would 11 have been a better choice - why not 8? In the traditional Bayesian setting, you are not faced with this decision.
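The repeated-simulation idea and the "why 10?" question raised in the points above can be explored with a short sketch. This is only an illustration under the question's own assumptions (independent normal draws with the stated means and standard deviations); the helper name `fit_augmented` and the replication count `n_rep` are hypothetical choices, not part of the original post. Note that the spread of the estimates across replications reflects Monte Carlo variation in the simulated rows, which is not the same thing as a posterior distribution.

```r
set.seed(1)

# Observed data from the question (15 giraffes).
old_data <- data.frame(
  weight = c(2998.958, 3002.208, 3008.171, 3002.374, 3000.658, 3002.688,
             3004.923, 2987.264, 3011.332, 2983.783, 3007.167, 3007.049,
             3002.656, 2986.710, 2998.286),
  height = c(15.26611, 18.08711, 16.70896, 17.37032, 18.04860, 17.24797,
             16.45360, 16.71712, 17.76626, 18.10337, 18.18355, 18.11375,
             15.49990, 16.73089, 17.12075),
  age = c(53, 52, 49, 55, 50, 45, 47, 47, 50, 42, 50, 53, 42, 47, 52)
)

# Hypothetical helper: append n_sim simulated rows, refit, return coefficients.
fit_augmented <- function(n_sim) {
  sim <- data.frame(
    weight = rnorm(n_sim, 2500, 100),
    height = rnorm(n_sim, 16, 1),
    age = rnorm(n_sim, 50, 1)
  )
  coef(lm(age ~ height + weight, data = rbind(old_data, sim)))
}

n_rep <- 1000  # assumed number of replications
draws_10 <- t(replicate(n_rep, fit_augmented(10)))  # n_rep x 3 matrix
draws_50 <- t(replicate(n_rep, fit_augmented(50)))

# Compare the height coefficient under 10 versus 50 simulated rows.
summary(draws_10[, "height"])
summary(draws_50[, "height"])
```

Calling `hist(draws_10[, "height"])` would give the histogram described in the question. Comparing the two summaries shows how the arbitrary choice of the number of simulated rows changes the estimates, since that number effectively controls how much weight the simulated rows receive relative to the 15 observations.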

I am highly skeptical that what I have done is correct (for the reasons I have just listed) - but can someone please provide some additional comments?

Thanks!

gibson25 · Accepted Answer · 2021-11-22 04:32:46Z

Two problems with this approach:

In a Bayesian regression model, the parameters to be estimated are the regression coefficients, so any priors you have would be on those, not on the distributions of age, height, and weight in the population.
If you simulate data in the way you described, then the new age, height, and weight values will be uncorrelated since you used independent univariate normal draws to generate them. This would bias any resulting regression coefficient estimates toward zero.
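The second point can be illustrated with a small simulation. This is a sketch on made-up data, not the giraffe measurements: the true slope of 2, the sample sizes, and the noise level are all assumptions chosen only to make the effect visible.

```r
set.seed(42)

# Hypothetical "real" data in which age truly depends on height (slope = 2).
n_obs <- 200
height <- rnorm(n_obs, 16, 1)
age <- 18 + 2 * height + rnorm(n_obs, 0, 1)
observed <- data.frame(height, age)

# Pseudo-rows drawn independently, so height carries no information about age.
n_sim <- 200
simulated <- data.frame(
  height = rnorm(n_sim, 16, 1),
  age = rnorm(n_sim, 50, 1)
)

# Slope from the observed rows alone versus the pooled rows.
slope_observed <- coef(lm(age ~ height, data = observed))[["height"]]
slope_combined <- coef(lm(age ~ height,
                          data = rbind(observed, simulated)))[["height"]]

c(observed = slope_observed, combined = slope_combined)
```

Because the simulated rows contribute variance in height but no covariance with age, the pooled slope is pulled toward zero, and the more simulated rows are appended, the stronger the pull.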

One thing your question brings to mind is the idea of prior predictive checking, which is a way of evaluating whether your prior distributions result in reasonable beliefs about your observed data. You could do this in a manner similar to what you coded above, but you would need to specify priors on the parameters, then simulate data based on the resulting linear models. If you haven't seen it already, I would recommend Ch 4.3 from Peter Hoff's book A First Course in Bayesian Statistical Methods.
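A prior predictive check along these lines might look like the sketch below. The priors on the intercept, slopes, and residual standard deviation are hypothetical values picked purely for illustration (they come neither from the question nor from Hoff's book), and the covariates are centred so that the intercept prior refers to a giraffe of typical height and weight.

```r
set.seed(123)

# Observed covariates from the question.
weight <- c(2998.958, 3002.208, 3008.171, 3002.374, 3000.658, 3002.688,
            3004.923, 2987.264, 3011.332, 2983.783, 3007.167, 3007.049,
            3002.656, 2986.710, 2998.286)
height <- c(15.26611, 18.08711, 16.70896, 17.37032, 18.04860, 17.24797,
            16.45360, 16.71712, 17.76626, 18.10337, 18.18355, 18.11375,
            15.49990, 16.73089, 17.12075)

# Hypothetical priors on the regression parameters (illustrative values only).
n_draws <- 500
beta0  <- rnorm(n_draws, 50, 10)       # age at average height and weight
beta_h <- rnorm(n_draws, 0, 2)         # years per foot of height
beta_w <- rnorm(n_draws, 0, 0.05)      # years per lb of weight
sigma  <- abs(rnorm(n_draws, 0, 5))    # half-normal residual sd

# Centre the covariates so beta0 is interpretable.
h_c <- height - mean(height)
w_c <- weight - mean(weight)

# One simulated set of 15 ages per prior draw (rows = draws, columns = giraffes).
prior_pred <- t(sapply(seq_len(n_draws), function(i) {
  mu <- beta0[i] + beta_h[i] * h_c + beta_w[i] * w_c
  rnorm(length(mu), mu, sigma[i])
}))

# Range of ages that these priors treat as plausible.
quantile(prior_pred, c(0.025, 0.5, 0.975))
```

If the resulting range includes many implausible ages (for example, negative values), that suggests the priors on the parameters are too diffuse or poorly centred and should be revisited before fitting the model.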

@gibson25: Thank you for your answer! Regarding the 2nd point you raised - I also thought of this. If someone were to attempt this idea, they should probably try to find a multivariate distribution for the variables to better capture correlations and dependencies in the data, and thus reduce possible biases introduced by the simulation process. However, for the sake of simplicity in this Stack Overflow question, I assumed that all 3 variables are uncorrelated in the real world, and thus simulated from univariate distributions.
– stats_noob, Commented Nov 22, 2021 at 4:37
I agree that the main issue is the difficulty in simulating the correlation between the covariates. A cheap expedient would be to use the bootstrap on the available covariates, but with a small sample size this has little justification.
– Xi'an, Commented Nov 22, 2021 at 5:52
@Xi'an: Thank you for your reply! Regarding "difficulty in simulating the correlation between the covariates" - perhaps MCMC samples could be taken from a multivariate distribution of the covariates?
– stats_noob, Commented Nov 22, 2021 at 7:01
