How should I treat categorical variables in Bayesian modelling

Question

I have a data frame that contains three categorical predictors and one numerical response. I would like to compare the categories of each predictor using posterior uncertainty intervals from MCMC draws. The reason is that the data have many outliers that affect the distribution and frequentist summaries but are likely too important to exclude (i.e. they depict actual effects). So I would like to express my findings in a way that says: there is an n% probability that my estimates capture the true effect.

I am still pretty new to Bayesian methods, so what I have done relies mostly on online material. I have tried using mcmc_intervals from the bayesplot package; the code is:

library(rstanarm)   # provides stan_glm
library(bayesplot)  # provides mcmc_intervals
fit1 <- stan_glm(proportion ~ plot, data = faci_2, iter = 1000, seed = 0512)
fit2 <- stan_glm(proportion ~ year, data = faci_2, iter = 1000, seed = 0512)
fit3 <- stan_glm(proportion ~ type, data = faci_2, iter = 1000, seed = 0512)
posterior1 <- as.array(fit1)
posterior2 <- as.array(fit2)
posterior3 <- as.array(fit3)
mcmc_intervals(posterior1)
mcmc_intervals(posterior2)
mcmc_intervals(posterior3)

I modelled them separately to emphasize individual effects (debatable, yes). I plan to control for the other factors once the current issue has been sorted. Also, I make no assumptions about the prior distribution (i.e. it is meant to be uninformative).

So what I get is a graph that looks like this: [figure: CIs by plot]

but I would like to include all categories in the CI graph, not with one of them treated as a baseline or whatever. I imagine categories are treated here the same way as I would dummy variables in a regression, so that one of them would actually be represented by the intercept. But it's not very intuitive as a result. So can someone show me the correct way to compare all of them using credible intervals? Or should I be using some other method to reach my goals? Thank you in advance, and do pardon me if I am not mathematically coherent here.

Pasting some data here for reproduction:

Guillem · Accepted Answer · 2019-06-28 09:54:27Z


If you model the variables separately, you could just do a regression without an intercept? Something like this I guess:

stan_glm(proportion ~ plot - 1, data = faci_2)
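To see what dropping the intercept changes, one can inspect the design-matrix columns in base R. This is only a sketch: the data frame d is a made-up stand-in for faci_2, and the level names a1, b1 and c1 are assumptions, not the real data.

```r
# Made-up stand-in for faci_2 (hypothetical levels and values).
d <- data.frame(
  plot = factor(c("a1", "b1", "c1", "a1", "b1", "c1")),
  proportion = c(0.2, 0.4, 0.1, 0.3, 0.5, 0.2)
)

# Default coding: the first level is absorbed into the intercept.
cols_with_intercept <- colnames(model.matrix(proportion ~ plot, d))

# Without an intercept: one indicator column per level.
cols_no_intercept <- colnames(model.matrix(proportion ~ plot - 1, d))

print(cols_with_intercept)
print(cols_no_intercept)
```

With the intercept removed, each coefficient in a single-factor model is that level's own mean, so an interval plot of such a fit shows one interval per level.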

Or you could process the posterior draws of the parameters in order to remove the intercept. For example, in your figure, you have "Intercept", "plotb1" and "plotc1":

"Intercept" corresponds to the estimate of group D.
"Intercept" + "plotb1" corresponds to the estimate of group B.
"Intercept" + "plotc1" corresponds to the estimate of group C.
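These sums can be computed draw by draw. A minimal sketch, assuming the draws are a matrix such as as.matrix(fit1) would return, with the intercept column named "(Intercept)" as rstanarm labels it. The helper group_draws and its arguments are hypothetical names, and the toy matrix holds made-up numbers rather than output from the model.

```r
# Convert baseline-coded draws into one column of draws per group.
# draws:         matrix, one row per posterior draw, named columns
# prefix:        name of the factor in the formula ("plot" here)
# baseline_name: label for the reference level (hypothetical argument)
group_draws <- function(draws, prefix, baseline_name,
                        intercept = "(Intercept)") {
  # Columns holding the offsets from the baseline, e.g. plotb1, plotc1.
  offsets <- draws[, startsWith(colnames(draws), prefix), drop = FALSE]
  # The baseline group is the intercept itself; every other group is
  # intercept + offset, computed separately for each draw (row).
  out <- cbind(draws[, intercept], draws[, intercept] + offsets)
  colnames(out) <- c(baseline_name, colnames(offsets))
  out
}

# Toy draws (made-up values) standing in for as.matrix(fit1).
toy <- matrix(
  c(0.50, 0.52,    # (Intercept)
    0.10, 0.08,    # plotb1
    -0.05, -0.03,  # plotc1
    0.20, 0.21),   # sigma (ignored by the helper)
  nrow = 2,
  dimnames = list(NULL, c("(Intercept)", "plotb1", "plotc1", "sigma"))
)

res <- group_draws(toy, prefix = "plot", baseline_name = "baseline")
print(res)
```

With a real fit, the resulting matrix could be passed to mcmc_intervals, which accepts a draws-by-parameters matrix with column names, for example mcmc_intervals(group_draws(as.matrix(fit1), "plot", "baseline")).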

By the way, uninformative priors might not be ideal, see this.


Thank you for the comments, especially the second part about the priors; it was a detailed and informative document. It might be ideal to choose a prior based on the standard error of my estimate. As what I have describes a possibly tiny but meaningful effect, I was actually worried that setting an informative prior would somehow pull my results toward 0.
– user2927760, Jun 30, 2019 at 3:54

One follow-up question: if I were to do a regression with all the variables, such as proportion~plotyeartype, how would I do it so that all the variable levels are displayed, instead of having one hidden as in the issue I had before? Thank you in advance. @Guillem
– user2927760, Aug 3, 2019 at 8:37

In short, you can't, because the columns of your design matrix will be linearly dependent. Your baseline will consist of the baseline for plot, the baseline for year and the baseline for type.
– Guillem, Aug 3, 2019 at 10:16
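
The linear dependence can be checked in base R. A rough sketch, assuming a balanced grid with made-up level names (a, b, c, 2017, 2018, x, y are all hypothetical): giving every level of every factor its own indicator column yields more columns than the matrix has rank, whereas the default coding with a shared baseline is full rank.

```r
# Hypothetical balanced design; expand.grid returns factor columns.
d <- expand.grid(
  plot = c("a", "b", "c"),
  year = c("2017", "2018"),
  type = c("x", "y")
)

# One indicator column for every level of every factor.
X_all <- cbind(
  model.matrix(~ 0 + plot, d),
  model.matrix(~ 0 + year, d),
  model.matrix(~ 0 + type, d)
)

# Default coding: intercept plus offsets from the baseline levels.
X_default <- model.matrix(~ plot + year + type, d)

# Each factor's indicators sum to a column of ones, so two of the
# seven columns in X_all are redundant.
cat("all levels:", ncol(X_all), "columns, rank", qr(X_all)$rank, "\n")
cat("default:   ", ncol(X_default), "columns, rank", qr(X_default)$rank, "\n")
```

The intercept in the default coding is the combination of the three baseline levels, which is why no single plot can show every level of every factor as a separate coefficient.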

Thanks a lot, separate analysis is in fact good enough for what I am after at the moment. Will explore other global comparison possibilities. @Guillem
– user2927760, Aug 4, 2019 at 6:39
