
I have some experimental data that I'm trying to analyze. I have 1 response variable and 3 explanatory variables (all factors). The explanatory variables are the presence of a disease (positive or negative), a genetic profile (X or Y), and whether or not an MRI contrast agent was given (YES or NO).

The structure of the data looks like this:

   measurement profile disease contrast
1     -1.76269       X     NEG      YES
2     -0.34492       X     NEG       NO
3      0.57455       X     POS      YES
4      2.16539       X     POS       NO
.          ...     ...     ...      ...
77    -1.76269       Y     NEG      YES
78    -0.34492       Y     NEG       NO
79     0.57455       Y     POS      YES
80     2.16539       Y     POS       NO

I looked into using ANOVA for this analysis, but the post hoc Tukey HSD tests all possible combinations of the explanatory variables, so it makes far more comparisons than I actually care about.

We have some specific hypotheses, e.g.: X.NEG.NO will differ from Y.NEG.NO, X.NEG.NO will differ from X.NEG.YES, X.NEG.NO will differ from X.POS.NO, etc.

(Notice that each compared group consists of a combination of all three variables.)

How can I get only specific comparisons from TukeyHSD? Is there a better approach?

Reproducible example:

my.data <- data.frame(measurement = rnorm(80),
                      my.profile  = rep(c("X", "Y"), each = 40),
                      my.disease  = rep(c("NEG", "NEG", "POS", "POS"), times = 20),
                      my.contrast = rep(c("NO", "YES"), times = 40))
  • I don't have much background in DOE, but I'll try to answer the best I can. I don't think this could be considered an orthogonal design. The full set of comparisons I need to make is: X.NEG.NO vs. Y.NEG.NO, X.POS.NO vs. Y.POS.NO; X.NEG.NO vs. X.NEG.YES, Y.NEG.NO vs. Y.NEG.YES; X.POS.NO vs. X.POS.YES, Y.POS.NO vs. Y.POS.YES; X.NEG.NO vs. X.POS.NO, Y.NEG.NO vs. Y.POS.NO; X.NEG.YES vs. X.POS.YES, Y.NEG.YES vs. Y.POS.YES. Commented Mar 6, 2014 at 14:20
  • All other comparisons are meaningless. I also have 3 different regions that I want to consider all these comparisons in, but they are separate, so it doesn't make sense to compare across them. Commented Mar 6, 2014 at 14:28
  • For some comparisons they do but for others they don't (e.g. n=11 vs. n=9). Commented Mar 6, 2014 at 14:34
  • Thanks! Do I perform diagnostics with the summary command? Also, is there a name for this process so that I can cite it when I write up my results? And when you say I can only compare based on profile and contrast, does that mean I should perform a bivariate test ignoring the disease factor? (e.g. t.test(data$measurement[data$profile == "X"], data$measurement[data$profile == "Y"])) Commented Mar 7, 2014 at 13:35
  • No, please edit away if you think it makes it clearer. Commented Mar 7, 2014 at 15:16

2 Answers


Suppose that we have one response variable and one explanatory variable (5 levels).

con.data <- data.frame(x = c(rnorm(75), rnorm(75) + 5),
                       category = rep(c("A", "B1", "B2", "C1", "C2"), each = 30))

If we do an ANOVA followed by the classical Tukey HSD post hoc test...

m1 <- aov(con.data$x ~ con.data$category)
summary(m1)
TukeyHSD(m1)

            diff        lwr      upr     p adj
B1-A   0.1877877 -0.9109837 1.286559 0.9897357
B2-A   2.2050543  1.1062829 3.303826 0.0000013
C1-A   4.5951737  3.4964022 5.693945 0.0000000
C2-A   4.7790488  3.6802773 5.877820 0.0000000
B2-B1  2.0172666  0.9184952 3.116038 0.0000117
C1-B1  4.4073860  3.3086145 5.506157 0.0000000
C2-B1  4.5912611  3.4924896 5.690033 0.0000000
C1-B2  2.3901193  1.2913479 3.488891 0.0000001
C2-B2  2.5739944  1.4752230 3.672766 0.0000000
C2-C1  0.1838751 -0.9148963 1.282647 0.9905238

...we obtain all possible combinations.

If you want to make only the comparisons that are interesting to you, use CONTRASTS.

In R there are several default contrast matrices: treatment, Helmert, sum, polynomial... but you can also create your own.
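For example, you can inspect the built-in coding matrices for a 5-level factor directly (a quick sketch; these are the standard functions from the stats package):

contr.treatment(5)   # each level vs. the first level (R's default)
contr.helmert(5)     # each level vs. the mean of the preceding levels
contr.sum(5)         # each level vs. the grand mean (last level dropped)
contr.poly(5)        # orthogonal polynomial trends (for ordered factors)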

First, decide which comparisons you are interested in.

Then create a contrast matrix:

contrasts(con.data$category) <- cbind(c(1, -1/4, -1/4, -1/4, -1/4),
                                      c(0, -1/2, -1/2,  1/2,  1/2),
                                      c(0,    0,    0,  1/2, -1/2),
                                      c(0, -1/2,  1/2,    0,    0))

Look at this table for reference:

contrasts(con.data$category)
    [,1] [,2] [,3] [,4]
A   1.00  0.0  0.0  0.0
B1 -0.25 -0.5  0.0 -0.5
B2 -0.25 -0.5  0.0  0.5
C1 -0.25  0.5  0.5  0.0
C2 -0.25  0.5 -0.5  0.0

Notice that the sum of each column is equal to 0.

If you have 5 levels in a factor, there can be only 4 comparisons, due to degrees of freedom.
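A quick sketch to check both properties of the matrix above: the columns sum to zero, and they are mutually orthogonal, which is what makes the four comparisons independent:

colSums(contrasts(con.data$category))    # all zeros
crossprod(contrasts(con.data$category))  # off-diagonal entries are zero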

In the first column you compare the mean of "A" with the mean of all other categories.

In the second column you compare only the "B" categories with the "C" categories (B1+B2 vs. C1+C2).

In the third column you compare only within "C" category (C1 vs. C2).

In the fourth column you compare only within "B" category (B1 vs. B2).


To see the results, re-fit the model with the created contrast matrix.

summary(lm(con.data$x ~ con.data$category))

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)           2.5910     0.1258  20.599  < 2e-16 ***
con.data$category1   -2.3534     0.2516  -9.355  < 2e-16 ***
con.data$category2    3.4907     0.2813  12.411  < 2e-16 ***
con.data$category3   -0.1839     0.3978  -0.462    0.645
con.data$category4    2.0173     0.3978   5.072 1.19e-06 ***

...and each row in the table corresponds to one comparison (one column of the contrast matrix). For example, con.data$category1 is significant, so there is a significant difference between the mean of "A" and the mean of all other groups, etc.
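To see the correspondence explicitly, here is a small sketch (using the simulated con.data above): with orthogonal contrast columns and a balanced design, each slope estimate equals the contrast applied to the group means, scaled by the squared length of its column.

grp.means <- tapply(con.data$x, con.data$category, mean)
cc <- contrasts(con.data$category)

# Reproduces the four slope estimates from the lm() output above
sapply(1:4, function(j) sum(cc[, j] * grp.means) / sum(cc[, j]^2))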

In short:

Try to make a contrast matrix containing only the comparisons you are interested in. With the example above it should not be difficult.
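For the original three-factor question, one way to request only the planned comparisons is to collapse the three factors into a single cell factor and list the contrasts by name. This is a sketch assuming the multcomp package and the my.data example from the question:

library(multcomp)

# Combine the three factors into one 8-level factor with cells like "X.NEG.NO"
my.data$cell <- interaction(my.data$my.profile, my.data$my.disease,
                            my.data$my.contrast)
m.cell <- aov(measurement ~ cell, data = my.data)

# Only the comparisons of interest; p-values are adjusted for these
# contrasts alone, not for all 28 pairwise comparisons
summary(glht(m.cell, linfct = mcp(cell = c(
    "X.NEG.NO - Y.NEG.NO = 0",
    "X.NEG.NO - X.NEG.YES = 0",
    "X.NEG.NO - X.POS.NO = 0"))))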


However!!!

I would not use post hoc tests (or contrasts) on the data immediately. It is like teaching a model to "run" before it can "walk". So the first thing I would do is create a model containing all variables. Subsequently, remove all non-significant variables (or their interactions) according to the marginality rule. This procedure (model reduction) will determine whether all your desired comparisons are necessary.

So start with the full model:

m1 <- glm(measurement ~ my.profile * my.contrast * my.disease, data = my.data)
anova(m1, test = "F")

For example, if the factor "disease" is not significant (alone or in an interaction), it should not be included in the post hoc comparisons.

Suppose the results of the m1 model look like this:

my.profile                          0.00012 **
my.contrast                         0.00231 *
my.disease                          0.07690 .
my.profile:my.contrast              0.26159
my.profile:my.disease               0.07709 .
my.contrast:my.disease              0.21256
my.profile:my.contrast:my.disease   0.23319

Use the rule of marginality:

Update no. 1: you can see that the three-way interaction is non-significant, so let's remove it:

m2 <- update(m1, ~.-my.profile:my.contrast:my.disease) 

Now, show the ANOVA of the updated model:

anova(m2, test = "F")

my.profile               0.00012 **
my.contrast              0.00231 *
my.disease               0.07690 .
my.profile:my.contrast   0.26159
my.profile:my.disease    0.07709 .
my.contrast:my.disease   0.21256

Update no. 2: you can see that the two-way interactions are non-significant, so let's remove them (start with the two-way interaction with the highest p-value):

m3 <- update(m2, ~.-my.profile:my.contrast)
anova(m3, test = "F")

my.profile               0.00012 **
my.contrast              0.00231 *
my.disease               0.07690 .
my.profile:my.disease    0.07709 .
my.contrast:my.disease   0.21256
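At each of these steps, drop1() is a convenient helper (a side sketch, not required for the procedure): it lists F tests only for the terms whose removal respects marginality, so you can simply pick the one with the highest p-value:

# Shows only the droppable terms under the marginality rule
drop1(m2, test = "F")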

Update no. 3: remove another two-way interaction:

m4 <- update(m3, ~.-my.contrast:my.disease)
anova(m4, test = "F")

my.profile              0.00012 **
my.contrast             0.00231 *
my.disease              0.07690 .
my.profile:my.disease   0.07709 .

Update no. 4: remove the last two-way interaction:

m5 <- update(m4, ~.-my.profile:my.disease)
anova(m5, test = "F")

my.profile    0.00012 **
my.contrast   0.00231 *
my.disease    0.07690 .

Update no. 5: remove the non-significant factor:

m6 <- update(m5, ~.-my.disease)
anova(m6, test = "F")

my.profile    0.00012 **
my.contrast   0.00231 *

Model m6 is our final model.
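As an optional check (a sketch), you can confirm that the removed terms did not significantly worsen the fit by comparing the minimal model against the full one:

# A non-significant F test means the simpler model m6 is adequate
anova(m6, m1, test = "F")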

Unfortunately, it is obvious that making the comparisons (Y.NEG.NO, X.NEG.NO and others) makes no sense, because the three-way and two-way interactions are non-significant. And it would not be correct to select the desired rows from TukeyHSD (even if such a post hoc test shows a significant difference!). Believe me, such an approach would be very hard to defend in the peer-review process. So you can make comparisons only within profile (X vs. Y) and within contrast (NO vs. YES). The disease factor is non-significant. Do not be sad: even a non-significant result is a result.

The marginality rule is the practical application of Occam's razor (see also Crawley, M.J., Statistics: An Introduction Using R, 2nd ed., Wiley, 2014, Ch. 10, Multiple Regression, p. 195). A good model is the simplest one that still explains the largest portion of the variability in the data.

You can publish this result in an article as a "full model" (containing all the factors and their interactions) or as a "minimal adequate model" (MAM, containing only the significant effects). I would prefer to include both versions in a manuscript and let the reviewers decide which one they prefer.

The point is not to go fishing for p-values in post hoc tests when the ANOVA results are non-significant.

  • Great, this has really helped set me on the right course. But for multiple factors, do I need multiple matrices, a higher-dimensional array, or can it be done with one matrix the way you describe? Commented Mar 6, 2014 at 16:38
  • smallData <- data.frame(measurement = c(rnorm(5), rnorm(7), rnorm(23-(5+7))), profile = c(rep("X",7), rep("Y",23-7)), contrast = c(rep("YES",10), rep("NO",23-10)), disease = c(rep("POS",14), rep("NEG",23-14))) Commented Mar 6, 2014 at 16:59
  • I'm looking to model measurement ~ profile + contrast + disease, but I'm only interested in the comparisons listed above. Commented Mar 6, 2014 at 17:02
  • There are 82 rows overall. Commented Mar 6, 2014 at 18:35
  • I've added some example code to reproduce the format of my data. Thanks! Commented Mar 6, 2014 at 19:18

Running step() on your model will also work.

# Full model
m1.0 <- glm(logICPMS ~ SITE*ISLAND*ORG*ORGL*CAGE*Metal,
            data = M.m, na.action = na.exclude)

# This removes each interacting factor to give you the MAM
step(m1.0)

# MAM
m1.1 <- glm(formula = logICPMS ~ ISLAND + ORG + CAGE + Metal +
                ISLAND:ORG + ISLAND:CAGE + ISLAND:Metal + ORG:Metal,
            data = M.m, na.action = na.exclude)
  • Stepwise selection methods are generally considered a bad idea (see here). Commented May 20, 2015 at 17:53
  • Since making post hoc comparisons is not the usual purpose of stepwise regression, could you please explain how and why this recommendation answers the question? Commented May 20, 2015 at 19:23
