8
$\begingroup$

I have ranked concentrations (from 0 to 5) of 5 different proteins from 3 different locations from the same patient. In total, 15 measures per patient and 61 patients, so 915 observations.

What I'd like to know is if:

  • The 3 locations are the same regarding the concentration of the 5 proteins.
  • Any protein is present in one specific location in higher concentration.

I think I'd need a two-way Friedman's ANOVA since I have 2 categorical variables (protein and location) and 1 ordinal variable (ranked concentration). My questions are:

  1. How can I run the test in R?
  2. Is bootstrapping my only solution?
  3. What about interactions?
  4. I have read about proportional odds. Could it help?

To make it clearer, I here is part of the data:

 Protein Location Concentration Prot1 Loc1 0 Prot1 Loc1 0 Prot1 Loc1 1 Prot1 Loc1 0 Prot1 Loc1 0 Prot1 Loc1 2 Prot1 Loc1 1 Prot1 Loc1 1 Prot1 Loc1 1 Prot1 Loc2 1 Prot1 Loc2 0 Prot1 Loc2 0 Prot1 Loc2 1 Prot1 Loc2 1 Prot1 Loc2 3 Prot1 Loc2 0 Prot1 Loc2 0 Prot1 Loc2 2 Prot1 Loc2 0 Prot1 Loc2 0 Prot1 Loc3 0 Prot1 Loc3 0 Prot1 Loc3 0 Prot1 Loc3 1 Prot1 Loc3 1 Prot1 Loc3 2 Prot1 Loc3 0 Prot1 Loc3 1 Prot1 Loc3 1 Prot1 Loc3 0 Prot1 Loc3 0 Prot2 Loc1 1 Prot2 Loc1 1 Prot2 Loc1 2 Prot2 Loc1 0 Prot2 Loc1 0 Prot2 Loc1 1 Prot2 Loc1 0 Prot2 Loc1 0 Prot2 Loc1 0 Prot2 Loc1 2 Prot2 Loc1 0 Prot2 Loc2 2 Prot2 Loc2 2 Prot2 Loc2 1 Prot2 Loc2 1 Prot2 Loc2 0 Prot2 Loc2 1 Prot2 Loc2 3 Prot2 Loc2 0 Prot2 Loc2 0 Prot2 Loc2 3 Prot2 Loc2 0 Prot2 Loc3 3 Prot2 Loc3 1 Prot2 Loc3 2 Prot2 Loc3 1 Prot2 Loc3 0 Prot2 Loc3 1 Prot2 Loc3 0 Prot2 Loc3 0 Prot2 Loc3 0 Prot2 Loc3 1 Prot2 Loc3 0 Prot3 Loc1 1 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc1 1 Prot3 Loc1 2 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc1 0 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc2 1 Prot3 Loc2 5 Prot3 Loc2 1 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc2 0 Prot3 Loc3 1 Prot3 Loc3 0 Prot3 Loc3 0 Prot3 Loc3 0 Prot3 Loc3 0 Prot3 Loc3 5 Prot3 Loc3 3 Prot3 Loc3 2 Prot3 Loc3 0 Prot3 Loc3 0 Prot3 Loc3 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc1 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc2 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot4 Loc3 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc1 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc2 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 Prot5 Loc3 0 
$\endgroup$

3 Answers 3

12
$\begingroup$

Your data are ordinal ratings, so you need some form of ordinal logistic regression. But I also gather that your data are not independent ("... 15 measures per patient..."), so that needs to be taken into account as well. Thus, the appropriate method here is a mixed effects ordinal logistic regression. In R, mixed effects OLR models can be fit with the ordinal package.

Here is a brief demonstration with your data:

prtdf = read.table(text="Protein Location Concentration Prot1 Loc1 0 ... Prot5 Loc3 0", header=T) 

There are several issues with these data. First, they are not quite balanced (which is not actually a big deal):

with(prtdf, table(Protein, Location)) # Location # Protein Loc1 Loc2 Loc3 # Prot1 9 11 11 # Prot2 11 11 11 # Prot3 11 11 11 # Prot4 11 11 11 # Prot5 11 11 11 

Crucially, they are missing a patient ID indicator. I will make one up, using the assumption that the order within each category is consistent and by patient ID (this may well be totally false in reality, so be forewarned):

prtdf$ID = c(1:9, rep(1:11, times=14)) prtdf = prtdf[,c(4,1:3)] head(prtdf, 10) # ID Protein Location Concentration # 1 1 Prot1 Loc1 0 # 2 2 Prot1 Loc1 0 # 3 3 Prot1 Loc1 1 # 4 4 Prot1 Loc1 0 # 5 5 Prot1 Loc1 0 # 6 6 Prot1 Loc1 2 # 7 7 Prot1 Loc1 1 # 8 8 Prot1 Loc1 1 # 9 9 Prot1 Loc1 1 # 10 1 Prot1 Loc2 1 tail(prtdf, 12) # ID Protein Location Concentration # 152 11 Prot5 Loc2 0 # 153 1 Prot5 Loc3 0 # 154 2 Prot5 Loc3 0 # 155 3 Prot5 Loc3 0 # 156 4 Prot5 Loc3 0 # 157 5 Prot5 Loc3 0 # 158 6 Prot5 Loc3 0 # 159 7 Prot5 Loc3 0 # 160 8 Prot5 Loc3 0 # 161 9 Prot5 Loc3 0 # 162 10 Prot5 Loc3 0 # 163 11 Prot5 Loc3 0 

Next, we need to make sure that ID and Concentration are appropriately categorized as factors. (Note also that you are missing any 4's in Concentration.)

with(prtdf, table(Concentration)) # Concentration # 0 1 2 3 4 5 # 120 26 10 5 0 2 prtdf$Concentration = factor(prtdf$Concentration, levels=0:5, ordered=T) prtdf$ID = factor(prtdf$ID, levels=1:11) 

Now we can try to fit a model:

library(ordinal) mod = clmm(Concentration~Protein*Location+(1|ID), data=prtdf, Hess=T, nAGQ=17) # Warning message: # (1) Hessian is numerically singular: parameters are not uniquely determined # In addition: Absolute convergence criterion was met, but relative criterion was not met 

That crashed. The problem seems to be that all the Concentrations in "Prot4" and "Prot5" are 0:

aggregate(Concentration~Protein*Location, data=prtdf, function(x){ mean(as.numeric(x)) }) # Protein Location Concentration # 1 Prot1 Loc1 1.666667 # 2 Prot2 Loc1 1.636364 # 3 Prot3 Loc1 1.363636 # 4 Prot4 Loc1 1.000000 # 5 Prot5 Loc1 1.000000 # 6 Prot1 Loc2 1.727273 # 7 Prot2 Loc2 2.181818 # 8 Prot3 Loc2 1.636364 # 9 Prot4 Loc2 1.000000 # 10 Prot5 Loc2 1.000000 # 11 Prot1 Loc3 1.545455 # 12 Prot2 Loc3 1.818182 # 13 Prot3 Loc3 2.000000 # 14 Prot4 Loc3 1.000000 # 15 Prot5 Loc3 1.000000 table(prtdf[prtdf$Protein%in%c("Prot4","Prot5"), "Concentration"]) # 0 1 2 3 4 5 # 66 0 0 0 0 0 

We'll simply exclude those levels from the analysis:

mod2 = clmm(Concentration~Protein*Location+(1|ID), data=prtdf, Hess=T, nAGQ=7, subset=!prtdf$Protein%in%c("Prot4","Prot5")) summary(mod2) # Cumulative Link Mixed Model fitted with the adaptive Gauss-Hermite # quadrature approximation with 7 quadrature points # # formula: Concentration ~ Protein * Location + (1 | ID) # data: prtdf # subset: !prtdf$Protein %in% c("Prot4", "Prot5") # # link threshold nobs logLik AIC niter max.grad cond.H # logit flexible 97 -106.48 238.97 760(2283) 1.06e-04 2.0e+02 # # Random effects: # Groups Name Variance Std.Dev. # ID (Intercept) 0.5756 0.7587 # Number of groups: ID 11 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # ProteinProt2 -0.14342 0.85466 -0.168 0.867 # ProteinProt3 -0.96646 0.92526 -1.045 0.296 # LocationLoc2 0.03622 0.86002 0.042 0.966 # LocationLoc3 -0.17634 0.84245 -0.209 0.834 # ProteinProt2:LocationLoc2 0.96548 1.19393 0.809 0.419 # ProteinProt3:LocationLoc2 0.02491 1.31402 0.019 0.985 # ProteinProt2:LocationLoc3 0.57413 1.18357 0.485 0.628 # ProteinProt3:LocationLoc3 1.13724 1.27533 0.892 0.373 # # Threshold coefficients: # Estimate Std. Error z value # 0|1 0.1591 0.6595 0.241 # 1|2 1.6771 0.6904 2.429 # 2|3 2.7789 0.7615 3.649 # 3|5 4.1442 0.9793 4.232 

Now this does return a result, but because your variables are factors (or multilevel categorical variables), the individual level p-values are not of interest. You want to know the significance of the factors as a whole. In particular, I gather you may be interested in knowing if the interaction is significant. We can test that by fitting an additive model (i.e., without the interaction term) and performing a nested model test:

mod2a = clmm(Concentration~Protein+Location+(1|ID), data=prtdf, Hess=T, nAGQ=7, subset=!prtdf$Protein%in%c("Prot4","Prot5")) anova(mod2a, mod2) # Likelihood ratio tests of cumulative link models: # # formula: link: threshold: # mod2a Concentration ~ Protein + Location + (1 | ID) logit flexible # mod2 Concentration ~ Protein * Location + (1 | ID) logit flexible # # no.par AIC logLik LR.stat df Pr(>Chisq) # mod2a 9 233.02 -107.51 # mod2 13 238.97 -106.48 2.053 4 0.726 

The interaction does not appear to be significant for these data. If you also wanted to test the variables in the additive model, that can be conveniently done like so:

drop1(mod2a, test="Chisq") # Single term deletions # # Model: # Concentration ~ Protein + Location + (1 | ID) # Df AIC LRT Pr(>Chi) # <none> 233.02 # Protein 2 232.44 3.4238 0.1805 # Location 2 229.74 0.7212 0.6973 

They are not significant in these data either.

To provide explicit answers to your questions: Although Friedman's test is a one-way test only, ordinal logistic regression is a generalization of the Kruskal-Wallis test, and mixed effects OLR is a generalization of OLR and of Friedman's test. Bootstrapping is unlikely to help you here. Ordinal logistic regression is often called the proportional odds model.

$\endgroup$
6
  • $\begingroup$ I have a follow up question on this. How do you use the predict function with the model fitted with clmm. with clm, I can use 'predict(clmMod, type = 'class') but when I do this for 'predict(clmmMod, type = 'class'), it gives me some error. $\endgroup$ Commented Apr 28, 2019 at 19:55
  • $\begingroup$ Off the top of my head, @Crop89, I don't know. You'd have to read the documentation for that. $\endgroup$ Commented Apr 29, 2019 at 0:49
  • $\begingroup$ @gung-ReinstateMonica, How do you specify a Compound Symmetry covariance matrix in clmm function $\endgroup$ Commented Oct 7, 2020 at 22:15
  • $\begingroup$ @Science11, you don't need to. You just need to specify the random effects correctly. $\endgroup$ Commented Oct 7, 2020 at 23:37
  • $\begingroup$ @gung-ReinstateMonica, Please pardon my ignorance. Can you please elaborate on on your comment, " compound symmetry = correctly specifying random effects" ??. I did not catch this. This is mainly due to my very poor understanding of Random Effects Models. $\endgroup$ Commented Oct 8, 2020 at 15:53
-1
$\begingroup$

You can use simple linear regression with interactions to determine relations of proteins, locations (and their interaction) with concentations. Following is the output from the data that you have provided:

> summary(lm(Concentration~Protein*Location, data=prtdf)) Call: lm(formula = Concentration ~ Protein * Location, data = prtdf) Residuals: Min 1Q Median 3Q Max -1.1818 -0.5455 0.0000 0.0000 4.3636 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.66667 0.27918 2.388 0.0182 * ProteinProt2 -0.03030 0.37645 -0.080 0.9360 ProteinProt3 -0.30303 0.37645 -0.805 0.4221 ProteinProt4 -0.66667 0.37645 -1.771 0.0786 . ProteinProt5 -0.66667 0.37645 -1.771 0.0786 . LocationLoc2 0.06061 0.37645 0.161 0.8723 LocationLoc3 -0.12121 0.37645 -0.322 0.7479 ProteinProt2:LocationLoc2 0.48485 0.51890 0.934 0.3516 ProteinProt3:LocationLoc2 0.21212 0.51890 0.409 0.6833 ProteinProt4:LocationLoc2 -0.06061 0.51890 -0.117 0.9072 ProteinProt5:LocationLoc2 -0.06061 0.51890 -0.117 0.9072 ProteinProt2:LocationLoc3 0.30303 0.51890 0.584 0.5601 ProteinProt3:LocationLoc3 0.75758 0.51890 1.460 0.1464 ProteinProt4:LocationLoc3 0.12121 0.51890 0.234 0.8156 ProteinProt5:LocationLoc3 0.12121 0.51890 0.234 0.8156 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.8375 on 148 degrees of freedom Multiple R-squared: 0.2019, Adjusted R-squared: 0.1263 F-statistic: 2.673 on 14 and 148 DF, p-value: 0.001637 

It shows that protein 4 and 5 are in somewhat higher concentration but their is no difference between any location regarding concentration of protein. Also there is no interaction between location and protein, hence no one protein is preferentially concentrated in any one location.

$\endgroup$
6
  • $\begingroup$ Thanks! Sometimes the asnwer is that easy that I try to overcomplicate it. $\endgroup$ Commented Jul 7, 2015 at 18:36
  • $\begingroup$ I know, I just was waiting 1 day to see if somebody give a different asnwer. But I gues it's time. Thanks for the reminder. $\endgroup$ Commented Jul 8, 2015 at 18:47
  • 1
    $\begingroup$ I've been reading and I don't think a linear regression is the solution. I'll be answering my own question. Let me know what you think. $\endgroup$ Commented Jul 10, 2015 at 13:45
  • 6
    $\begingroup$ -1, this is incorrect. There are 2 problems: (1) the response is ordinal, so linear regression should not be used; (2) the data are independent, so the non-independence needs to be addressed. $\endgroup$ Commented Jul 10, 2015 at 14:12
  • $\begingroup$ It's what I thought and I answered my own question. Do you think a Logistic Regression Model would be a valid solution? How can I address the non-independence? $\endgroup$ Commented Jul 10, 2015 at 14:57
-1
$\begingroup$

After reading a little bit more, I think I got the answer to my own question.

Since the response variable is ordinal and the other factors are categorical, I shouldn't use simple linear regression, so I've used Logistic Regression Model.

There are two packages in R that can handle ordinal variables: lrm {rms} and polr {MASS}.

I opted for lrm because it shows the p-value for each stimate, but both results are the same.

Following @rnso's notation (thanks!):

> library (rms) > lrm (Concentration ~ Protein * Location, data = prtdf) Logistic Regression Model lrm(formula = Concentration ~ Protein * Location, data = prtdf) Frequencies of Responses 0 1 2 3 5 120 26 10 5 2 Model Likelihood Discrimination Rank Discrim. Ratio Test Indexes Indexes Obs 163 LR chi2 60.55 R2 0.380 C 0.805 max |deriv| 0.002 d.f. 14 g 5.173 Dxy 0.610 Pr(> chi2) <0.0001 gr 176.462 gamma 0.644 gp 0.111 tau-a 0.263 Brier 0.083 Coef S.E. Wald Z Pr(>|Z|) y>=1 -0.0288 0.5961 -0.05 0.9615 y>=2 -1.4135 0.6168 -2.29 0.0219 y>=3 -2.4530 0.6863 -3.57 0.0004 y>=5 -3.7741 0.9132 -4.13 <0.0001 Protein=Prot2 -0.1834 0.8232 -0.22 0.8237 Protein=Prot3 -0.9596 0.8933 -1.07 0.2827 Protein=Prot4 -10.4962 58.1844 -0.18 0.8568 Protein=Prot5 -10.4962 58.1844 -0.18 0.8568 Location=Loc2 -0.1326 0.8291 -0.16 0.8729 Location=Loc3 -0.3028 0.8180 -0.37 0.7113 Protein=Prot2 * Location=Loc2 1.0297 1.1526 0.89 0.3717 Protein=Prot3 * Location=Loc2 0.1827 1.2607 0.14 0.8848 Protein=Prot4 * Location=Loc2 0.1326 82.2851 0.00 0.9987 Protein=Prot5 * Location=Loc2 0.1326 82.2851 0.00 0.9987 Protein=Prot2 * Location=Loc3 0.6113 1.1413 0.54 0.5923 Protein=Prot3 * Location=Loc3 1.0436 1.2382 0.84 0.3993 Protein=Prot4 * Location=Loc3 0.3028 82.2850 0.00 0.9971 Protein=Prot5 * Location=Loc3 0.3028 82.2850 0.00 0.9971 

So with the truncated data set I gave, no significant differences are found between protein concentration, location, or their combination.

I've used the guide by the Institute for digital research and education, UCLA for "Ordinal Logistic Regression".

Any other suggestion?

$\endgroup$
4
  • $\begingroup$ What is the meaning of coefficients for 'y>=1' etc? $\endgroup$ Commented Jul 10, 2015 at 15:12
  • 2
    $\begingroup$ This is better than a linear model, but the non-independence needs to be taken into account. That said, this may not merit a downvote. $\endgroup$ Commented Jul 16, 2015 at 22:51
  • $\begingroup$ @rnso, those are the cutpoint parameters for the ordinal logistic regression model. $\endgroup$ Commented Jul 16, 2015 at 22:51
  • $\begingroup$ Can you explain it in some more detail, or point to a web reference. $\endgroup$ Commented Jul 17, 2015 at 0:34

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.