
I'm trying to assess the performance of a supervised machine learning classification algorithm. The observations fall into nominal classes (2 for the time being, although I'd like to generalize this to multi-class problems), drawn from a population of 99 subjects.

One of the questions I'd like to be able to answer is whether the algorithm exhibits a significant difference in classification accuracy between the input classes. For the binary case I am comparing mean accuracy between the classes across subjects using a paired Wilcoxon test (since the underlying distribution is non-normal). To generalize this procedure to multi-class problems I intended to use a Friedman test.
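As a sketch of the intended procedure (my actual analysis is in R, below), here is roughly what the two steps look like in Python/scipy; the per-subject accuracies and effect sizes are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-subject accuracies, for illustration only
acc_none = rng.uniform(0.90, 1.00, size=30)             # class "none"
acc_high = acc_none - rng.uniform(0.00, 0.40, size=30)  # class "high", systematically lower

# Binary case: paired Wilcoxon signed-rank test across subjects
w_stat, w_p = stats.wilcoxon(acc_none, acc_high)

# Multi-class generalization: Friedman test over the same subjects
# (scipy requires at least 3 related samples here)
acc_mid = acc_none - rng.uniform(0.00, 0.20, size=30)   # a third hypothetical class
f_stat, f_p = stats.friedmanchisquare(acc_none, acc_mid, acc_high)

print(w_p, f_p)  # both should flag the (simulated) accuracy differences
```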

However, the p values obtained by those two procedures in the case of a binary IV differ wildly: the Wilcoxon test yields p < .001, whereas the Friedman test yields p = .25. This leads me to believe I have a fundamental misunderstanding of the structure of the Friedman test.

Is it not appropriate to use a Friedman test in this case to compare the outcome of the repeated measures of the accuracy across all subjects?

My R code to obtain those results (subject is the subject identifier, acc the accuracy DV, and expected the observation class IV):

    > head(subject.accuracy, n=10)
       subject expected        acc
    1       10     none 0.97826087
    2       10     high 0.55319149
    3      101     none 1.00000000
    4      101     high 0.68085106
    5      103     none 0.97826087
    6      103     high 1.00000000
    7      104     none 1.00000000
    8      104     high 0.08510638
    9      105     none 0.95121951
    10     105     high 1.00000000
    > ddply(subject.accuracy, .(expected), summarise, mean.acc = mean(acc), se.acc = sd(acc)/sqrt(length(acc)))
      expected  mean.acc     se.acc
    1     none 0.9750619 0.00317064
    2     high 0.7571259 0.03491149
    > wilcox.test(acc ~ expected, subject.accuracy, paired=T)

            Wilcoxon signed rank test with continuity correction

    data:  acc by expected
    V = 3125.5, p-value = 0.0003101
    alternative hypothesis: true location shift is not equal to 0

    > friedman.test(acc ~ expected | subject, subject.accuracy)

            Friedman rank sum test

    data:  acc and expected and subject
    Friedman chi-squared = 1.3011, df = 1, p-value = 0.254
  • I am not sure that your call to wilcox.test does a signed rank test comparing the accuracy under the two conditions, because you never tell it the pairing variable. At the very least this is an unsafe way of running the test, because it relies on the ordering of the rows in the input data. Commented Jan 30, 2014 at 20:39

1 Answer


The Friedman test is not an extension of the Wilcoxon test, so when you have only 2 related samples it is not the same as the Wilcoxon signed-rank test. The latter accounts for the magnitude of the difference within a case (and then ranks those magnitudes across cases), whereas the Friedman test only ranks within a case (and never across cases): it is less sensitive.
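To make the difference concrete, here is a small Python/scipy sketch (the question uses R; the paired differences below are invented for illustration). For 2 conditions with no ties, the Friedman statistic reduces to (n_plus - n_minus)^2 / n, where n_plus and n_minus count positive and negative within-case differences, so it discards exactly the magnitude information the Wilcoxon test uses:

```python
import numpy as np
from scipy import stats

# Invented paired differences: 20 cases where condition A wins by a lot,
# 20 where condition B wins by a hair -- a 20/20 sign split with very
# lopsided magnitudes.
d = np.concatenate([0.30 + 0.01 * np.arange(20),        # large positive differences
                    -(0.001 + 0.001 * np.arange(20))])  # tiny negative differences

# Wilcoxon ranks |d| across cases, so the magnitudes matter
p_wilcoxon = stats.wilcoxon(d).pvalue

# Friedman with k = 2 uses only the signs: chi2 = (n_plus - n_minus)^2 / n
n_plus, n_minus = int(np.sum(d > 0)), int(np.sum(d < 0))
chi2 = (n_plus - n_minus) ** 2 / d.size
p_friedman = stats.chi2.sf(chi2, df=1)

print(p_wilcoxon, p_friedman)  # Wilcoxon small; Friedman exactly 1 here
```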

The Friedman test is actually almost an extension of the sign test. With 2 samples, their p-values are very close, with Friedman being just slightly more conservative (the two tests treat ties in somewhat different ways). This small difference quickly vanishes as the sample size grows. So, for two related samples, these two tests are true peer alternatives.
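A quick numeric check of this correspondence (Python/scipy; the sign split of 65 positive vs 35 negative differences out of 100 pairs is made up for illustration):

```python
from scipy import stats

# Made-up example: 100 paired cases, 65 positive differences, no ties
n, n_plus = 100, 65
n_minus = n - n_plus

p_sign = stats.binomtest(n_plus, n, 0.5).pvalue   # exact sign test
chi2 = (n_plus - n_minus) ** 2 / n                # Friedman statistic for k = 2
p_friedman = stats.chi2.sf(chi2, df=1)

print(p_sign, p_friedman)  # same ballpark, same conclusion
```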

The test that is equivalent to the Wilcoxon test, in the same sense in which the Friedman test corresponds to the sign test, is the not very well known Quade test, mentioned for example here: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/friedman.htm.
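scipy has no built-in Quade test (R ships one as quade.test in the stats package), but the statistic in Conover's formulation is short enough to sketch. The data below are invented, with a consistent ordering across three conditions, so the test should clearly reject:

```python
import numpy as np
from scipy import stats

def quade_test(data):
    """Quade test for a (blocks x treatments) matrix, following Conover's
    formulation: within-block ranks weighted by the ranks of the block ranges."""
    b, k = data.shape
    r = np.apply_along_axis(stats.rankdata, 1, data)          # within-block ranks
    q = stats.rankdata(data.max(axis=1) - data.min(axis=1))   # block weights
    s = q[:, None] * (r - (k + 1) / 2)
    s_j = s.sum(axis=0)                                       # treatment totals
    a2 = (s ** 2).sum()
    bstat = (s_j ** 2).sum() / b
    f = (b - 1) * bstat / (a2 - bstat)
    return f, stats.f.sf(f, k - 1, (b - 1) * (k - 1))

# Invented data: 10 blocks (subjects), 3 treatments, always ordered the same way
i = np.arange(10, dtype=float)
data = np.column_stack([i, i + 1 + 0.1 * i, i + 2 + 0.3 * i])
f_stat, p = quade_test(data)
print(f_stat, p)  # F = 33.0 for this construction; p well below 0.001
```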

