I'm trying to assess the performance of a supervised machine learning classification algorithm. The observations fall into nominal classes (two for the time being, though I'd like to generalize this to multi-class problems), drawn from a population of 99 subjects.
One of the questions I'd like to answer is whether the algorithm exhibits a significant difference in classification accuracy between the input classes. For the binary classification case, I am comparing mean accuracy between the classes across subjects using a paired Wilcoxon test (since the underlying distribution is non-normal). To generalize this procedure to multi-class problems, I intended to use a Friedman test.
However, the p-values obtained by those two procedures in the case of a binary IV differ wildly, with the Wilcoxon test yielding p < .001 whereas the Friedman test yields p = .25. This leads me to believe I have a fundamental misunderstanding of the structure of the Friedman test.
Is it not appropriate to use a Friedman test in this case to compare the outcome of the repeated measures of the accuracy across all subjects?
My R code to obtain those results (subject is the subject identifier, acc the accuracy DV and expected the observation class IV):
    > head(subject.accuracy, n=10)
       subject expected        acc
    1       10     none 0.97826087
    2       10     high 0.55319149
    3      101     none 1.00000000
    4      101     high 0.68085106
    5      103     none 0.97826087
    6      103     high 1.00000000
    7      104     none 1.00000000
    8      104     high 0.08510638
    9      105     none 0.95121951
    10     105     high 1.00000000
    > ddply(subject.accuracy, .(expected), summarise, mean.acc = mean(acc), se.acc = sd(acc)/sqrt(length(acc)))
      expected  mean.acc     se.acc
    1     none 0.9750619 0.00317064
    2     high 0.7571259 0.03491149
    > wilcox.test(acc ~ expected, subject.accuracy, paired=T)

            Wilcoxon signed rank test with continuity correction

    data:  acc by expected
    V = 3125.5, p-value = 0.0003101
    alternative hypothesis: true location shift is not equal to 0

    > friedman.test(acc ~ expected | subject, subject.accuracy)

            Friedman rank sum test

    data:  acc and expected and subject
    Friedman chi-squared = 1.3011, df = 1, p-value = 0.254
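One thing that may help when comparing the two p-values: with only two conditions and no ties, the Friedman statistic reduces to (n_pos - n_neg)^2 / n, where n_pos and n_neg count the subjects whose accuracy is higher in one condition or the other. It therefore ignores the size of the within-subject differences, which the signed rank test does use. The following is a minimal sketch on synthetic data; the vectors `none` and `high` are made up for illustration and are not the data above.

```r
# Synthetic accuracies for n subjects under two conditions (illustrative only).
set.seed(42)
n <- 30
none <- runif(n, 0.90, 1.00)
high <- runif(n, 0.10, 1.00)  # continuous draws, so ties are not expected

# Friedman test on a subjects-by-conditions matrix (rows are blocks).
fr <- friedman.test(cbind(none, high))

# Paired Wilcoxon signed rank test on the same pairs.
wx <- wilcox.test(none, high, paired = TRUE)

# Sign-based statistic: depends only on the direction of each difference.
d <- none - high
n_pos <- sum(d > 0)
n_neg <- sum(d < 0)
sign_stat <- (n_pos - n_neg)^2 / n

# The Friedman chi-squared matches the sign-based statistic when k = 2.
print(c(friedman = unname(fr$statistic), sign_based = sign_stat))
print(c(p_friedman = fr$p.value, p_wilcoxon = wx$p.value))
```

Because the two tests use different information from the same pairs, their p-values need not agree, and the Friedman test with two groups will often be the less sensitive of the two.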
It is not clear that wilcox.test does a signed rank test comparing the accuracy under the two conditions, because you never tell it the pairing variable. At the very least this is an unsafe way of running the test, because it relies on the ordering of the rows in the input data.
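One way to avoid depending on row order is to match the two conditions by subject before calling the test. This is a minimal sketch, assuming a data frame with the same columns as subject.accuracy; the `demo` data are synthetic and `pair_by_subject` is a hypothetical helper introduced here for illustration.

```r
# Synthetic stand-in for subject.accuracy; the values are made up for illustration.
set.seed(1)
n <- 20
demo <- data.frame(
  subject  = rep(seq_len(n), each = 2),
  expected = rep(c("none", "high"), times = n),
  acc      = runif(2 * n, 0.10, 1.00)
)

# Hypothetical helper: pair the two conditions explicitly by subject
# instead of relying on the order of the rows.
pair_by_subject <- function(df) {
  none <- df[df$expected == "none", c("subject", "acc")]
  high <- df[df$expected == "high", c("subject", "acc")]
  merge(none, high, by = "subject", suffixes = c(".none", ".high"))
}

paired_ordered  <- pair_by_subject(demo)
paired_shuffled <- pair_by_subject(demo[sample(nrow(demo)), ])  # rows in random order

res_ordered  <- wilcox.test(paired_ordered$acc.none,  paired_ordered$acc.high,
                            paired = TRUE)
res_shuffled <- wilcox.test(paired_shuffled$acc.none, paired_shuffled$acc.high,
                            paired = TRUE)

# Both calls see the same subject-wise pairs, so the p-values agree.
print(c(ordered = res_ordered$p.value, shuffled = res_shuffled$p.value))
```

Passing the two matched vectors to wilcox.test makes the pairing visible in the call itself, so shuffling or filtering the long-format data cannot silently change which observations are compared.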