I am reviewing a paper for a journal. The authors propose a new classification scheme and then assess its classification accuracy using some novel predictor data, with standard classification techniques (SVM, KNN, etc.). I learned of a detail of their procedure through the response document they gave to my initial set of comments; hence this question.
Let us say there are two ways to process the predictor data and derive the predictors (independent variables) used for the classification; call them M1 and M2 (method 1 and method 2). The authors first tried both methods on the entire sample data set and concluded that M1 was better, because M1 gave better results (accuracy stats) on that data. They then did their "main analysis" with M1. In the final paper they talk only about M1 and give a logical-sounding reason for selecting it; that is, they do not mention that they had tested both M1 and M2. The accuracy stats were derived using cross validation (train/test ratio 80:20). I think these accuracy stats (e.g., the overall accuracy, kappa values) obtained from the sample with M1 will now not represent those of the population. Am I right? Or is it only the confidence intervals associated with those stats that are affected?
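To check my intuition, I put together a minimal simulation sketch (Python/scikit-learn, entirely my own construction, not the authors' data or code). Two candidate predictor-processing steps stand in for M1 and M2; neither has any real predictive value, and the "better" one is chosen from the same sample on which its accuracy is then reported. The 5-fold CV here plays the role of the 80:20 split.

```python
# Hypothetical sketch (not the authors' code): two candidate predictor-
# processing methods, neither with any real skill, are compared on the
# full sample and the winner's cross-validated accuracy is reported.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def simulate_once(n=100, p=10):
    # Pure-noise predictors and perfectly balanced labels: the true
    # (population) accuracy of any classifier here is 50%.
    X = rng.normal(size=(n, p))
    y = rng.permutation(np.repeat([0, 1], n // 2))

    # Stand-ins for the two processing methods M1 and M2.
    X_m1 = X[:, : p // 2]
    X_m2 = X[:, p // 2:]

    # 5-fold CV, i.e. an 80:20 train/test split within each fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    acc_m1 = cross_val_score(SVC(), X_m1, y, cv=cv).mean()
    acc_m2 = cross_val_score(SVC(), X_m2, y, cv=cv).mean()

    reported = max(acc_m1, acc_m2)   # method chosen after seeing the results
    prespecified = acc_m1            # method fixed before seeing any results
    return reported, prespecified

results = np.array([simulate_once() for _ in range(200)])
print("mean accuracy, method selected on the same data:", results[:, 0].mean())
print("mean accuracy, method pre-specified            :", results[:, 1].mean())
```

In this toy setting the first number comes out above the second on average, even though neither stand-in method has any skill; that gap is the optimism introduced by the unreported selection step. With only two candidate methods the inflation is modest, but it affects the point estimates themselves, not just their confidence intervals, which is what prompts my question above.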
I also think this is a version of "p-value hacking / data dredging", but it is more like "methods dredging": they first use the data to select the "good/better" method, and then, in the main paper, they do not mention this step. Rather, they give a logical reason for selecting M1 and go on to state the results of using that method.