I am reviewing a paper for a journal. The authors propose a new classification scheme and then assess its classification accuracy using some novel predictor data, with standard classification techniques (SVM, KNN, etc.). I learned of a detail of their procedure through the response document they gave to my initial set of comments; hence this question.
Let us say there are two ways to process the predictor data and derive the predictors (independent variables) used for the classification; call them M1 and M2 (method 1 and method 2). The authors first tried both methods on the entire sample data set and concluded that M1 was better, because M1 gave better results (accuracy stats) on that data. They then did their "main analysis" with M1. In the final paper they talk only about M1 and give a logical-sounding reason for selecting it; that is, they do not mention that they had tested both M1 and M2. The accuracy stats were derived using cross validation (train/test ratio 80:20). I think these accuracy stats (e.g., the overall accuracy, kappa values) obtained from the sample with M1 will now not represent those of the population. Am I right? Or is it only the confidence intervals associated with those stats that are affected?
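To check my intuition, I put together a minimal simulation sketch (Python/scikit-learn, entirely my own construction, not the authors' data or code). Two candidate predictor-processing steps stand in for M1 and M2; neither has any real predictive value, and the "better" one is chosen from the same sample on which its accuracy is then reported. The 5-fold CV here plays the role of the 80:20 split.

```python
# Hypothetical sketch (not the authors' code): two candidate predictor-
# processing methods, neither with any real skill, are compared on the
# full sample and the winner's cross-validated accuracy is reported.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def simulate_once(n=100, p=10):
    # Pure-noise predictors and perfectly balanced labels: the true
    # (population) accuracy of any classifier here is 50%.
    X = rng.normal(size=(n, p))
    y = rng.permutation(np.repeat([0, 1], n // 2))

    # Stand-ins for the two processing methods M1 and M2.
    X_m1 = X[:, : p // 2]
    X_m2 = X[:, p // 2:]

    # 5-fold CV, i.e. an 80:20 train/test split within each fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    acc_m1 = cross_val_score(SVC(), X_m1, y, cv=cv).mean()
    acc_m2 = cross_val_score(SVC(), X_m2, y, cv=cv).mean()

    reported = max(acc_m1, acc_m2)   # method chosen after seeing the results
    prespecified = acc_m1            # method fixed before seeing any results
    return reported, prespecified

results = np.array([simulate_once() for _ in range(200)])
print("mean accuracy, method selected on the same data:", results[:, 0].mean())
print("mean accuracy, method pre-specified            :", results[:, 1].mean())
```

In this toy setting the first number comes out above the second on average, even though neither stand-in method has any skill; that gap is the optimism introduced by the unreported selection step. With only two candidate methods the inflation is modest, but it affects the point estimates themselves, not just their confidence intervals, which is what prompts my question above.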
I also think this is a version of "p-value hacking / data dredging", but it is more like "methods dredging": they first use the data to select the "good/better" method, and then, in the main paper, they do not mention this step. Rather, they give a logical reason for selecting M1 and go on to state the results of using that method.