In my lab we pre-register our hypotheses before running an experiment, and we also try to fully specify our exclusion criteria, coding rules for small categories and edge cases, and so on. However, real-world data is frequently messy, corrupted, fraudulent, or clustered in ways we didn't anticipate. We also sometimes make mistakes and realize that our planned tests were either misaligned with our hypotheses or impossible to run as specified. In those cases we have to decide how to change our approach, which leaves a lot of room for unintentional p-hacking.
Recently a colleague suggested that we could use a split-sample approach, similar to the train/test split used to avoid overfitting a model: do all the data processing and analysis on a relatively small subset of our sample, making any judgement calls or changes we need to, then run the code unchanged on the full sample. That way, even if we did make biased decisions, they were made without seeing the full sample and couldn't have been tuned to the final results. Can anyone point to any literature on the validity of this approach, or guidelines for doing it?
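To make the workflow concrete, here is a minimal sketch of what I have in mind (Python/pandas; the file name, column names, pilot fraction, and model formula are all made up for illustration, not our actual pre-registration):

```python
import pandas as pd
import statsmodels.formula.api as smf

full = pd.read_csv("full_sample.csv")             # hypothetical data file
pilot = full.sample(frac=0.10, random_state=2024) # small "decision" subset

def preprocess(df):
    # All judgement calls (exclusions, recodes, outlier handling) are
    # developed while looking only at `pilot`, then frozen here.
    df = df[df["response_time_ms"].between(200, 10_000)]                 # exclusion rule
    df["condition"] = df["condition"].replace({"ctrl_b": "control"})     # recode a small category
    return df

def analyze(df):
    # Pre-registered model, run unchanged later on the full sample.
    return smf.ols("outcome ~ condition + covariate", data=df).fit()

# Step 1: iterate on preprocess()/analyze() using only the pilot subset.
pilot_fit = analyze(preprocess(pilot))

# Step 2: once the code is frozen, run it verbatim on the full sample.
final_fit = analyze(preprocess(full))
print(final_fit.summary())
```

The key design point is that any changes prompted by messy data or mis-specified models happen inside `preprocess()`/`analyze()` during Step 1 only, and Step 2 is a single, unchanged run.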
Edits: To be clear, I'm not talking about using a split sample to choose among different models based on their performance. I'm talking about decisions about how to clean the data, or about discovering outright errors in the way we planned our model -- e.g., a certain parameter is not estimable because two predictors are aliased or more highly correlated than we realized ahead of time.
Also, this wouldn't be a replacement for other good practices, such as thinking as hard as possible before pre-registering, making data processing decisions before looking at any outcomes, and checking our decisions with outside experts. It would just be an additional layer of safety.
--
* Since this is a more qualitative approach than standard model training, it's not obvious to me that we need to exclude the "training" data from our final analysis and use only the "test" data, but I'm open to learning more.