In my lab we pre-register our hypotheses before running an experiment, and we also try to fully specify our exclusion criteria, coding rules for small categories and edge cases, and so on. However, real-world data is frequently messy, corrupted, fraudulent, or clustered in ways we didn't anticipate. We also sometimes make mistakes and realize that our planned tests were either misaligned with our hypotheses or impossible to run as specified. In those cases we have to decide how to change our approach, which leaves a lot of room for unintentional p-hacking.
Recently a colleague suggested that we could use a split-sample approach, similar to the train/test split used to avoid overfitting a model: do all the data processing and analysis on a relatively small subset of our sample, making any judgement calls or changes we need to, then run the code unchanged on the full sample. That way, even if we did make biased decisions, they were made without seeing the full sample and couldn't have been tuned to the final results. Can anyone point to any literature on the validity of this approach, or guidelines for doing it?
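To make the workflow concrete, here is a minimal sketch of what I have in mind (Python/pandas; the file name, column names, pilot fraction, and model formula are all made up for illustration, not our actual pre-registration):

```python
import pandas as pd
import statsmodels.formula.api as smf

full = pd.read_csv("full_sample.csv")             # hypothetical data file
pilot = full.sample(frac=0.10, random_state=2024) # small "decision" subset

def preprocess(df):
    # All judgement calls (exclusions, recodes, outlier handling) are
    # developed while looking only at `pilot`, then frozen here.
    df = df[df["response_time_ms"].between(200, 10_000)]                 # exclusion rule
    df["condition"] = df["condition"].replace({"ctrl_b": "control"})     # recode a small category
    return df

def analyze(df):
    # Pre-registered model, run unchanged later on the full sample.
    return smf.ols("outcome ~ condition + covariate", data=df).fit()

# Step 1: iterate on preprocess()/analyze() using only the pilot subset.
pilot_fit = analyze(preprocess(pilot))

# Step 2: once the code is frozen, run it verbatim on the full sample.
final_fit = analyze(preprocess(full))
print(final_fit.summary())
```

The key design point is that any changes prompted by messy data or mis-specified models happen inside `preprocess()`/`analyze()` during Step 1 only, and Step 2 is a single, unchanged run.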
Edits: To be clear, I'm not talking about using a split sample to choose among different models based on their performance. I'm talking about decisions about how to clean the data, or about discovering outright errors in the way we planned our model -- e.g., a certain parameter is not estimable because two predictors are aliased or more highly correlated than we realized ahead of time.
Also, this wouldn't be a replacement for other good practices, such as thinking as hard as possible before pre-registering, making data processing decisions before looking at any outcomes, and checking our decisions with outside experts. It would just be an additional layer of safety.
--
* Since this is a more qualitative approach than standard model training, it's not obvious to me that we need to exclude the "training" data from our final analysis and use only the "test" data, but I'm open to learning more.