I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application, once the target becomes 1 for a patient, further model predictions are irrelevant for that patient. On the modeling side, features may respond to the target with a delay, so I cannot simply keep only the first row where Target = 1 for each patient. The data is structured as follows:
| PatientID | Time | Target | FeatureA |
|---|---|---|---|
| A | 2020-01 | 0 | ... |
| A | 2020-02 | 0 | ... |
| A | 2020-03 | 1 | ... |
| B | 2019-12 | 0 | ... |
| B | 2020-01 | 0 | ... |
| B | 2020-02 | 1 | ... |
| B | 2020-03 | 1 | ... |
| B | 2020-04 | 1 | ... |
I am seeking best practices for splitting the dataset to prevent data leakage. Here is what I have done so far:
- Patient-Level Split: I split the data at the patient level so that each patient is in either the training set or the test set, but never in both. For cross-validation, I use `GroupKFold` to maintain this separation.
- Temporal Split Concerns: I am considering a temporal split, but I am worried about potential data leakage. If a patient appears in both the training and test sets, the model might memorise that the patient's target becomes 1. For example, if patient B's data up to 2020-02 is in the training set and data from 2020-03 onwards is in the test set, the autocorrelation in the patient's features might lead to leakage: the model has already learned from similar feature values on 2020-02 and will most likely classify 2020-03 onwards correctly. In an ideal scenario, the model would only learn from other similar patients.
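To make the two setups concrete, here is a minimal sketch, assuming a pandas DataFrame with the columns from the table above. The `FeatureA` values, the cutoff date, and all variable names are made up for illustration and are not my real data. It first runs the patient-level split with scikit-learn's `GroupKFold`, then applies a naive temporal cutoff to show how the same patients end up on both sides.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data mirroring the table above; FeatureA values are invented placeholders.
df = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Time": pd.to_datetime(
        ["2020-01", "2020-02", "2020-03",
         "2019-12", "2020-01", "2020-02", "2020-03", "2020-04"],
        format="%Y-%m",
    ),
    "Target": [0, 0, 1, 0, 0, 1, 1, 1],
    "FeatureA": [0.1, 0.2, 0.9, 0.3, 0.4, 0.8, 0.9, 0.95],
})

# Patient-level split: all rows of a patient fall in exactly one side of each fold.
gkf = GroupKFold(n_splits=2)
group_folds = []
for train_idx, test_idx in gkf.split(df[["FeatureA"]], df["Target"], groups=df["PatientID"]):
    train_patients = set(df.iloc[train_idx]["PatientID"])
    test_patients = set(df.iloc[test_idx]["PatientID"])
    group_folds.append((train_patients, test_patients))

# Naive temporal split: rows before the (assumed) cutoff train, the rest test.
cutoff = pd.Timestamp("2020-03-01")
temporal_train = df[df["Time"] < cutoff]
temporal_test = df[df["Time"] >= cutoff]

# Patients present on both sides of the cutoff; this overlap is the source of my concern.
shared_patients = set(temporal_train["PatientID"]) & set(temporal_test["PatientID"])
```

With the grouped split, the train and test patient sets never intersect. With the temporal cutoff, `shared_patients` is non-empty, which is exactly the situation where autocorrelated features could leak.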
Questions:
- Do my concerns make sense?
- Is there a way to implement a temporal split without introducing leakage when the same patient appears in both sets and features are autocorrelated? Would this problem be solved if I control feature autocorrelation?
- Are there any alternative cross-validation strategies that could better suit this problem?
- Are there any specific models that deal with this type of data? Does it make sense to use time series methods?
Thank you!
