
I am working on a classification problem involving repeated measures. My objective is to classify positive patients as early as possible. In my practical application scenario, once the target becomes 1 for a patient, further model predictions are irrelevant for that patient. On the modeling side, features might react with a delay to the target, so I cannot simply keep only the first row when the target=1 appears for each patient. The data is structured as follows:

PatientID  Time     Target  FeatureA
A          2020-01  0       ...
A          2020-02  0       ...
A          2020-03  1       ...
B          2019-12  0       ...
B          2020-01  0       ...
B          2020-02  1       ...
B          2020-03  1       ...
B          2020-04  1       ...

I am seeking best practices for splitting the dataset to prevent data leakage. Here is what I have done so far:

  1. Patient-Level Split: I split the data at the patient level to ensure that each patient is either in the training set or the test set, but never in both. For cross-validation, I use GroupKFold to maintain this separation.

  2. Temporal Split Concerns: I am considering a temporal split, but I am worried about potential data leakage. If a patient appears in both the training and test sets, the model might memorise that this patient's target becomes 1. For example, if patient B's data up to 2020-02 is in the training set and the data from 2020-03 onwards is in the test set, the autocorrelation in the patient's features might lead to leakage: the model has already learned from similar feature values on 2020-02 and will most likely classify 2020-03 onwards correctly. In an ideal scenario, the model would only learn from other, similar patients.
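
For reference, a minimal sketch of the patient-level split in point 1 on the toy data above, using pandas and scikit-learn's GroupKFold (column names as in the table; this is an illustration, not my actual pipeline):

```python
# Patient-level split with GroupKFold: all rows of a patient stay on one
# side of each fold, so the model never sees a test patient in training.
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({
    "PatientID": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Time": ["2020-01", "2020-02", "2020-03",
             "2019-12", "2020-01", "2020-02", "2020-03", "2020-04"],
    "Target": [0, 0, 1, 0, 0, 1, 1, 1],
})

gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(df, groups=df["PatientID"]):
    train_patients = set(df.iloc[train_idx]["PatientID"])
    test_patients = set(df.iloc[test_idx]["PatientID"])
    # no patient appears on both sides of the split
    assert not train_patients & test_patients
```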

Questions:

  • Do my concerns make sense?
  • Is there a way to implement a temporal split without introducing leakage when the same patient appears in both sets and features are autocorrelated? Would this problem be solved if I control feature autocorrelation?
  • Are there any alternative cross-validation strategies that could better suit this problem?
  • Are there any specific models that deal with this type of data? Does it make sense to use time series methods?

Thank you!

1 Answer

Both are legitimate concerns about data leakage.

(Side note: it is not that rare to have situations where multiple factors need to be taken into account for independent splitting; crossed random factors, for example, lead to such situations as well.)

Is there a way to implement a temporal split without introducing leakage when the same patient appears in both sets and features are autocorrelated?

If the same patient appears in both testing and training, there is leakage.

But you can split further, into testing and training patients as well as testing and training times: train on data that is marked training on both factors, and test on data that is marked test on both factors. This means that some patient × time combinations end up neither in the training nor in the test set of a particular surrogate model:

Here's an illustration with two (non-temporal) crossed random factors, cell line and medium. The situation is similar, up to the point where you have to take into account the specifics of temporal vs. nominal factor splitting.

[Figure: crossed random testing scheme; dark blue = test cells, light blue = training cells, red = cells used in neither]

So for each combination of test time x test patients (dark blue in the image), you train only on data that is neither test time nor test patient (light blue). Data that is either test patient and training time or test time and training patient (red) does not enter the training and testing of this surrogate model.

Now, this splitting procedure can be applied e.g. with cross validation for patients and whatever temporal splitting design is adequate. Or, in my example as a (fully) crossed design of 3-fold cell-line CV x 3-fold medium CV, yielding a total of 9 surrogate models.

There is nothing that restricts this to a single-split design: it can just as easily be applied with a k-fold patient split crossed with whatever temporal resampling scheme fits your application.
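
A sketch of this crossed splitting, assuming plain KFold as a stand-in for both factors (in practice you would use a proper temporal scheme for the time axis; patient and time labels here are made up for illustration):

```python
# Crossed patient x time splitting: for each pair of a patient test-fold
# and a time test-block, train only on rows whose patient AND time are
# both in the training part; test only on rows where both are in the
# test part. Rows mixing train/test on the two factors are discarded.
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

patients = np.array(["A", "B", "C", "D", "E", "F"])
times = np.array(["2019-12", "2020-01", "2020-02", "2020-03"])

pat_cv = KFold(n_splits=3)   # 3-fold CV over patients
time_cv = KFold(n_splits=2)  # stand-in for a temporal splitting design

surrogates = []
for (p_tr, p_te), (t_tr, t_te) in product(
        pat_cv.split(patients), time_cv.split(times)):
    train_cells = set(product(patients[p_tr], times[t_tr]))  # light blue
    test_cells = set(product(patients[p_te], times[t_te]))   # dark blue
    # everything else ("red" cells) enters neither training nor testing
    surrogates.append((train_cells, test_cells))

# 3 patient folds x 2 time folds -> 6 surrogate models
```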

Would this problem be solved if I control feature autocorrelation?

No, in the sense that testing/validation must take the data as they are in the real world. Whether or not you need to guard against data leakage along these factors is determined by your application scenario. And for that you already said that repeated measurements of the same patient may be more correlated with each other than with measurements of other patients, and that there may be a systematic development over time in the measurements.

You may find retrospectively that the temporal or patient factor can be dropped from your considerations.


Are there any specific models that deal with this type of data? Does it make sense to use time series methods?

I think it would be beneficial to think hard about how best to set up the modeling, yes. For example, the application sounds to me as if earlier measurements should be taken into account but not required; you may also consider whether a custom loss function that penalizes late diagnosis would help, and you may specifically try models that are not influenced too much by data points that lie deep inside the diagnosed region. But I'll leave this to others for a real answer.
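
One hypothetical way to lean the loss toward early detection, short of a fully custom objective, is to downweight positive rows the further they lie past onset and pass the result as `sample_weight` to the classifier; everything here (column names, the decay factor) is an illustrative assumption, not a recommendation from the answer above:

```python
# Sample weights that decay after a patient's first positive month, so the
# model is pushed hardest to classify the earliest positive rows correctly.
# MonthIndex is an assumed integer encoding of the Time column.
import pandas as pd

df = pd.DataFrame({
    "PatientID": ["B"] * 5,
    "MonthIndex": [0, 1, 2, 3, 4],   # 2019-12 .. 2020-04
    "Target": [0, 0, 1, 1, 1],
})

# month of the first positive target per patient
onset = df[df["Target"] == 1].groupby("PatientID")["MonthIndex"].min()

# months since onset: 0 for rows at or before onset,
# and 0 for never-positive patients (NaN onset)
months_late = (df["MonthIndex"] - df["PatientID"].map(onset)).clip(lower=0).fillna(0)

weights = 0.5 ** months_late  # pass as sample_weight=... to fit()
```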

  • It is unusual for data splitting to work well enough when the sample size is not huge (e.g., 20,000 subjects) because the method is volatile, i.e., too dependent on the luck of the split. Resampling techniques are far better, e.g. the bootstrap or 100 repeats of 10-fold cross-validation. More here. Commented Jul 2, 2024 at 11:28
  • @FrankHarrell How would you split the data for the cross-validation? Commented Jul 2, 2024 at 11:46
  • @FrankHarrell: You're barking up the wrong tree - please see the edit. I hope it is clear now that I do recommend using resampling, even though I restricted my explanations to the conditions on the splitting in the narrower sense, i.e. as it needs to be done for each of the instances (folds, subsamples, ...) that the resampling yields for time and patients. Generally, once you know how to split properly once, it is rather straightforward to do that inside the resampling of your choice. Commented Jul 2, 2024 at 12:37
  • @Dave: what I've described applies to each single surrogate model of CV just as to a single split: for each of the temporal splits, and for each of the k patient folds, determine train & test as described, train, predict (a fully crossed temporal × patient CV design; you may also decide to repeat, or to thin out). Commented Jul 2, 2024 at 12:42
  • @cbeleitesunhappywithSX thank you for your well-written considerations! Interesting CV methodology; is there a name for it in the literature? Commented Jul 3, 2024 at 8:15
