
I'm using early stopping with XGBClassifier. The fitting looks like this (simplified):

    # X_train, y_train, X_test, y_test - data split
    model = XGBClassifier(early_stopping_rounds=10, eval_metric="logloss")
    model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)]
    )

As you can see, the test dataset is used for early stopping. Can this be interpreted as data leakage? In my opinion it's not, since there's no direct information transfer from outside the training set to the fitting procedure, and evaluation on the test set may only cause the fitting procedure to stop too early/too late/just in time. But I'm not an expert in XGBoost training and I'm not sure if I'm correct.

I read the related topic LightGBM eval_set - what to do when I fit the final model (there's no test data left), but it doesn't exactly answer my question.


1 Answer

This is a form of data leakage, since the resulting model is influenced by the test set. Here the test set determines, at least partially, how complex the final model is (how many boosting rounds occur).

One would typically use a validation set for early stopping.

[...] there's no direct information transfer from outside the training set to the fitting procedure, and evaluation on the test set may only cause the fitting procedure to stop too early/too late/just in time.

You're right that the influence is not direct, but it is still present. The model's stopping point has been tuned to the characteristics of the test set (in effect, it asks "where should I stop training?" based on the test set).

You have tuned an aspect of the model to the test set, so the model has effectively seen the test set now.

The test score should tell you how the model performs on entirely unseen data; keeping the test set out of the fitting procedure keeps that score realistic (unbiased) rather than misleadingly optimistic.
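As a minimal sketch of the validation-set approach (assuming the same variable names as in the question, and xgboost >= 1.6, where early_stopping_rounds is a constructor argument), you would carve a validation set out of the training data and never show the test set to fit:

    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Carve a validation set out of the training data; the test set is
    # never shown to the fitting procedure.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )

    model = XGBClassifier(early_stopping_rounds=10, eval_metric="logloss")
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])

    # The test set is touched exactly once, for the final unbiased estimate.
    test_score = model.score(X_test, y_test)

Early stopping now tunes the number of rounds to the validation set, so the test score remains an honest estimate of performance on unseen data.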

If you have too little data to split into meaningfully sized training and validation sets, try cross-validation:

  • Put some data aside for the test set (use it only to evaluate the final/deployment model).
  • Cross-validate on the remaining data to determine the number of boosting rounds: perhaps average model.best_iteration over the folds (XGBoost's scikit-learn wrapper exposes best_iteration; LightGBM uses best_iteration_), or run a search for the optimal number of rounds. A sketch follows this list.
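A rough sketch of that recipe, under the assumption that X and y are numpy arrays (hypothetical names; with pandas you would index via .iloc):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from xgboost import XGBClassifier

    # Hold the test set out first; it is used only for the final model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # On each fold, early-stop against that fold's held-out part and
    # record the best iteration.
    best_iters = []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for tr_idx, val_idx in cv.split(X_train, y_train):
        fold_model = XGBClassifier(early_stopping_rounds=10,
                                   eval_metric="logloss")
        fold_model.fit(
            X_train[tr_idx], y_train[tr_idx],
            eval_set=[(X_train[val_idx], y_train[val_idx])],
            verbose=False,
        )
        best_iters.append(fold_model.best_iteration)

    # Refit on all training data with a fixed number of rounds
    # (best_iteration is zero-based, hence the +1); no early stopping needed.
    n_rounds = int(np.mean(best_iters)) + 1
    final_model = XGBClassifier(n_estimators=n_rounds, eval_metric="logloss")
    final_model.fit(X_train, y_train)

The final refit uses every training row, and the number of rounds was chosen without ever consulting the test set.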
