A (simplified) typical workflow in machine learning might be:
- Train $m$ models on a training set.
- Validate the $m$ models on a validation set to yield the best model with parameters $\theta$.
- Retrain the best model on all available data (training and validation), which will generally yield a model with different parameters $\theta'$.
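The steps above can be sketched as follows; this is a minimal illustration assuming scikit-learn, with logistic regression standing in for the $m$ candidate models (here $m = 3$ regularization strengths):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "all available data".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Split the available data into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: train m candidate models on the training set.
candidates = [
    LogisticRegression(C=c).fit(X_train, y_train) for c in (0.01, 0.1, 1.0)
]

# Step 2: validate to select the best model; its fitted
# coefficients play the role of theta.
best = max(candidates, key=lambda m: m.score(X_val, y_val))
theta = best.coef_.copy()

# Step 3: retrain the selected configuration on all available
# data, yielding different parameters theta'.
final = LogisticRegression(C=best.C).fit(X, y)
theta_prime = final.coef_
```

Note that only the hyperparameter choice (here `C`) carries over from validation to the final fit; the parameters themselves are re-estimated from scratch on the combined data.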
Isn't it possible that the parameters $\theta'$ perform worse on unseen real-world data? How do we know that the parameters $\theta'$ (from training on all available data) are better than the parameters $\theta$ (from training on the training set alone)?