
I was reading the book "An Introduction to Statistical Learning with Applications in R". On page 306, when discussing the objective function of tree models, the book says:

"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by" $$\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2,$$ where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes."

My question is: isn't the optimal solution to this RSS very obvious? We could just partition the whole feature space into $N$ rectangles such that each rectangle contains only one data point, and thereby achieve zero RSS. Setting test performance aside: if we just want to find the $J$ and $\{R_j\}_{j=1}^J$ that minimize the above RSS, shouldn't we simply partition the feature space so that each rectangle contains only one training data point?
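(To make this concrete, here is a minimal sketch in R using the `tree` package from the book's labs; the simulated data and the loosened control settings are my own illustration, not from the book.)

```r
library(tree)  # the package used in the ISLR labs

set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

# Relax the stopping rules so the tree keeps splitting until every
# leaf (box) contains a single training observation.
full <- tree(y ~ x, data = d,
             control = tree.control(nobs = n, mincut = 1,
                                    minsize = 2, mindev = 0))

sum((d$y - predict(full, newdata = d))^2)  # training RSS: essentially 0
```

With distinct $x$ values every leaf ends up pure, so the RSS in the formula above is driven to zero, exactly as suggested.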

  • The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are the third and fourth authors. – Commented Feb 18, 2019 at 7:35

1 Answer


You're correct that a partition with a single training point per "box" would achieve zero error on the training set. But in the optimization problem the authors describe, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter: we choose its value up front, but must treat it as fixed when solving for the parameters that define the boxes. If $J$ is set to less than the number of data points, then using one box per data point is not a feasible solution.
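Here is a minimal sketch of that constraint, again assuming the `tree` package and simulated data of my own choosing: grow a maximal tree, then prune it back to a fixed $J$ with `prune.tree`. Once $J$ is held below the number of training points, zero training RSS is no longer attainable.

```r
library(tree)

set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

# Grow a maximal tree, then use cost-complexity pruning to get the
# best subtree with about J = 5 terminal nodes (prune.tree picks the
# closest size available in the pruning sequence).
full  <- tree(y ~ x, data = d,
              control = tree.control(nobs = n, mincut = 1,
                                     minsize = 2, mindev = 0))
small <- prune.tree(full, best = 5)

sum((d$y - predict(small, newdata = d))^2)  # strictly positive RSS
```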

This is a good thing, because we typically wouldn't want to end up with one box per data point. The problem with that solution is overfitting: if the data are noisy, perfect training-set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen or future data. The number of leaf nodes (i.e., boxes) and the related hyperparameters governing tree size control the tradeoff between over- and underfitting. They're typically tuned to optimize some estimate of expected generalization performance; minimizing training-set error isn't a valid way to do this.
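As an illustration of that tradeoff, here is a sketch (the data-generating process and the train/test split are my own assumptions) that prunes one large tree to several values of $J$ and compares training and test RSS:

```r
library(tree)

set.seed(2)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)

# Grow a maximal tree on the training half, then prune it back to
# different numbers of leaves J and compare training vs. test RSS.
big <- tree(y ~ x, data = d, subset = train,
            control = tree.control(nobs = length(train), mincut = 1,
                                   minsize = 2, mindev = 0))

for (J in c(2, 5, 10, 25, 50)) {
  fit <- prune.tree(big, best = J)
  rss_tr <- sum((d$y[train]  - predict(fit, d[train, ]))^2)
  rss_te <- sum((d$y[-train] - predict(fit, d[-train, ]))^2)
  cat(sprintf("J = %2d  train RSS = %6.2f  test RSS = %6.2f\n",
              J, rss_tr, rss_te))
}
```

As $J$ grows, the training RSS falls steadily toward zero, while the test RSS typically bottoms out at a moderate $J$ and then rises again; in practice one would choose $J$ by cross-validation (e.g. with `cv.tree`) rather than by training error.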

