
I was reading the book "An Introduction to Statistical Learning with Applications in R". On page 306, when discussing the objective function of tree models, the book says:

"The goal is to find boxes $R_1,...,R_J$" that minimize the RSS, given by" $$\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2,$$ where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes."

My question is: isn't the optimal solution to this RSS very obvious? We could just partition the whole feature space into $N$ rectangles such that each rectangle contains only one data point, and thereby achieve zero RSS. Setting test performance aside: if we just want to find the $J$ and $\{R_j\}_{j=1}^J$ that minimize the above RSS, shouldn't we simply partition the feature space so that each rectangle contains only one training data point?
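(To make this concrete, here is a minimal sketch in R using the `tree` package from the book's labs; the simulated data and the loosened control settings are my own illustration, not from the book.)

```r
library(tree)  # the package used in the ISLR labs

set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

# Relax the stopping rules so the tree keeps splitting until every
# leaf (box) contains a single training observation.
full <- tree(y ~ x, data = d,
             control = tree.control(nobs = n, mincut = 1,
                                    minsize = 2, mindev = 0))

sum((d$y - predict(full, newdata = d))^2)  # training RSS: essentially 0
```

With distinct $x$ values every leaf ends up pure, so the RSS in the formula above is driven to zero, exactly as suggested.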

  • The first and second authors of this book are Gareth James and Daniela Witten. Hastie and Tibshirani are the third and fourth authors. – Commented Feb 18, 2019 at 7:35

1 Answer


You're correct that a partition with a single training point per "box" would achieve zero error on the training set. But in the optimization problem the authors describe, the number of boxes $J$ isn't a free parameter to solve for. Rather, it's a hyperparameter: we choose its value up front, but must treat it as fixed when solving for the parameters that define the boxes. If $J$ is set to less than the number of data points, then using one box per data point is not a feasible solution.
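Here is a minimal sketch of that constraint, again assuming the `tree` package and simulated data of my own choosing: grow a maximal tree, then prune it back to a fixed $J$ with `prune.tree`. Once $J$ is held below the number of training points, zero training RSS is no longer attainable.

```r
library(tree)

set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

# Grow a maximal tree, then use cost-complexity pruning to get the
# best subtree with about J = 5 terminal nodes (prune.tree picks the
# closest size available in the pruning sequence).
full  <- tree(y ~ x, data = d,
              control = tree.control(nobs = n, mincut = 1,
                                     minsize = 2, mindev = 0))
small <- prune.tree(full, best = 5)

sum((d$y - predict(small, newdata = d))^2)  # strictly positive RSS
```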

This is a good thing, because we typically wouldn't want to end up with one box per data point. The problem with that solution is overfitting: if the data are noisy, perfect training-set performance simply indicates that we have fit the noise, and the model may not generalize well to unseen or future data. The number of leaf nodes (i.e., boxes) and the related hyperparameters governing tree size control the tradeoff between over- and underfitting. They're typically tuned to optimize some estimate of expected generalization performance; minimizing training-set error isn't a valid way to do this.
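As an illustration of that tradeoff, here is a sketch (the data-generating process and the train/test split are my own assumptions) that prunes one large tree to several values of $J$ and compares training and test RSS:

```r
library(tree)

set.seed(2)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)

# Grow a maximal tree on the training half, then prune it back to
# different numbers of leaves J and compare training vs. test RSS.
big <- tree(y ~ x, data = d, subset = train,
            control = tree.control(nobs = length(train), mincut = 1,
                                   minsize = 2, mindev = 0))

for (J in c(2, 5, 10, 25, 50)) {
  fit <- prune.tree(big, best = J)
  rss_tr <- sum((d$y[train]  - predict(fit, d[train, ]))^2)
  rss_te <- sum((d$y[-train] - predict(fit, d[-train, ]))^2)
  cat(sprintf("J = %2d  train RSS = %6.2f  test RSS = %6.2f\n",
              J, rss_tr, rss_te))
}
```

As $J$ grows, the training RSS falls steadily toward zero, while the test RSS typically bottoms out at a moderate $J$ and then rises again; in practice one would choose $J$ by cross-validation (e.g. with `cv.tree`) rather than by training error.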

