In this similar question about the implications of standardizing the features of the data, the answer is that it is not important. However (as pointed out in the comments on that post), I am interested in whether standardizing the whole data set (features and response) matters when using non-zero shrinkage parameters such as lambda, alpha and/or gamma with the tree booster (not the linear one) in a regression setting. What I want to find out is: 1) Are there any distortions if no standardization is used? 2) Does standardizing the data prior to fitting an XGBoost model with the parameters above bring any benefits at all?
Here are my thoughts so far:
For example, when using LASSO, standardization is very important, since the absolute values of the coefficients are summed in the penalty term. If one feature is on a much larger scale than the response and than another feature, it only needs a tiny coefficient, so even though both features contribute equally to reducing the SSR, the large-scale feature's coefficient is penalized less and the feature may mistakenly be identified as more relevant, simply because its coefficient is considered less 'costly'. Clearly, not standardizing is a distortion here. With XGBoost there are no such coefficients on the features, so one might believe there is no need for standardization.
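To make the LASSO point concrete, here is a minimal synthetic sketch (using sklearn's `Lasso`; the features `x1`/`x2` and the penalty value are purely illustrative, not part of any real data set):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two equally informative features, one on a much larger scale.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)             # "small-scale" feature
x2 = rng.normal(size=n) * 1000.0    # same information, but scaled up by 1000
y = x1 + x2 / 1000.0 + rng.normal(scale=0.1, size=n)  # both contribute equally to y

X = np.column_stack([x1, x2])
lasso = Lasso(alpha=0.5)

# Unstandardized: the large-scale feature only needs a tiny coefficient, so its
# L1 penalty is negligible and it is barely shrunk, while x1 is shrunk heavily.
print(lasso.fit(X, y).coef_)

# Standardized: both features are penalized on an equal footing.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(lasso.fit(X_std, y).coef_)
```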
However, looking at the formula for the weights from the documentation:
$$ w_j^* = -\frac{G_j}{H_j + \lambda} $$
since $G_j$ is the sum of the gradients over the observations in leaf $j$ (for squared loss, the gradient of each observation is just its residual, up to sign), it should change drastically upon standardization: rescaling the response by a factor $c$ rescales the residuals, and hence $G_j$, by the same factor. $H_j$, the sum of the Hessians, is just the number of observations in the leaf for squared loss, which remains the same upon standardization. Therefore, given the formula above, $w_j^*$ should change drastically upon standardization of the data while keeping the same $\lambda$ and $\gamma$; see also this video. So I expect the optimal values of $\lambda$ and $\gamma$ to be very different depending on whether the data are standardized or not (for large scales I'd expect the differences to be in the 1000s, if not more). This, however, doesn't tell me whether standardization brings anything to the table. After some simulations, neither seems to be the case: similar (though not identical) results show up for the same parameter values both without standardization and with standardization (followed by re-scaling the predictions). So standardization doesn't seem to matter, but I can't base this on just a simple simulation. Some theoretical arguments are needed.
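A stripped-down sketch of the kind of comparison I mean (synthetic data, not my exact setup) looks like the following, using xgboost's `XGBRegressor`, where `reg_lambda` plays the role of $\lambda$ and `gamma` of $\gamma$; the feature scales and parameter values are just placeholders:

```python
import numpy as np
import xgboost as xgb

# Synthetic regression problem with features on very different scales.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])
y = 5000.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=100.0, size=n)

params = dict(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,  # lambda
    gamma=1.0,       # gamma (minimum split loss)
)

# Fit on the raw data.
model_raw = xgb.XGBRegressor(**params).fit(X, y)
pred_raw = model_raw.predict(X)

# Fit on standardized features and response with the same lambda/gamma,
# then transform the predictions back to the original scale.
X_mu, X_sd = X.mean(axis=0), X.std(axis=0)
y_mu, y_sd = y.mean(), y.std()
model_std = xgb.XGBRegressor(**params).fit((X - X_mu) / X_sd, (y - y_mu) / y_sd)
pred_std = model_std.predict((X - X_mu) / X_sd) * y_sd + y_mu

print("RMSE raw:", np.sqrt(np.mean((y - pred_raw) ** 2)))
print("RMSE std:", np.sqrt(np.mean((y - pred_std) ** 2)))
```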