In this similar question about the implications of standardizing the features of the data, the answer is that it is not important. However (as pointed out in the comments on that post), I am interested in whether standardizing the whole data set (features and response) matters when using non-zero shrinkage parameters such as lambda, alpha and/or gamma with the tree booster (not the linear one) in a regression setting. What I want to find out is: 1) Are there any distortions if no standardization is used? 2) Does standardizing the data prior to fitting an XGBoost model with the parameters above bring any benefits at all?
Here are my thoughts so far:
For example, when using LASSO, standardization is very important, since the absolute values of the coefficients are summed in the penalty term. If one feature is on a much larger scale than the response and than another feature, it only needs a tiny coefficient, so even though both features contribute equally to reducing the SSR, the large-scale feature's coefficient is penalized less and the feature may mistakenly be identified as more relevant, simply because its coefficient is considered less 'costly'. Clearly, not standardizing is a distortion here. With XGBoost there are no such coefficients on the features, so one might believe there is no need for standardization.
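To make the LASSO point concrete, here is a minimal synthetic sketch (using sklearn's `Lasso`; the features `x1`/`x2` and the penalty value are purely illustrative, not part of any real data set):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Two equally informative features, one on a much larger scale.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)             # "small-scale" feature
x2 = rng.normal(size=n) * 1000.0    # same information, but scaled up by 1000
y = x1 + x2 / 1000.0 + rng.normal(scale=0.1, size=n)  # both contribute equally to y

X = np.column_stack([x1, x2])
lasso = Lasso(alpha=0.5)

# Unstandardized: the large-scale feature only needs a tiny coefficient, so its
# L1 penalty is negligible and it is barely shrunk, while x1 is shrunk heavily.
print(lasso.fit(X, y).coef_)

# Standardized: both features are penalized on an equal footing.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(lasso.fit(X_std, y).coef_)
```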
However, looking at the formula for the weights from the documentation:
$$ w_j^* = -\frac{G_j}{H_j + \lambda} $$
since $G_j$ is the sum of the gradients over the observations in leaf $j$ (for squared loss, the gradient of each observation is just its residual, up to sign), it should change drastically upon standardization: rescaling the response by a factor $c$ rescales the residuals, and hence $G_j$, by the same factor. $H_j$, the sum of the Hessians, is just the number of observations in the leaf for squared loss, which remains the same upon standardization. Therefore, given the formula above, $w_j^*$ should change drastically upon standardization of the data while keeping the same $\lambda$ and $\gamma$; see also this video. So I expect the optimal values of $\lambda$ and $\gamma$ to be very different depending on whether the data are standardized or not (for large scales I'd expect the differences to be in the 1000s, if not more). This, however, doesn't tell me whether standardization brings anything to the table. After some simulations, neither seems to be the case: similar (though not identical) results show up for the same parameter values both without standardization and with standardization (followed by re-scaling the predictions). So standardization doesn't seem to matter, but I can't base this on just a simple simulation. Some theoretical arguments are needed.
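A stripped-down sketch of the kind of comparison I mean (synthetic data, not my exact setup) looks like the following, using xgboost's `XGBRegressor`, where `reg_lambda` plays the role of $\lambda$ and `gamma` of $\gamma$; the feature scales and parameter values are just placeholders:

```python
import numpy as np
import xgboost as xgb

# Synthetic regression problem with features on very different scales.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])
y = 5000.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=100.0, size=n)

params = dict(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,  # lambda
    gamma=1.0,       # gamma (minimum split loss)
)

# Fit on the raw data.
model_raw = xgb.XGBRegressor(**params).fit(X, y)
pred_raw = model_raw.predict(X)

# Fit on standardized features and response with the same lambda/gamma,
# then transform the predictions back to the original scale.
X_mu, X_sd = X.mean(axis=0), X.std(axis=0)
y_mu, y_sd = y.mean(), y.std()
model_std = xgb.XGBRegressor(**params).fit((X - X_mu) / X_sd, (y - y_mu) / y_sd)
pred_std = model_std.predict((X - X_mu) / X_sd) * y_sd + y_mu

print("RMSE raw:", np.sqrt(np.mean((y - pred_raw) ** 2)))
print("RMSE std:", np.sqrt(np.mean((y - pred_std) ** 2)))
```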