
I've been fitting a binary logistic extreme-gradient-boosted (XGBoost) model on different random samples of the data as training sets and computing the Gini index (coefficient). As I increase the proportion of data used for training, the Gini index increases; as I decrease it, the index decreases. I've tried different random seeds and the result is consistent (only small variations). Some examples of the Gini index at each training proportion (TP):

TP 10%:  Gini 0.004
TP 60%:  Gini 0.243
TP 80%:  Gini 0.288
TP 90%:  Gini 0.309
TP 100%: Gini 0.320

(Note that for binary classification, a Gini index of 0.5 indicates the worst possible predictive performance.)
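For concreteness, in the decision-tree sense (the Gini impurity, which one of the comments below links to; this may differ from a Lorenz-curve computation), the index for a class-probability vector is $1 - \sum_i p_i^2$. For two classes this reduces to $2p(1-p)$, which is 0 for a pure node and peaks at 0.5 when the classes are perfectly mixed. A minimal sketch:

```python
def gini_impurity(probs):
    """Gini impurity of a class-probability vector: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

# Binary case: impurity is 2*p*(1-p).
print(gini_impurity([0.5, 0.5]))  # 0.5 -- perfectly mixed, worst case
print(gini_impurity([1.0, 0.0]))  # 0.0 -- pure node, best case
```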

This seems counterintuitive to me: more training data should in general lead to better predictions. For what it's worth, when I put the question into Google, the AI Overview says "The Gini index, a measure of inequality, generally decreases as the proportion of training data increases, not the other way around. This is because a larger training set provides a more accurate representation of the underlying distribution, leading to a more robust and stable estimate of inequality..."

The Gini index behaves as expected otherwise (it decreases when more predictors are included in the model, etc.). What else might explain this odd behaviour?

  • Are you measuring the Gini on the training data? Or a test set? Or something else? Commented Jul 23 at 0:58
  • Great question. I've tried measuring on the training data, the test set, and both combined (the full data). It is less of a problem in the latter two cases, but still noticeable. I've tried other models/datasets and they don't have this issue — only this data/model. The results above were for the training data only. Commented Jul 23 at 1:27
  • medium.com/@kstarun/…, when talking about $Gini = 2\times AUC - 1$, says "In the context of credit risk modeling, a higher Gini coefficient indicates better model performance in terms of its ability to accurately rank borrowers based on their creditworthiness." Commented Jul 23 at 11:18
  • @Henry That link also says "[This Gini index] should not be confused with the traditional derivation of the Gini coefficient from the Lorenz curve, which is used to measure inequality in a distribution". The latter is what I'm using. Commented Jul 24 at 3:20
  • @Henry The usage of "Gini index" in this post is entirely standard in the context of fitting decision trees: en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity Commented Jul 26 at 17:29
