I am comparing a few models (gradient boosting machine, random forest, logistic regression, SVM, multilayer perceptron, and a Keras neural network) on a multiclass classification problem. I have used nested cross-validation and grid search on my models, running them on my actual data and also on randomised data to check for overfitting. However, no matter how I change my data or model parameters, the gradient boosting machine gives me 100% accuracy on the random data every time. Is there something in my code that could be causing this?
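For context, my randomised-data check is meant to establish a chance baseline, similar in spirit to scikit-learn's built-in `permutation_test_score`, which shuffles the labels against a cross-validated model rather than generating random features. A minimal sketch of that alternative, assuming the `X_res`/`y_res` arrays defined in my code below:

```python
# Sketch only: permutation_test_score refits the model on label-shuffled data
# and reports how the real CV score compares to the permuted scores.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, permutation_test_score

cv = KFold(n_splits=10, shuffle=True, random_state=7)
score, perm_scores, pvalue = permutation_test_score(
    GradientBoostingClassifier(), X_res, y_res,
    cv=cv, n_permutations=30, scoring='accuracy')
print(score, perm_scores.mean(), pvalue)
```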
Here is my code:
```python
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], 1)
df = data.iloc[:, 0:26]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [100, 200, 500, 1000]}

rfc = RandomForestClassifier(n_estimators=500)
param_grid = {"max_depth": [3],
              "max_features": ["auto"],
              "min_samples_split": [2],
              "min_samples_leaf": [1],
              "bootstrap": [False],
              "criterion": ["entropy", "gini"]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(50, 50, 50)],
                   'activation': ['relu'],
                   'solver': ['adam'],
                   'max_iter': [10000],
                   'alpha': [0.0001],
                   'learning_rate': ['constant']}

gbm = GradientBoostingClassifier()
param = {"loss": ["deviance"],
         "learning_rate": [0.001],
         "min_samples_split": [2],
         "min_samples_leaf": [1],
         "max_depth": [3],
         "max_features": ["auto"],
         "criterion": ["friedman_mse"],
         "n_estimators": [50]}

svm = SVC(gamma="scale")
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

def baseline_model():
    model = Sequential()
    # dense layers perform: output = activation(dot(input, kernel) + bias)
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', KerasClassifier(build_fn=baseline_model, epochs=100, batch_size=50, verbose=0)))

results = []
names = []
scoring = 'accuracy'

X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean() * 100, nested_cv_results.std() * 100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test) * 100), '%')
```

Output:
```
Nested CV Accuracy GBM: 90.952381 (+/- 2.776644 )
Test set accuracy: 90.48 %
Nested CV Accuracy RFC: 79.285714 (+/- 5.112122 )
Test set accuracy: 75.00 %
Nested CV Accuracy LR: 91.904762 (+/- 4.416009 )
Test set accuracy: 92.86 %
Nested CV Accuracy SVM: 94.285714 (+/- 3.563483 )
Test set accuracy: 96.43 %
Nested CV Accuracy MLP: 91.428571 (+/- 4.012452 )
Test set accuracy: 92.86 %
```

Random data code:
```python
ran = np.random.randint(4, size=161)                  # random labels 0-3
random = np.random.normal(500, 100, size=(161, 161))  # random features
rand = np.column_stack((random, ran))                 # features plus label column
print(rand.shape)

X1 = rand[:161]   # first 161 rows of rand
Y1 = rand[:, -1]  # last column as labels

print("Random data counts of label '1': {}".format(sum(ran == 1)))
print("Random data counts of label '0': {}".format(sum(ran == 0)))
print("Random data counts of label '2': {}".format(sum(ran == 2)))
print("Random data counts of label '3': {}".format(sum(ran == 3)))

for name, model in models:
    cv_results = model_selection.cross_val_score(model, X1, Y1, cv=outer_cv, scoring=scoring)
    names.append(name)
    msg = "Random data CV %s: %f (+/- %f)" % (name, cv_results.mean() * 100, cv_results.std() * 100)
    print(msg)
```

Random data output:
```
Random data CV GBM: 100.000000 (+/- 0.000000)
Random data CV RFC: 62.941176 (+/- 15.306485)
Random data CV LR: 23.566176 (+/- 6.546699)
Random data CV SVM: 22.352941 (+/- 6.331220)
Random data CV MLP: 23.639706 (+/- 7.371392)
Random data CV Keras: 22.352941 (+/- 8.896451)
```

The gradient boosting classifier (GBM) stays at 100% whether I reduce the number of features or change the parameters in the grid search (I do normally put in multiple parameter values, but the search can then run for hours without finishing, so I have set that problem aside for now), and the result is the same if I try binary classification data.
The random forest (RFC) is also well above chance at 62%. Is there something I am doing wrong?
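In case it helps, this is a quick diagnostic I could run on the random arrays (hypothetical snippet, reusing the `random`, `rand`, `X1` and `Y1` variables from above):

```python
import numpy as np

# Diagnostic: compare the shapes of the random feature matrix and of X1,
# and test whether the label column ended up inside the features.
print(random.shape)                   # (161, 161) random features
print(rand.shape)                     # (161, 162) features plus label column
print(X1.shape)                       # rand[:161] slices rows, not columns
print(np.array_equal(X1[:, -1], Y1))  # True would mean X1 contains the labels
```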
The data I am using is predominantly binary features; as an example it looks like this (predicting the category column):
```
gene  Tissue  Druggable  Eigenvalue  CADDvalue  Catalogpresence  Category
ACE   1       1          1           0          1                Certain
ABO   1       0          0           0          0                Likely
TP53  1       1          0           0          0                Possible
```

Any guidance would be appreciated.