I am comparing a few models (gradient boosting machine, random forest, logistic regression, SVM, multilayer perceptron, and a Keras neural network) on a multiclass classification problem. I have used nested cross-validation and grid search on my models, running them on my actual data and also on randomised data to check for overfitting. However, no matter how I change my data or model parameters, the gradient boosting machine gives me 100% accuracy on the random data every time. Is there something in my code that could be causing this?
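For context, my randomised-data check is meant to establish a chance baseline, similar in spirit to scikit-learn's built-in `permutation_test_score`, which shuffles the labels against a cross-validated model rather than generating random features. A minimal sketch of that alternative, assuming the `X_res`/`y_res` arrays defined in my code below:

```python
# Sketch only: permutation_test_score refits the model on label-shuffled data
# and reports how the real CV score compares to the permuted scores.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, permutation_test_score

cv = KFold(n_splits=10, shuffle=True, random_state=7)
score, perm_scores, pvalue = permutation_test_score(
    GradientBoostingClassifier(), X_res, y_res,
    cv=cv, n_permutations=30, scoring='accuracy')
print(score, perm_scores.mean(), pvalue)
```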
Here is my code:
```python
import numpy as np
import pandas as pd
from sklearn import preprocessing, model_selection
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], 1)
df = data.iloc[:, 0:26]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [100, 200, 500, 1000]}

rfc = RandomForestClassifier(n_estimators=500)
param_grid = {"max_depth": [3],
              "max_features": ["auto"],
              "min_samples_split": [2],
              "min_samples_leaf": [1],
              "bootstrap": [False],
              "criterion": ["entropy", "gini"]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(50, 50, 50)],
                   'activation': ['relu'],
                   'solver': ['adam'],
                   'max_iter': [10000],
                   'alpha': [0.0001],
                   'learning_rate': ['constant']}

gbm = GradientBoostingClassifier()
param = {"loss": ["deviance"],
         "learning_rate": [0.001],
         "min_samples_split": [2],
         "min_samples_leaf": [1],
         "max_depth": [3],
         "max_features": ["auto"],
         "criterion": ["friedman_mse"],
         "n_estimators": [50]}

svm = SVC(gamma="scale")
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

def baseline_model():
    model = Sequential()
    # dense layers perform: output = activation(dot(input, kernel) + bias)
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', KerasClassifier(build_fn=baseline_model, epochs=100, batch_size=50, verbose=0)))

results = []
names = []
scoring = 'accuracy'

X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean() * 100, nested_cv_results.std() * 100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test) * 100), '%')
```

Output:
```
Nested CV Accuracy GBM: 90.952381 (+/- 2.776644 )
Test set accuracy: 90.48 %
Nested CV Accuracy RFC: 79.285714 (+/- 5.112122 )
Test set accuracy: 75.00 %
Nested CV Accuracy LR: 91.904762 (+/- 4.416009 )
Test set accuracy: 92.86 %
Nested CV Accuracy SVM: 94.285714 (+/- 3.563483 )
Test set accuracy: 96.43 %
Nested CV Accuracy MLP: 91.428571 (+/- 4.012452 )
Test set accuracy: 92.86 %
```

Random data code:
```python
ran = np.random.randint(4, size=161)                  # random labels 0-3
random = np.random.normal(500, 100, size=(161, 161))  # random features
rand = np.column_stack((random, ran))                 # features plus label column
print(rand.shape)

X1 = rand[:161]   # first 161 rows of rand
Y1 = rand[:, -1]  # last column as labels

print("Random data counts of label '1': {}".format(sum(ran == 1)))
print("Random data counts of label '0': {}".format(sum(ran == 0)))
print("Random data counts of label '2': {}".format(sum(ran == 2)))
print("Random data counts of label '3': {}".format(sum(ran == 3)))

for name, model in models:
    cv_results = model_selection.cross_val_score(model, X1, Y1, cv=outer_cv, scoring=scoring)
    names.append(name)
    msg = "Random data CV %s: %f (+/- %f)" % (name, cv_results.mean() * 100, cv_results.std() * 100)
    print(msg)
```

Random data output:
```
Random data CV GBM: 100.000000 (+/- 0.000000)
Random data CV RFC: 62.941176 (+/- 15.306485)
Random data CV LR: 23.566176 (+/- 6.546699)
Random data CV SVM: 22.352941 (+/- 6.331220)
Random data CV MLP: 23.639706 (+/- 7.371392)
Random data CV Keras: 22.352941 (+/- 8.896451)
```

The gradient boosting classifier (GBM) stays at 100% whether I reduce the number of features or change the parameters in the grid search (I do normally put in multiple parameter values, but the search can then run for hours without finishing, so I have set that problem aside for now), and the result is the same if I try binary classification data.
The random forest (RFC) is also well above chance at 62%. Is there something I am doing wrong?
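In case it helps, this is a quick diagnostic I could run on the random arrays (hypothetical snippet, reusing the `random`, `rand`, `X1` and `Y1` variables from above):

```python
import numpy as np

# Diagnostic: compare the shapes of the random feature matrix and of X1,
# and test whether the label column ended up inside the features.
print(random.shape)                   # (161, 161) random features
print(rand.shape)                     # (161, 162) features plus label column
print(X1.shape)                       # rand[:161] slices rows, not columns
print(np.array_equal(X1[:, -1], Y1))  # True would mean X1 contains the labels
```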
The data I am using is predominantly binary features; as an example it looks like this (predicting the category column):
```
gene  Tissue  Druggable  Eigenvalue  CADDvalue  Catalogpresence  Category
ACE   1       1          1           0          1                Certain
ABO   1       0          0           0          0                Likely
TP53  1       1          0           0          0                Possible
```

Any guidance would be appreciated.