
I'm working with the fetch_kddcup99 dataset. Using pandas, I've converted the original data into something like this, with all categorical columns turned into dummy variables:

[Screenshot: preview of the preprocessed DataFrame]

Note that after dropping duplicates, the final dataframe only contains 149 observations.

Then I start the feature engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and converting y to 0/1.

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X = pd_data.drop(target, axis=1)
y = pd_data[target]
y = y.astype('int')

protocol_type = [['tcp', 'udp', 'icmp']]
col_transformer = ColumnTransformer([
    ("encoder_tipo1",
     OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
     ['protocol_type']),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
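As a side note on this step: by default, ColumnTransformer drops any columns that are not listed in its transformers, so only the three one-hot columns reach the later pipeline steps. If the intent is to keep the existing dummy columns as well (an assumption about the intent here), a minimal variant would pass them through:

# Hedged sketch: remainder='passthrough' keeps the remaining (already-dummy)
# columns alongside the encoded protocol_type; the default remainder='drop'
# discards every column not named in the transformer list.
col_transformer = ColumnTransformer(
    [
        ("encoder_tipo1",
         OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
         ['protocol_type']),
    ],
    remainder='passthrough',
)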

Finally I proceed to the model evaluation, which gives me the following result:

import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))

# selector = SelectFromModel(estimator=model)
scaler = option2                                   # scaler instance defined earlier in the notebook
selector = SelectKBest(score_func=f_classif, k=3)

results = []
for name, model in models:
    # preprocessing-only pipeline: encoding, scaling, feature selection
    pipeline = make_pipeline(col_transformer, scaler, selector)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    X_test_selected = pipeline.fit_transform(X_test, y_test)   # refits the preprocessing on the test set
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)

plt.boxplot(results)
plt.show()

[Figure: boxplots of the cross-validation accuracy scores for each model]

My question is: why do all the models give exactly the same scores? Could it be due to the small number of rows in the DataFrame, or am I doing something wrong?

1 Answer


You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples. That means each individual test fold can only take one of 13 possible accuracy values; even if the classifiers predict some samples a little differently, they may end up with the same accuracy. (The common scores you see (1, 0.88, 0.71) don't line up with the fractions I'd expect, though, so maybe I've missed something.) So yes, possibly it's just the small number of rows, compounded by the cross-validation. Selecting down to just 3 features probably also contributes.
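To make that granularity concrete, here is a small illustrative computation (not part of the original code) listing every accuracy a 12-sample fold can produce:

# With 12 samples in a fold, accuracy can only be k/12 for k = 0..12,
# i.e. 13 distinct values; classifiers that disagree on a sample or two
# can still land on exactly the same score.
possible_accuracies = [round(k / 12, 3) for k in range(13)]
print(possible_accuracies)
# [0.0, 0.083, 0.167, 0.25, 0.333, 0.417, 0.5, 0.583, 0.667, 0.75, 0.833, 0.917, 1.0]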

One quick thing to check is some continuous score of the models' performance, say log-loss or Brier score.
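A minimal sketch of that check, assuming the col_transformer, scaler and selector objects from the question are available; the model goes inside the pipeline so the preprocessing is refit on each training fold:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

kf = KFold(n_splits=10, shuffle=True, random_state=89)

# Log loss is continuous, so models that tie on accuracy can still be told apart.
# scoring='neg_log_loss' needs predict_proba, which LogisticRegression, LDA, KNN,
# DecisionTreeClassifier, GaussianNB and RandomForestClassifier provide;
# SVC() would need probability=True.
full_pipeline = make_pipeline(col_transformer, scaler, selector, LogisticRegression())
neg_log_losses = cross_val_score(full_pipeline, X_train, y_train, cv=kf, scoring='neg_log_loss')
print('mean log loss:', -neg_log_losses.mean())

# The Brier score works the same way (scoring='neg_brier_score' in recent
# scikit-learn versions) and is likewise continuous rather than a count-based fraction.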

(And GaussianNB is probably the wrong Naive Bayes variant to use with your data, which contains so many binary features.)
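For example, a hedged swap, assuming the features really are mostly 0/1 dummies (BernoulliNB binarizes its inputs at 0 by default, so it is best fed the unscaled dummy columns):

from sklearn.naive_bayes import BernoulliNB

# BernoulliNB models binary/indicator features directly, whereas GaussianNB
# assumes each feature is normally distributed within each class.
models.append(('BNB', BernoulliNB()))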


