I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original dataset into something like this, with all the dummy variables:
Note that after dropping duplicates, the final dataframe only contains 149 observations.
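For context, the preparation was roughly along these lines (a simplified sketch assuming a recent scikit-learn where fetch_kddcup99 exposes feature_names; the exact filtering that leaves 149 rows is omitted, and the target construction and dummy-encoded columns shown here are only illustrative):

import pandas as pd
from sklearn.datasets import fetch_kddcup99

kdd = fetch_kddcup99()
pd_data = pd.DataFrame(kdd.data, columns=kdd.feature_names)

# the categorical columns come back as byte strings, so decode them
for col in ['protocol_type', 'service', 'flag']:
    pd_data[col] = pd_data[col].str.decode('utf-8')

# dummy-encode the other categoricals here (assumed); protocol_type stays a
# string and is one-hot encoded later in the pipeline
pd_data = pd.get_dummies(pd_data, columns=['service', 'flag'])

# illustrative binary target: 1 for normal traffic, 0 for attacks
target = 'target'
pd_data[target] = (kdd.target == b'normal.').astype(int)

pd_data = pd_data.drop_duplicates()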
Then I start the feature engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and converting y to 0/1:
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y = y.astype('int')

protocol_type = [['tcp', 'udp', 'icmp']]
col_transformer = ColumnTransformer([
    ("encoder_tipo1", OneHotEncoder(categories=protocol_type, handle_unknown='ignore'), ['protocol_type']),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)

Finally I proceed to the model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))

#selector = SelectFromModel(estimator=model)
scaler = option2  # scaler chosen earlier (definition not shown)
selector = SelectKBest(score_func=f_classif, k=3)

results = []
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    X_test_selected = pipeline.fit_transform(X_test, y_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)

plt.boxplot(results)
plt.show()

My question is: why do all the models give the same results? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
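For reference, here is a small diagnostic I could add (not part of my code above; it assumes a reasonably recent scikit-learn where a fitted Pipeline exposes get_feature_names_out) to see what the pipeline actually feeds each model:

# fit the preprocessing pipeline once and inspect its output
check_pipeline = make_pipeline(col_transformer, scaler, selector)
X_train_checked = check_pipeline.fit_transform(X_train, y_train)

print(X_train_checked.shape)                   # rows x features surviving SelectKBest
print(check_pipeline.get_feature_names_out())  # names of the surviving features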