When I change the order of the input columns for sklearn's DecisionTreeClassifier, the accuracy appears to change. This shouldn't be the case. What am I doing wrong?
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 1:], X_train[:, :1])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 2:], X_train[:, :2])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:, 3:], X_train[:, :3])), y_train)
print(clf.score(X_test, y_test))
```

Running this code produces the following output:
```
0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333
```

This was asked three years ago, but the question was downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?
Edit
In the above code I forgot to apply the column reordering to the test data.
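For reference, a minimal sketch of that fix for the first snippet, assuming the intent is to keep the train and test columns aligned (the `perm` list is my own naming):

```python
# Sketch of the fix: apply the same column permutation (`perm`, my own
# naming) to both the training and the test rows before scoring.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)

perm = [1, 2, 3, 0]  # columns rotated left by one, as in the second run above
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train[:, perm], y_train)
score = clf.score(X_test[:, perm], y_test)  # test columns permuted identically
print(score)
```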
However, I have found that the different results persist even when the reordering is applied to the whole dataset.
First I import the data and turn it into a pandas dataframe.
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])
```

I then select all of the data via the originally ordered feature names, and train and evaluate the model.
```python
X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.7062937062937062
```

I then select a different order of the same columns to train and evaluate the model.

```python
X = iris[iris_features[2:] + iris_features[:2]].values
print(X.shape[1], iris_features[2:] + iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.8881118881118881
```

Why do I still get different results?
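To see where the two fits actually diverge, here is a sketch that prints both fitted trees with `sklearn.tree.export_text` (available in scikit-learn >= 0.21); the two column orders mirror the two runs above:

```python
# Sketch: compare the two fitted trees side by side.
# The two `order` lists mirror the original and rotated column orders above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
names = list(iris['feature_names'])
X, y = iris['data'], iris['target']

trees = []
for order in ([0, 1, 2, 3], [2, 3, 0, 1]):  # original vs. rotated by two
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, order], y, test_size=0.95, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # export_text maps node features back to the permuted column names
    trees.append(export_text(clf, feature_names=[names[i] for i in order]))

print(trees[0])
print(trees[1])
```

If the two printed trees split on different features or thresholds, that pinpoints where the column order changed the fit.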
