Question regarding DecisionTreeClassifier

Question

I am making an explainable model with the past data, and not going to use it for future prediction at all.

In the data, there are a hundred X variables, and one Y binary class and trying to explain how Xs have effects on Y binary (0 or 1).

I came up with DecisionTree classifier as it clearly shows us that how decisions are made by value criterion of each variable

Here are my questions:

Is it necessary to split X data into X_test, X_train even though I am not going to predict with this model? ( I do not want to waste data for the test since I am interpreting only)
After I split the data and train model, only a few values get feature importance values (like 3 out of 100 X variables) and rest of them go to zero. Therefore, there are only a few branches. I do not know reason why it happens.

If here is not the right place to ask such question, please let me know.

Thanks.

pu239 · Accepted Answer · 2021-05-14 05:00:09Z

No it is not necessary but it is a way to check if your decision tree is overfitting and just remembering the input values and classes or actually learning the pattern behind it. I would suggest you look into cross-validation since it doesn't 'waste' any data and trains and tests on all the data. If you need me to explain this further, leave a comment.
Getting any number of important features is not an issue since it does depend very solely on your data.
Example: Let's say I want to make a model to tell if a number will be divisible by 69 (my Y class).
I have my X variables as divisibility by 2,3,5,7,9,13,17,19 and 23. If I train the model correctly, I will get feature importance of only 3 and 23 as very high and everything else should have very low feature importance.
Consequently, my decision tree (trees if using ensemble models like Random Forest / XGBoost) will have less number of splits. So, having less number of important features is normal and does not cause any problems.

Rahul · Accepted Answer · 2021-05-14 05:10:22Z

No, it isn't. However, I would still split train-test and measure performance separately. While an explainable model is nice, it is significantly less nicer if it's a crap model. I'd make sure it had at least a reasonable performance before considering interpretation, at which point the splitting is unnecessary.
The number of important features is data-dependent. Random forests do a good job providing this as well. In any case, fewer branches is better. You want a simpler tree, which is easier to explain.

2 Answers 2