Long story short:
The boundary will lie between the right-most point of the "left" class and the left-most point of the "right" class.
That's because the algorithm sorts your data points by one feature (in this case, how far right or left they are) and figures out that the split should lie somewhere between the right-most value of the class on the left and the left-most value of the class on the right. Most implementations then simply take the halfway point between these two values.
Therefore, it makes no difference how "imbalanced" your problem is, as long as you can cleanly split your classes.
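As a minimal sketch of that idea (not scikit-learn's actual implementation, which also evaluates an impurity criterion such as Gini), the candidate thresholds for one feature are just the midpoints between adjacent sorted values. Using the Feature 1 values from the example below:

```python
import numpy as np

# Toy sketch: candidate thresholds for one feature are the midpoints
# between adjacent sorted values (Feature 1 values from the example below)
feature_values = np.array([0.1, 0.9, 0.88, 0.92])
sorted_values = np.sort(feature_values)                      # [0.1, 0.88, 0.9, 0.92]
candidate_thresholds = (sorted_values[:-1] + sorted_values[1:]) / 2
print(candidate_thresholds)                                  # [0.49 0.89 0.91]
# The threshold that cleanly separates the single "A" point from the "B" cluster is 0.49
```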
Here is the long story:
This is not a stupid question at all. First, since you describe a rectangular region, I'll assume your input has two continuous features. For the sake of argument, let's say that your data looks like this:
| ID | Feature 1 | Feature 2 | Class |
|----|-----------|-----------|-------|
| 0  | 0.1       | 0.3       | A     |
| 1  | 0.9       | 1.0       | B     |
| 2  | 0.88      | 0.99      | B     |
| 3  | 0.92      | 0.94      | B     |
So you have 1 data point in one corner, and 3 data points in the other corner. If we run this using scikit-learn:
```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Your data
X = np.array([[0.1, 0.3], [0.9, 1.0], [0.88, 0.99], [0.92, 0.94]])  # Features
y = np.array(["A", "B", "B", "B"])  # Labels

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier()

# Fit the model to your data
clf.fit(X, y)

# Now, clf is your trained model, and you can use it to make predictions
# For example, to predict the class of a new sample with Feature 1 = 0.5 and Feature 2 = 0.5
new_sample = np.array([[0.5, 0.5]])
prediction = clf.predict(new_sample)
print(f"Prediction for the new sample is: {prediction}")
```
If you run this multiple times, you'll notice that the point in the middle of the rectangle is sometimes classified as A and sometimes as B. Why is that? In this simple example you only need to split on one feature, and which feature you split on makes a difference. Splitting on either feature gives you a "perfect" split, so the classifier can pick either, and there is some randomness in which one it picks. Because this code doesn't set a random seed, you can get different results from run to run.
- If you split on feature 1, you get this split:

Why? Because the midpoint between 0.1 and 0.88 is 0.49.
- If you split on feature 2, you get this one:

Why? Because the midpoint between 0.3 and 0.94 is 0.62.
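If you want to verify which feature and threshold the fitted tree actually picked, you can read them off the estimator's `tree_` attribute; fixing `random_state` below is just an added assumption to make the (otherwise arbitrary) choice reproducible:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1, 0.3], [0.9, 1.0], [0.88, 0.99], [0.92, 0.94]])
y = np.array(["A", "B", "B", "B"])

# Fixing random_state makes the (otherwise arbitrary) choice of feature reproducible
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Node 0 is the root node; with this data the tree has a single split
root_feature = clf.tree_.feature[0]      # 0 -> "Feature 1", 1 -> "Feature 2"
root_threshold = clf.tree_.threshold[0]  # 0.49 or 0.62, depending on the feature chosen
print(f"Root split: feature index {root_feature}, threshold {root_threshold:.2f}")
```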
I used the following code to plot the decision trees:
```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Assuming clf is your trained Decision Tree Classifier
plt.figure(figsize=(10, 8))  # Set the figure size for better visibility
plot_tree(
    clf,
    filled=True,
    rounded=True,
    feature_names=["Feature 1", "Feature 2"],
    class_names=["A", "B"],
)
plt.show()
```
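If you prefer a quick text dump over a plot, scikit-learn's `export_text` prints the same structure (assuming `clf` is the classifier fitted above):

```python
from sklearn.tree import export_text

# Text-based view of the fitted tree, showing the chosen feature and threshold
print(export_text(clf, feature_names=["Feature 1", "Feature 2"]))
```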