I'm currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I'm using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age group, etc. It was suggested that I use a decision tree, which I've already built. Since I'm new to decision trees, I would like some assistance. I would like the prediction (target variable) to be expressed as a probability (%), not as a binary output. How can I achieve this? I also want to experiment with samples by entering the data myself and seeing what the outcome is. For instance: take someone who is 40, male, etc. and calculate what their survival chance is. How can I achieve this? I've attached the code below:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd

filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)
```

3 Answers

0

You can use the "predict_proba" method of the DecisionTreeClassifier to compute class probabilities instead of binary classification values.

To test individual samples that you create by hand, build an array with the same number of columns as your X_test data but only one row. You can then pass it to model.predict(array) or model.predict_proba(array).
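As a minimal sketch of the hand-made sample approach: the training data below is synthetic, and the feature names (age, is_male) and the risk rule are hypothetical, chosen only for illustration. predict_proba returns one column per class, ordered like model.classes_, so the sketch stores each class probability in its own DataFrame column rather than assigning the whole 2-D result to a single column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data: columns are [age, is_male]; label 1 = deceased.
rng = np.random.default_rng(0)
age = rng.integers(20, 90, size=500)
is_male = rng.integers(0, 2, size=500)
X = np.column_stack([age, is_male])
# Made-up rule: risk rises with age (illustration only, not real data).
y = (rng.random(500) < (age - 20) / 100).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# One hand-made person: 40 years old, male. Shape must be (1, n_features).
X_new = np.array([[40, 1]])
proba = model.predict_proba(X_new)  # shape (1, n_classes)

# One column per class, ordered like model.classes_.
df = pd.DataFrame(X_new, columns=["age", "is_male"])
for i, cls in enumerate(model.classes_):
    df[f"p_class_{cls}"] = proba[:, i]
print(df)
```

With real data, the hand-made row has to use the same column order and the same encoding as the training features.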

By the way, your tree as currently configured is not useful for retrieving probabilities. With the default settings it keeps splitting until the leaves are (typically) pure, so predict_proba returns only 0 or 1. There is an article that explains the problem very well: https://web.archive.org/web/20210507022823/https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html

So you can fix your code by setting the max_depth of your tree:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd

filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

# Limit the depth so leaves stay impure and yield fractional probabilities.
# Parameters not listed here keep their default values.
model = DecisionTreeClassifier(criterion='gini', max_depth=1, splitter='best')
model.fit(X_train, Y_train)

rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1, 8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns=names[:-1])
# predict_proba returns one value per class; list() stores the whole
# probability array for the row in a single cell.
df["predicted"] = list(YHat)
print(df)
```
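To illustrate why limiting the depth matters, here is a small sketch on synthetic data (generated with make_classification, not the dataset from the question; max_depth=3 is an arbitrary choice). It compares the distinct probabilities produced by a fully grown tree with those of a shallow one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (flip_y assigns a random label to about 10% of samples).
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

full_proba = full_tree.predict_proba(X_test)
shallow_proba = shallow_tree.predict_proba(X_test)

# Distinct predicted probabilities for class 1 from each tree.
print("fully grown:", np.unique(full_proba[:, 1]))
print("max_depth=3:", np.unique(shallow_proba[:, 1]))
```

The fully grown tree should only ever output 0 or 1, because each of its leaves contains training samples of a single class. The shallow tree reports the class fractions of its larger, mixed leaves. Which depth gives the most reliable probabilities depends on the data, so it is worth tuning (for example with cross-validation) rather than fixing it at 1.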

12 Comments

I get the following error when using the predict_proba function, ValueError: Wrong number of items passed 3, placement implies 1
Can you provide a reproducible example for debugging?
After clearing the variables in the console and rerunning the code I get a different error: raise ValueError("Classification metrics can't handle a mix of {0} " ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets. What do you mean by a reproducible example?
casting it to a list does the trick. Thanks a lot Kim!
I see it, will dive into it.
0

A decision tree can also estimate the probability that an instance belongs to a particular class. Call predict_proba() on your feature data, as below, to get the probability of each class you want to predict. model.predict() returns the class that has the highest probability.

model.predict_proba(X_test)
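The relationship between the two methods can be checked with a short sketch. The three-class data here is synthetic and the tree settings are arbitrary; the point is only that predict() matches the highest-probability column of predict_proba().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, for illustration only.
X, y = make_classification(n_samples=300, n_features=5, n_classes=3,
                           n_informative=3, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

proba = model.predict_proba(X[:5])   # one row per sample, one column per class
labels = model.predict(X[:5])        # the single most probable class per sample

# predict() corresponds to picking the highest-probability column.
from_proba = model.classes_[np.argmax(proba, axis=1)]
print(proba)
print(labels, from_proba)
```

Because there is one probability per class, a three-class target yields three values per row, which is consistent with the "Wrong number of items passed 3" error when the result is assigned to a single DataFrame column.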

1 Comment

Thanks Praks! However, I get the following error: ValueError: Wrong number of items passed 3, placement implies 1
0

Use the predict_proba method: model.predict_proba(X_test)

As for the second part of your question, here is what you will have to do. Create your own custom dataset with exactly the same column names as the data you trained on. Read your data from a CSV file and apply the same encoder values, if any.

You can also save your label encoder object in a much more efficient way.

```python
import pandas as pd
from sklearn import preprocessing

# `dataframe` is assumed to already hold the raw data.
label_encoder = preprocessing.LabelEncoder()
label_encoded_columns = ['Date_statistics_type', 'Agegroup', 'Sex', 'Province',
                         'Hospital_admission', 'Municipal_health_service', 'Deceased']

for col in label_encoded_columns:
    dataframe[col] = dataframe[col].astype(str)

# Fit a single encoder on the values of all listed columns at once.
Label_Encoder = label_encoder.fit(dataframe[label_encoded_columns].values.flatten())
Encoded_Array = Label_Encoder.transform(
    dataframe[label_encoded_columns].values.flatten()
).reshape(dataframe[label_encoded_columns].shape)
LE_Dataframe = pd.DataFrame(Encoded_Array, columns=label_encoded_columns,
                            index=dataframe.index)

# Mapping from each original value to its integer code,
# e.g. {'Apple': 0, 'Banana': 1}
LE_mapping = dict(zip(Label_Encoder.classes_,
                      Label_Encoder.transform(Label_Encoder.classes_).tolist()))
```

For the second part of your question, there are two ways. The first one is pretty straightforward: use rows of X_test to get predictions.

```python
# .iloc assumes X_test is a pandas DataFrame; for a NumPy array use X_test[0:30].
model.predict(X_test.iloc[0:30])        # first 30 rows
model.predict_proba(X_test.iloc[0:30])
```

In the second one, if you are talking about introducing new data, you will have to label encode the raw data again, using the encoder that was fitted on the training data.

If the new data contains a value the encoder did not see during fitting, transform will raise an error about previously unseen labels.

Refer to this link in that case
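A rough sketch of the new-data route follows. The tiny dataset is invented (only the column names come from the answer above), the `encode` helper is hypothetical, and one encoder per column is used here rather than the single shared encoder shown earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset; the values are invented for illustration.
train = pd.DataFrame({
    "Agegroup": ["40-49", "80-89", "60-69", "80-89", "40-49", "60-69"],
    "Sex":      ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Deceased": ["No", "Yes", "No", "Yes", "No", "Yes"],
})
features = ["Agegroup", "Sex"]

# Fit one encoder per feature column and keep it for later reuse.
encoders = {col: LabelEncoder().fit(train[col]) for col in features}
X = pd.DataFrame({col: encoders[col].transform(train[col]) for col in features})
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, train["Deceased"])

def encode(person):
    """Encode a dict of raw values with the encoders fitted on the training data."""
    return pd.DataFrame({col: encoders[col].transform([person[col]]) for col in features})

new_person = {"Agegroup": "40-49", "Sex": "Male"}
proba = model.predict_proba(encode(new_person))
print(dict(zip(model.classes_, proba[0])))

# A value the encoder never saw makes transform() raise a ValueError.
try:
    encode({"Agegroup": "20-29", "Sex": "Male"})
    unseen_error = None
except ValueError as err:
    unseen_error = str(err)
print(unseen_error)
```

LabelEncoder is kept here to match the answer; the scikit-learn documentation describes it as intended for target labels and points to OrdinalEncoder or OneHotEncoder for input features, which may be a better fit in practice.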

1 Comment

Thank you, this makes it more clear! Trying to use the predict_proba function now.
