I'm currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I'm using a Dutch dataset with categorical variables: date of infection, outcome (fatality or cured), gender, age group, etc. It was suggested that I use a decision tree, which I've already built. Since I'm new to decision trees, I would like some assistance.

I would like the prediction (the target variable) to be expressed as a probability (%), not as a binary output. How can I achieve this?

I also want to experiment with samples by entering the data myself and seeing what the outcome is. For instance, take someone who is 40, male, etc. and calculate his chance of survival. How can I achieve this? I've attached the code below:
    import random as rnd

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = pd.read_csv(filename, names=names)
    array = dataframe.values
    X = array[:, 0:8]  # the eight feature columns
    Y = array[:, 8]    # the 'class' column (target)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

    model = DecisionTreeClassifier()
    model.fit(X_train, Y_train)
    # model.fit echoed the default settings, e.g. criterion='gini',
    # max_depth=None, min_samples_leaf=1, min_samples_split=2

    # Pick one random row from the data and predict for it
    rnd.seed(123458)
    X_new = X[rnd.randrange(X.shape[0])]
    X_new = X_new.reshape(1, 8)
    YHat = model.predict_proba(X_new)  # shape (1, 2): one column per class

    df = pd.DataFrame(X_new, columns=names[:-1])
    # Keep only the second column of YHat (class 1 here); assigning the
    # whole (1, 2) array to a single DataFrame column fails
    df["predicted"] = YHat[:, 1]
    print(df)
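To make the two goals above concrete (a probability instead of a class label, and a prediction for a hand-entered person), here is a minimal sketch of one possible setup. It does not use the Dutch dataset: the column names `gender`, `age_group` and `fatal`, the categories, and the fatality rates are all made up for illustration. It assumes the categorical columns are one-hot encoded before they reach the tree, and that `min_samples_leaf` is raised from its default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: column names, categories and rates are made up.
rng = np.random.default_rng(0)
n = 2000
data = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n),
    "age_group": rng.choice(["0-39", "40-59", "60-79", "80+"], size=n),
})
# Hypothetical fatality rates per age group, only to give the tree a signal.
rate = data["age_group"].map(
    {"0-39": 0.01, "40-59": 0.05, "60-79": 0.20, "80+": 0.40}
)
data["fatal"] = (rng.random(n) < rate).astype(int)

features = ["gender", "age_group"]
model = Pipeline([
    # Turn each category into its own 0/1 column.
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), features),
    ])),
    # min_samples_leaf keeps each leaf large enough that its class
    # fractions are not forced to be exactly 0 or 1.
    ("tree", DecisionTreeClassifier(min_samples_leaf=50, random_state=0)),
])
model.fit(data[features], data["fatal"])

# A hand-entered case: a man in the 40-59 age group.
person = pd.DataFrame([{"gender": "male", "age_group": "40-59"}])
proba = model.predict_proba(person)[0]

# Look up which column of predict_proba belongs to class 1 (fatal).
classes = list(model.named_steps["tree"].classes_)
p_fatal = proba[classes.index(1)]
print(f"fatality: {p_fatal:.1%}, survival: {1 - p_fatal:.1%}")
```

`predict_proba` returns, for each class, the fraction of training samples of that class in the leaf the sample lands in. With the default settings the tree is grown until its leaves are pure wherever possible, so those fractions are usually exactly 0 or 1; limiting the tree (for example with `min_samples_leaf` or `max_depth`) is what makes the output look like a graded probability.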