I know the question is two years old and the technical answer was given in the comments, but a more elaborate answer might help others still struggling with the concepts.
OP's ROC curve is wrong because they used the predicted values (class labels) of the model instead of the predicted probabilities.
What does this mean?
When a model is trained, it learns the relationship between the input variables and the output variable. For each observation it is shown, the model learns how probable it is that the observation belongs to a certain class. When the model is then presented with the test data, it estimates for each unseen observation the probability that it belongs to a given class.
How does the model decide whether an observation belongs to a class? Suppose that during testing the model receives an observation for which it estimates a probability of 51% of belonging to Class X. How does it decide whether or not to label it as Class X? The researcher sets a threshold telling the model that all observations with a probability below 50% must be classified as Y and all those above as X. Sometimes the researcher sets a stricter threshold because they are more interested in correctly predicting one class, such as X, than in classifying every observation equally well.
So your trained model has estimated a probability for each of your observations, but the threshold ultimately decides which class each observation is assigned to.
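As a sketch, thresholding the probabilities might look like this (the probability values here are made up for illustration):

```python
import numpy as np

# Hypothetical probability estimates of belonging to Class X
proba_x = np.array([0.10, 0.40, 0.51, 0.70, 0.95])

# Default rule: label as X (1) when P(X) is at least 50%, else Y (0)
labels_default = (proba_x >= 0.5).astype(int)
print(labels_default)  # [0 0 1 1 1]

# Stricter rule: only very confident observations are labelled X
labels_strict = (proba_x >= 0.9).astype(int)
print(labels_strict)   # [0 0 0 0 1]
```

The same probabilities produce different labels under different thresholds; nothing about the model changed, only the decision rule.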
Why does this matter?
The ROC curve plots one point (the false positive rate against the true positive rate) for each threshold level. This lets the researcher see the trade-off between the FPR and TPR across all thresholds.
So when you pass the predicted values instead of the predicted probabilities to your ROC function, you get only one point, because those values were computed at one specific threshold: that point is the TPR and FPR of your model at that threshold level.
What you need to do is use the probabilities instead and let the threshold vary.
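You can see this directly with sklearn's `roc_curve` (the labels and probabilities below are toy values, made up for illustration): hard labels give essentially a single operating point, while probabilities give one candidate point per distinct score.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_proba = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.55, 0.7, 0.3])
y_pred = (y_proba >= 0.5).astype(int)  # hard labels at one fixed threshold

# With hard labels the only possible scores are 0 and 1,
# so the "curve" is a single real operating point
fpr_l, tpr_l, thr_l = roc_curve(y_true, y_pred)

# With probabilities, each distinct score becomes a candidate threshold,
# tracing out a full curve
fpr_p, tpr_p, thr_p = roc_curve(y_true, y_proba)

print(len(thr_l), len(thr_p))  # the probability-based curve has more points
```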
Run your model as such:
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)

# Use the predicted labels for your confusion matrix
knn_y_model = knn_model.predict(X=X_test)

# Use the probabilities for your ROC and precision-recall curves
knn_y_proba = knn_model.predict_proba(X=X_test)
```
When creating your confusion matrix, use the predicted labels of your model:
```python
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
import matplotlib.pyplot as plt

fig, ax = plot_confusion_matrix(
    conf_mat=confusion_matrix(y_test, knn_y_model),
    show_absolute=True,
    show_normed=True,
    colorbar=True,
)
plt.title("Confusion matrix - KNN")
plt.ylabel('True label')
plt.xlabel('Predicted label')
```
When creating your ROC curve, use the probabilities:
```python
import scikitplot as skplt
import matplotlib.pyplot as plt

skplt.metrics.plot_roc(y_test, knn_y_proba)
plt.title("ROC Curves - K-Nearest Neighbors")
```
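The same logic applies to the precision-recall curve mentioned earlier: it also needs the probabilities, not the labels. A minimal sketch with plain sklearn (note that `predict_proba` returns one column per class, so sklearn's curve functions expect the positive-class scores, e.g. `knn_y_proba[:, 1]` for a binary problem; the values below are toy data for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy binary labels and positive-class probabilities
y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.55, 0.7, 0.3])

# One candidate threshold per distinct score, just like the ROC curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(len(thresholds))
```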