
I am currently working on the Titanic dataset from Kaggle. The dataset is imbalanced, with roughly 61.5% negative and 38.5% positive class.

I divided my training data into an 85% train set and a 15% validation set, and chose a support vector classifier as the model. I ran 10-fold stratified cross-validation on the train set and, for each fold, searched for the threshold that maximizes the F1 score (a sketch of this per-fold search is below). Averaging the thresholds obtained across the validation folds gives roughly 35% +/- 10%.
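Roughly, the per-fold threshold search looks like the following sketch. Here `X_train` and `y_train` are placeholders for my 85% training split (as NumPy arrays), and the grid of candidate thresholds is arbitrary:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_thresholds = []

for train_idx, val_idx in skf.split(X_train, y_train):
    clf = SVC(probability=True)  # probability=True enables predict_proba
    clf.fit(X_train[train_idx], y_train[train_idx])
    probs = clf.predict_proba(X_train[val_idx])[:, 1]

    # sweep candidate thresholds and keep the one with the best F1 on this fold
    candidates = np.linspace(0.05, 0.95, 91)
    f1s = [f1_score(y_train[val_idx], (probs >= t).astype(int)) for t in candidates]
    fold_thresholds.append(candidates[np.argmax(f1s)])

print("mean threshold: %.2f +/- %.2f" % (np.mean(fold_thresholds), np.std(fold_thresholds)))
```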

After that, I tested the model on the validation set and estimated the threshold that maximizes the F1 score there. That threshold is about 63%, which is very far from the one obtained during cross-validation.

I then tested the model on the holdout test set from Kaggle and could not get a good score with either threshold (35% from cross-validation on the train set, 63% from the validation set).


How does one determine, from the available data, an optimal threshold that also works well on unseen data? Do I choose the threshold obtained from cross-validation or the one from the validation set? Or am I doing this completely wrong? I would appreciate any help and advice.

For this dataset, I am looking to maximize my score on the leaderboard, which is based on accuracy.

Thank you.

  • Highest accuracy or $F_1$? – Commented Jun 16, 2021 at 9:50
  • Although I originally wanted to get the highest F1 score, for this Kaggle competition the metric used for scoring is accuracy. But I would like to know how to optimize the threshold to get the highest F1 score too. – Commented Jun 16, 2021 at 10:18

1 Answer


In short, you should be the judge of that: it depends on how much precision (minimising "false alarms", i.e. false positives) and how much recall (minimising "missed positives", i.e. false negatives) you want your classifier to have.

The appropriate way to inspect precision-recall pairs at different thresholds is a precision-recall curve (PRC), especially if you want to focus on the minority class. From a PRC you can find the threshold that is optimal, as far as model performance goes, as a function of precision and recall.

I copy below a snippet (lightly adapted so that it runs; `trainX`, `trainy`, `testX`, `testy` stand for your own splits, and the model must expose `predict_proba`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

model.fit(trainX, trainy)
# probability of the positive class on the held-out data
preds = model.predict_proba(testX)[:, 1]

# calculate the precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, preds)

# convert each precision/recall pair to an F1 score (guard against 0/0)
fscore = (2 * precision * recall) / np.maximum(precision + recall, 1e-12)

# locate the index of the largest F1 score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
```

(source for the code)

The PRC would look like this: [precision-recall curve plot]

You can alternatively follow the equivalent approach for ROC curves.
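For example, here is a minimal sketch of the ROC-based equivalent, using Youden's J statistic (TPR − FPR) to pick the threshold; `model`, `testX` and `testy` are the same placeholders as in the snippet above:

```python
import numpy as np
from sklearn.metrics import roc_curve

# probabilities for the positive class on the held-out data
probs = model.predict_proba(testX)[:, 1]
fpr, tpr, thresholds = roc_curve(testy, probs)

# Youden's J statistic: maximise TPR - FPR across all thresholds
ix = np.argmax(tpr - fpr)
print("Best Threshold=%.3f, J=%.3f" % (thresholds[ix], (tpr - fpr)[ix]))
```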

  • Thank you for your reply. But let me ask another question: would the ideal threshold from the precision-recall curve of the validation set (i.e., when the data is split into train and validation sets) be the ideal threshold on unseen data too? Or should I also cross-validate the train set using stratified folds and obtain corresponding thresholds from the precision-recall curves for each fold? – Commented Jun 16, 2021 at 16:06
  • You can calculate a PRC and the respective best threshold on your test set. But if your question is "how do I get the best performance in relation to precision-recall", you should use either F1 or average precision score for scoring during hyperparameter optimisation (a sketch of this follows below). – Commented Jun 17, 2021 at 10:59
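For reference, a minimal sketch of what the last comment suggests: tuning the SVC with an F1 (or average precision) scorer via scikit-learn's GridSearchCV. The parameter grid and the placeholder names `X_train`, `y_train` are illustrative only:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# example grid; tune it for your own data
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}

search = GridSearchCV(
    SVC(),                     # SVC exposes decision_function, which both scorers can use
    param_grid,
    scoring="f1",              # or "average_precision" for area under the PRC
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```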
