I am trying to predict wins/losses of tennis matches by predicting a win probability for each match, and I am currently deciding which evaluation measures to use.
Besides overall evaluation measures like the Brier score, I look at model calibration and discriminative ability separately. I am unsure which metrics are appropriate for assessing model discrimination specifically.
I've read that the AUROC is often used for assessing model discrimination. However, it feels inappropriate in my application, because it doesn't make sense to consider thresholds other than 0.5. Measures like precision/recall/F1-score also seem inappropriate, because my classes are balanced (each match is either a win or a loss, so both occur 50% of the time, of course) and false positives are of similar importance as false negatives.
Therefore, I think simply using prediction accuracy (the fraction of correctly predicted wins/losses) is a good metric for assessing model discrimination. Is my thought process correct? Am I missing something? Are there any drawbacks to using accuracy in this application?
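For concreteness, here is how I compute the two metrics I'm comparing. This is just a toy sketch with made-up arrays (`y_true`, `p_pred` are hypothetical example data, not my actual model output), using plain NumPy:

```python
import numpy as np

# Made-up example data: 1 = win, 0 = loss, and predicted win probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.55, 0.8, 0.1])

# Brier score: mean squared error of the predicted probabilities
# against the 0/1 outcomes (lower is better).
brier = np.mean((p_pred - y_true) ** 2)

# Accuracy: fraction of correct win/loss calls at the 0.5 threshold.
accuracy = np.mean((p_pred > 0.5).astype(int) == y_true)

print(f"Brier score: {brier:.4f}")
print(f"Accuracy:    {accuracy:.4f}")
```

Note that accuracy throws away the probability information (a 0.51 and a 0.99 prediction count the same), which is part of what I'm uncertain about.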