Some observations, now I've had time to think about what I did.
While I want to evaluate the model itself, rather than the decisions it will be used for, at the time of decision-making there will be a chosen threshold. Calculating expected confusion counts in the manner described above is equivalent to placing a uniform belief distribution over $[0, 1]$ on the decision threshold that will be used.
One issue with using the mean confusion values to calculate the other metrics is that some of them become biased. Mean recall/sensitivity is calculated correctly, since its denominator is threshold-independent: $$\mathbb{E}(\textrm{sens}) = \mathbb{E}\left(\frac{TP}{TP + FN}\right) = \frac{\mathbb{E}(TP)}{\textrm{# actual positives}},$$ where the expectation ($\mathbb{E}$) is taken over the threshold value. The mean specificity is likewise calculated correctly, since its denominator, the number of actual negatives, is also fixed. However, for precision/PPV and NPV, the denominator depends on the threshold, so using the mean confusion values introduces a bias: $$\mathbb{E}(\textrm{prec}) = \mathbb{E}\left(\frac{TP}{TP + FP}\right) = \frac{\mathbb{E}(TP)}{\mathbb{E}(TP) + \mathbb{E}(FP)} + \mathcal{O}(\textrm{Cov}(TP, FP) + \textrm{Var}(FP)) \, \textrm{as # observations } \uparrow \infty.$$ At the moment, I'm not sure how I would easily calculate the correct mean values for these.
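To make the bias concrete, here is a small numerical sketch (the synthetic data and variable names are my own, purely illustrative): averaging recall over a grid of thresholds matches the recall computed from mean confusion counts, while the two versions of precision disagree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: labels plus noisy scores from a passable model.
y = rng.integers(0, 2, size=1000)
scores = np.clip(y * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)

# A fine grid approximating a uniform belief over thresholds in [0, 1].
thresholds = np.linspace(0, 1, 2001)

# Confusion counts at each threshold; rows index thresholds.
pred = scores[None, :] > thresholds[:, None]
tp = (pred & (y == 1)).sum(axis=1).astype(float)
fp = (pred & (y == 0)).sum(axis=1).astype(float)
fn = ((~pred) & (y == 1)).sum(axis=1).astype(float)

# Recall: mean of the ratio equals ratio of the means, since TP + FN is
# the fixed number of actual positives at every threshold.
mean_of_recall = np.mean(tp / (tp + fn))
recall_of_means = tp.mean() / (tp.mean() + fn.mean())

# Precision: the two disagree, because TP + FP varies with the threshold.
with np.errstate(invalid="ignore"):
    prec = tp / (tp + fp)          # NaN where nothing is predicted positive
mean_of_precision = np.nanmean(prec)
precision_of_means = tp.mean() / (tp.mean() + fp.mean())

print(mean_of_recall, recall_of_means)        # essentially equal
print(mean_of_precision, precision_of_means)  # noticeably different
```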
Setting aside those problems, here are some further comments on evaluating a model by applying a belief distribution to the thresholds that will be used for decision-making. This is not as unusual as it might seem: proper scoring rules can often be described in the same way.
For example, we can follow Rosen [1] or Merkle and Steyvers [2] in considering a classification loss function whose cost depends on the threshold value rather than the forecast value. The Brier score is then equivalent to using this loss under a uniform distribution for the threshold; the log score corresponds to a Haldane prior.
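As a sanity check on that construction (my own notation and code, not Rosen's): take the threshold-dependent loss that classifies positive iff $f > c$, charging $c$ for a false positive and $1 - c$ for a false negative; averaging it over a uniform threshold $c$ recovers half the Brier score $(f - y)^2$.

```python
import numpy as np

def threshold_loss(f, y, c):
    """Cost-weighted misclassification loss at threshold c:
    classify positive iff f > c; a false positive costs c,
    a false negative costs 1 - c."""
    pred = f > c
    fp = pred & (y == 0)      # predicted positive, actually negative
    fn = (~pred) & (y == 1)   # predicted negative, actually positive
    return np.where(fp, c, 0.0) + np.where(fn, 1.0 - c, 0.0)

# A fine grid standing in for a uniform belief over thresholds in [0, 1].
c_grid = np.linspace(0.0, 1.0, 100_001)

for f in (0.1, 0.4, 0.9):
    for y in (0, 1):
        mean_loss = threshold_loss(f, y, c_grid).mean()
        print(f"f={f}, y={y}: 2 * mean loss = {2 * mean_loss:.4f}, "
              f"Brier = {(f - y) ** 2:.4f}")
```

The factor of two is just normalisation; rescaling a loss doesn't change which forecasts it rewards.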
By comparison, let's look at the mean accuracy. This also uses a uniform distribution for the threshold, like the Brier score, but the loss function is based on the forecast value instead. Indeed, the mean accuracy turns out to be equivalent to the "linear score", with loss function $S(f, y) = f(1-y) + (1-f)y$ for forecast $f$ and observation $y$, the classical example of a reasonable-looking scoring rule which turns out to be improper.
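A matching sketch for this equivalence (again my own illustrative code): the 0/1 misclassification loss averaged over a uniform threshold reproduces the linear score, and the improperness shows up as the expected score being minimised at a hard 0 or 1 rather than at the true probability.

```python
import numpy as np

c_grid = np.linspace(0.0, 1.0, 100_001)

def mean_zero_one_loss(f, y):
    """0/1 misclassification loss (classify positive iff f > c),
    averaged over a uniform threshold c in [0, 1]."""
    pred = f > c_grid
    return np.mean(pred != y)

def linear_score(f, y):
    """The improper 'linear score' S(f, y) = f(1-y) + (1-f)y."""
    return f * (1 - y) + (1 - f) * y

# The threshold-averaged 0/1 loss agrees with the linear score.
for f in (0.2, 0.5, 0.8):
    for y in (0, 1):
        assert abs(mean_zero_one_loss(f, y) - linear_score(f, y)) < 1e-3

# Improperness: if the true probability of y = 1 is p = 0.7, the expected
# linear score p*S(f, 1) + (1-p)*S(f, 0) is minimised at f = 1, not f = p.
p = 0.7
f_grid = np.linspace(0.0, 1.0, 101)
expected = p * linear_score(f_grid, 1) + (1 - p) * linear_score(f_grid, 0)
print(f_grid[np.argmin(expected)])  # 1.0, not 0.7
```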
We can, therefore, say that accuracy is to the linear score as misclassification loss is to the Brier score.
[1] Rosen, D.B. (1996). How Good Were Those Probability Predictions? The Expected Recommendation Loss (ERL) Scoring Rule. In: Heidbreder, G.R. (ed.) Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics, vol. 62. Springer, Dordrecht. https://doi.org/10.1007/978-94-015-8729-7_33
[2] Merkle, E.C., Steyvers, M. (2013). Choosing a Strictly Proper Scoring Rule. Decision Analysis 10(4):292-304. https://doi.org/10.1287/deca.2013.0280