cal_vector_scaling() is the multiclass generalization of temperature
scaling that gives each class its own scale and bias. It rescales a logit
matrix column by column, adds a per-class bias, and applies the softmax.
With a single shared scale s = 1/T and zero bias it reduces to temperature
scaling with temperature T, so it is more flexible while adding only 2K
parameters for K classes.
Value
A cal_vector_scaling object that also inherits from
cal_multiclass. Use predict() with new logits to obtain calibrated
probabilities.
Details
The calibrated probabilities are softmax(s * logits + b), where s is a
length K vector of per-class scales applied column by column and b is a
length K vector of per-class biases. Parameters are estimated by minimizing
the average multiclass negative log-likelihood.
Let \(z_{ik}\) be the uncalibrated logit for observation \(i\) and class \(k\). Vector scaling estimates class-specific scales \(s_k\) and intercepts \(b_k\), then forms calibrated logits
$$\eta_{ik} = s_k z_{ik} + b_k.$$
The predicted probabilities are obtained with the softmax,
$$q_{ik} = \frac{\exp(\eta_{ik})} {\sum_{\ell = 1}^K \exp(\eta_{i\ell})}.$$
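The two formulas above can be sketched in a few lines of base R. The helper name vs_probs() and the parameter values are hypothetical and chosen only for illustration; this is not the package's internal code. Subtracting each row's maximum before exponentiating is a common safeguard against overflow and leaves the softmax unchanged.

```r
# Hypothetical helper, not part of the package: row-wise softmax of s_k * z_ik + b_k.
vs_probs <- function(logits, scale, bias) {
  # Multiply column k by scale[k], then add bias[k] to column k.
  eta <- sweep(logits, 2, scale, `*`)
  eta <- sweep(eta, 2, bias, `+`)
  # Subtract each row's maximum before exponentiating to avoid overflow.
  eta <- eta - apply(eta, 1, max)
  expd <- exp(eta)
  expd / rowSums(expd)
}

set.seed(1)
z <- matrix(rnorm(5 * 3), ncol = 3)
q <- vs_probs(z, scale = c(1.5, 0.8, 1.0), bias = c(0.2, -0.1, 0.0))
rowSums(q)  # each row sums to 1
```

Note that sweep() is used rather than s * logits: in R, multiplying a matrix by a length K vector recycles down the rows of each column, which would not apply the scales column by column.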
Parameters are estimated by minimizing
$$L(s, b) = -\frac{1}{n}\sum_{i = 1}^n \log q_{i y_i}.$$
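Differentiating this objective gives the gradient that a gradient-based optimizer needs. The expressions below are the standard softmax cross-entropy derivatives written in the notation of this section, with \(\mathbf{1}[y_i = k]\) equal to 1 when observation \(i\) belongs to class \(k\) and 0 otherwise. They are derived from the formulas above and are not transcribed from the package source.
$$\frac{\partial L}{\partial \eta_{ik}} = \frac{1}{n}\left(q_{ik} - \mathbf{1}[y_i = k]\right),$$
$$\frac{\partial L}{\partial s_k} = \frac{1}{n}\sum_{i = 1}^n \left(q_{ik} - \mathbf{1}[y_i = k]\right) z_{ik}, \qquad \frac{\partial L}{\partial b_k} = \frac{1}{n}\sum_{i = 1}^n \left(q_{ik} - \mathbf{1}[y_i = k]\right).$$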
For multiclass labels, column \(k\) of logits corresponds to class code
\(k\); if y is a factor, column \(k\) corresponds to levels(y)[k].
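As a small illustration of this convention, the snippet below shows how base R orders factor levels and which column each label would therefore point to. The labels are made up, and the sketch only assumes that the function follows the mapping stated above.

```r
y <- factor(c("cat", "dog", "bird", "dog"))
levels(y)      # "bird" "cat" "dog": character levels are sorted by default
as.integer(y)  # 2 3 1 3: the column of logits each label points to
```

Because default levels are sorted rather than kept in order of appearance, it is worth checking that the columns of logits are arranged in the order of levels(y) before fitting.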
The implementation uses stats::optim() with method "BFGS", analytic
gradients, initial scales \(s_k = 1\), initial biases
\(b_k = 0\), and maxit = 500. True-class probabilities entering
logarithms are clipped to [1e-15, 1 - 1e-15]. The returned object stores
scale, bias, the optimized average negative log-likelihood value, and
the optimizer convergence code.
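The fitting procedure described above can be approximated with the following self-contained sketch. It is a hypothetical re-implementation for illustration (the function name vs_fit_sketch() and the simulated data are assumptions), not the package source, and details such as argument checking are omitted.

```r
# Hypothetical re-implementation for illustration; not the package source.
vs_fit_sketch <- function(logits, y, maxit = 500) {
  y <- as.integer(y)  # class codes 1..K (factor level positions if y is a factor)
  n <- nrow(logits)
  K <- ncol(logits)
  onehot <- matrix(0, n, K)
  onehot[cbind(seq_len(n), y)] <- 1

  # par holds the K scales followed by the K biases.
  probs <- function(par) {
    eta <- sweep(sweep(logits, 2, par[1:K], `*`), 2, par[K + 1:K], `+`)
    eta <- eta - apply(eta, 1, max)  # stabilize the softmax
    e <- exp(eta)
    e / rowSums(e)
  }
  nll <- function(par) {
    p_true <- probs(par)[cbind(seq_len(n), y)]
    p_true <- pmin(pmax(p_true, 1e-15), 1 - 1e-15)  # clip before taking logs
    -mean(log(p_true))
  }
  grad <- function(par) {
    resid <- (probs(par) - onehot) / n
    c(colSums(resid * logits), colSums(resid))  # d/ds_k, then d/db_k
  }

  opt <- stats::optim(
    par = c(rep(1, K), rep(0, K)),  # initial s_k = 1, b_k = 0
    fn = nll, gr = grad, method = "BFGS",
    control = list(maxit = maxit)
  )
  list(scale = opt$par[1:K], bias = opt$par[K + 1:K],
       value = opt$value, convergence = opt$convergence)
}

# Simulated, non-separable data: labels are drawn from softmax(2 * z),
# so the raw logits z are underconfident by a factor of about 2.
set.seed(22)
z <- matrix(rnorm(300 * 3), ncol = 3)
p <- exp(2 * z)
p <- p / rowSums(p)
y <- apply(p, 1, function(pr) sample.int(3, 1, prob = pr))

fit <- vs_fit_sketch(z, y)
fit$scale        # should be roughly 2 per class, up to sampling noise
fit$convergence  # 0 means optim() reported successful convergence
```

A convergence code of 1 from optim() means the iteration limit was reached, in which case the returned parameters should be treated with caution.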
The scales are unconstrained during optimization, so a negative scale is possible when it improves the likelihood on the calibration data. Unlike temperature scaling, which divides every logit by the same positive temperature and therefore preserves the ranking of classes, vector scaling can change the predicted class because scales and biases vary by class. As with any softmax model, adding the same constant to every class bias does not change the resulting probability vector, so the fitted bias vector is identifiable only up to a common additive constant.
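Both properties can be checked numerically. The sketch below uses a hypothetical helper softmax_rows() and arbitrary parameter values, so it illustrates the algebra rather than the behavior of any fitted object.

```r
# Hypothetical helper: numerically stable row-wise softmax.
softmax_rows <- function(eta) {
  e <- exp(eta - apply(eta, 1, max))
  e / rowSums(e)
}

set.seed(3)
z <- matrix(rnorm(50 * 3), ncol = 3)
s <- c(1.4, 0.6, 1.0)   # arbitrary class-specific scales
b <- c(0.5, -0.2, 0.1)  # arbitrary class-specific biases

scaled <- sweep(z, 2, s, `*`)
q1 <- softmax_rows(sweep(scaled, 2, b, `+`))
q2 <- softmax_rows(sweep(scaled, 2, b + 10, `+`))
all.equal(q1, q2)  # TRUE: a common shift in the biases cancels in the softmax

# One shared positive scale with zero bias (temperature T = 1 / 0.5 = 2)
# keeps the argmax of every row; class-specific parameters need not.
q_temp <- softmax_rows(0.5 * z)
all(max.col(q_temp, ties.method = "first") == max.col(z, ties.method = "first"))  # TRUE
mean(max.col(q1, ties.method = "first") != max.col(z, ties.method = "first"))     # share of rows whose predicted class changed
```

When comparing bias vectors across fits, one option is to center them first, for example bias - mean(bias), which removes the arbitrary common constant.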
References
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.
Examples
set.seed(22)
logits <- matrix(rnorm(200 * 3), ncol = 3)
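# Labels are the argmax of each row, so the classes are perfectly separable
# and the fitted probabilities below are pushed toward 0 and 1.
labels <- max.col(logits)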
fit <- cal_vector_scaling(logits, labels)
head(predict(fit, logits))
#>                 1             2            3
#> [1,] 9.344745e-29  1.000000e+00 3.648798e-26
#> [2,] 1.000000e+00 2.108993e-107 1.224984e-16
#> [3,] 1.000000e+00  6.005877e-82 2.511543e-08
#> [4,] 9.081856e-01  3.736703e-25 9.181436e-02
#> [5,] 5.769597e-14  3.027013e-48 1.000000e+00
#> [6,] 1.000000e+00  2.170890e-77 2.723746e-34
