Expected Calibration Error — ece • probcal

ece() returns the empirical calibration error: a weighted average of the gap between mean confidence and empirical event frequency across probability bins (equal-width by default). It is zero when confidence and accuracy match in every non-empty bin of the chosen partition.

Usage

ece(
  p,
  y,
  bins = 10,
  type = c("classwise", "confidence"),
  debiased = FALSE,
  strategy = c("width", "mass"),
  norm = c("l1", "l2")
)

Arguments

p: Predicted probabilities. A numeric vector in [0, 1] for binary problems, or a numeric matrix with one column per class for multiclass problems. Matrix inputs must have finite entries in [0, 1], at least two columns, and rows summing to one within absolute tolerance 1e-6.
y: Outcome labels. A vector coded as 0 and 1 for binary problems, or a factor or vector of integer class codes in 1:K for multiclass problems.
bins: Number of bins on [0, 1]. Must be a single positive integer.
type: Multiclass aggregation, either "classwise" (the default) or "confidence". Ignored for binary inputs.
debiased: Logical. If FALSE (the default) the plug-in estimate is returned. If TRUE the square root of the debiased squared calibration error estimator of Kumar, Liang and Ma (2019) is returned; this requires norm = "l2".
strategy: Binning strategy, either "width" for equal-width bins (the default, reproducing the historical behaviour) or "mass" for equal-frequency bins.
norm: Scale of the calibration error: "l1" (the default) for the weighted mean absolute bin gap, or "l2" for the (root) weighted mean squared bin gap. The "l1" and "l2" values are on different scales and should not be compared directly. The norm and debiased choices are not independent: debiased = TRUE is only defined for norm = "l2".
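To illustrate how the two strategy values differ, here is a minimal sketch; it is not the package's implementation. The helpers bin_width_sketch() and bin_mass_sketch() are hypothetical names, and the rank-based rule for equal-frequency bins is only one common choice. The package may place bin edges or handle ties differently.

```r
# Hypothetical helpers for illustration only; not part of the package.

# Equal-width: left-closed bins on [0, 1], with p == 1 in the last bin.
bin_width_sketch <- function(p, bins = 10) {
  findInterval(p, seq(0, 1, length.out = bins + 1), rightmost.closed = TRUE)
}

# Equal-frequency (assumed rule): rank the predictions and cut the ranks
# into `bins` groups of nearly equal size.
bin_mass_sketch <- function(p, bins = 10) {
  ceiling(rank(p, ties.method = "first") * bins / length(p))
}

p <- c(0.05, 0.10, 0.15, 0.20, 0.60, 0.95)
table(bin_width_sketch(p, bins = 3))  # counts 4, 1, 1
table(bin_mass_sketch(p, bins = 3))   # counts 2, 2, 2
```

With skewed predictions, equal-width bins can leave most observations in a few bins, while equal-frequency bins keep the per-bin sample sizes balanced.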

Value

A single numeric value.

Details

For binary problems p is a probability vector. For multiclass problems p is a probability matrix with one column per class, and type selects the multiclass definition. The "classwise" form, also known as the static calibration error, averages the binary ECE over the one-vs-rest columns. The "confidence" form applies the binary ECE to the top-label confidence and to the indicator of whether the predicted class is correct, which is the definition used by Guo et al. (2017).

For binary problems with strategy = "width", the interval [0, 1] is split into $B$ equal-width bins, where $B$ is the value of bins. The package uses left-closed bins, $I_b = \{i: (b - 1)/B \le p_i < b/B\}$ for $b < B$, and $I_B = \{i: (B - 1)/B \le p_i \le 1\}$ for the last bin. Let $n_b = |I_b|$ and $n = \sum_b n_b$. For each non-empty bin,

$$\operatorname{conf}(b) = \frac{1}{n_b}\sum_{i \in I_b} p_i,$$

and

$$\operatorname{acc}(b) = \frac{1}{n_b}\sum_{i \in I_b} y_i.$$

The returned empirical ECE is

$$\operatorname{ECE} = \sum_{b: n_b > 0} \frac{n_b}{n} |\operatorname{acc}(b) - \operatorname{conf}(b)|.$$

Empty bins have zero weight. The estimate depends on bins; changing the number of bins changes the empirical partition and can change the value. A value of zero means equality of sample bin means for this partition, not full population calibration.
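The binary definition above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's internals: ece_binary_sketch() is a hypothetical name, it covers only the plug-in "l1" estimate with equal-width bins, and values lying exactly on a bin edge may be assigned differently by the package because of floating-point rounding.

```r
# Hypothetical helper for illustration only; not the package's internals.
ece_binary_sketch <- function(p, y, bins = 10) {
  stopifnot(length(p) == length(y), all(p >= 0 & p <= 1))
  # Bin index b such that (b - 1)/B <= p < b/B; p == 1 goes to the last bin.
  b <- findInterval(p, seq(0, 1, length.out = bins + 1), rightmost.closed = TRUE)
  conf <- as.vector(tapply(p, b, mean))    # mean confidence per non-empty bin
  acc  <- as.vector(tapply(y, b, mean))    # event frequency per non-empty bin
  w    <- as.vector(table(b)) / length(p)  # weights n_b / n
  sum(w * abs(acc - conf))
}

p <- c(0.10, 0.20, 0.80, 0.90)
y <- c(0, 0, 1, 1)
ece_binary_sketch(p, y, bins = 2)  # gap of 0.15 in both bins, so about 0.15
ece_binary_sketch(p, y, bins = 1)  # one bin: mean(p) equals mean(y), so about 0
```

The two calls use the same data and differ only in bins, which shows how a coarser partition can hide the gaps that a finer one reveals.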

For a probability matrix, type = "classwise" computes the binary ECE for each one-vs-rest column $p_{\cdot k}$ against $\mathbf{1}\{y_i = k\}$ and returns their arithmetic mean,

$$\operatorname{ECE}_{\mathrm{cw}} = \frac{1}{K}\sum_{k = 1}^K \operatorname{ECE}(p_{\cdot k}, \mathbf{1}\{y_i = k\}).$$
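As a rough sketch of the classwise average, assuming equal-width bins and the plug-in "l1" gap, the one-vs-rest computation can be written as follows. Both function names are hypothetical and are not part of the package.

```r
# Hypothetical helpers for illustration only; not the package's internals.
ece_binary_sketch <- function(p, y, bins = 10) {
  b <- findInterval(p, seq(0, 1, length.out = bins + 1), rightmost.closed = TRUE)
  gap <- abs(as.vector(tapply(y, b, mean)) - as.vector(tapply(p, b, mean)))
  sum(as.vector(table(b)) / length(p) * gap)
}

ece_classwise_sketch <- function(prob, y, bins = 10) {
  y <- as.integer(y)  # a factor becomes its level index, matching column k
  per_class <- vapply(seq_len(ncol(prob)), function(k) {
    # One-vs-rest: column k against the indicator 1{y == k}.
    ece_binary_sketch(prob[, k], as.numeric(y == k), bins)
  }, numeric(1))
  mean(per_class)     # arithmetic mean over the K classes
}

prob <- rbind(c(0.8, 0.2),
              c(0.3, 0.7))
ece_classwise_sketch(prob, y = c(1, 2), bins = 2)  # both columns give 0.25
```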

type = "confidence" uses the top-label rule $\hat y_i = \min\{k: p_{ik} = \max_\ell p_{i\ell}\}$, the confidence $r_i = p_{i\hat y_i}$, and the correctness indicator $c_i = \mathbf{1}\{\hat y_i = y_i\}$, then applies the binary definition to $(r_i, c_i)$: $\operatorname{ECE}_{\mathrm{conf}} = \operatorname{ECE}(r, c)$. For matrix inputs, column $k$ corresponds to integer class code $k$; if y is a factor, column $k$ corresponds to levels(y)[k].
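The top-label rule can be sketched in the same way, again assuming equal-width bins and the plug-in "l1" gap. The function names are hypothetical. which.max() returns the first index that attains the row maximum, which corresponds to the smallest-index tie rule stated above.

```r
# Hypothetical helpers for illustration only; not the package's internals.
ece_binary_sketch <- function(p, y, bins = 10) {
  b <- findInterval(p, seq(0, 1, length.out = bins + 1), rightmost.closed = TRUE)
  gap <- abs(as.vector(tapply(y, b, mean)) - as.vector(tapply(p, b, mean)))
  sum(as.vector(table(b)) / length(p) * gap)
}

ece_confidence_sketch <- function(prob, y, bins = 10) {
  y_hat <- apply(prob, 1, which.max)               # first column attaining the row max
  r     <- prob[cbind(seq_len(nrow(prob)), y_hat)]  # top-label confidence r_i
  c_i   <- as.numeric(y_hat == as.integer(y))       # correctness indicator c_i
  ece_binary_sketch(r, c_i, bins)
}

prob <- rbind(c(0.8, 0.2),
              c(0.3, 0.7),
              c(0.6, 0.4))
# All confidences (0.8, 0.7, 0.6) fall in the upper bin; two of three are correct,
# so the gap is |2/3 - 0.7|.
ece_confidence_sketch(prob, y = c(1, 2, 2), bins = 2)
```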

When predictions are described as "calibrated", this refers to the output of a fitted calibration map and does not imply population calibration. Binary population calibration can be stated as $E(Y \mid Q) = Q$ for the predicted probability random variable $Q$. For top-label confidence $R$, the analogous condition is $E[\mathbf{1}\{\hat Y = Y\} \mid R] = R$.

References

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.

Kumar, A., Liang, P., & Ma, T. (2019). Verified uncertainty calibration. Advances in Neural Information Processing Systems 32. arXiv:1909.10155.

Roelofs, R., Cain, N., Shlens, J., & Mozer, M. C. (2022). Mitigating bias in calibration error estimation. Proceedings of the 25th International Conference on Artificial Intelligence and Statistics.

Examples

predictions <- data.frame(
  p = c(0.10, 0.20, 0.80, 0.90),
  y = c(0, 0, 1, 1)
)

predictions |>
  dplyr::summarise(ece = ece(p, y, bins = 2))
#>    ece
#> 1 0.15

# Plug-in ECE, then the square root of the debiased squared estimate
# with equal-mass bins.
set.seed(33)
p <- stats::runif(500)
y <- stats::rbinom(500, 1, p)
ece(p, y, bins = 15)
#> [1] 0.05415245
ece(p, y, bins = 15, debiased = TRUE, strategy = "mass", norm = "l2")
#> [1] 0.03714531

# Multiclass classwise ECE from a probability matrix.
set.seed(30)
prob <- matrix(stats::runif(150 * 3), ncol = 3)
prob <- prob / rowSums(prob)
labels <- max.col(prob)
ece(prob, labels, bins = 10, type = "classwise")
#> [1] 0.2264214