ece() returns the empirical weighted average gap between mean confidence
and empirical event frequency across equal-width probability bins. It is zero
when confidence and accuracy match in every non-empty bin of the chosen
partition.
Usage
ece(p, y, bins = 10, type = c("classwise", "confidence"))
Arguments
- p
Predicted probabilities. A numeric vector in [0, 1] for binary problems, or a numeric matrix with one column per class for multiclass problems. Matrix inputs must have finite entries in [0, 1], at least two columns, and rows summing to one within absolute tolerance 1e-6.
- y
Outcome labels. A vector coded as 0 and 1 for binary problems, or a factor or vector of integer class codes in 1:K for multiclass problems.
- bins
Number of equal-width bins on [0, 1]. Must be a single positive integer.
- type
Multiclass aggregation, either "classwise" or "confidence". Ignored for binary inputs.
Details
For binary problems p is a probability vector. For multiclass problems p
is a probability matrix with one column per class and type selects the
multiclass definition. The "classwise" form averages the binary ECE over
the one-vs-rest columns, a quantity also known as the static calibration error. The
"confidence" form applies the binary ECE to the top-label confidence and
whether the predicted class is correct, which is the definition used by Guo
et al. (2017).
For binary calibration, the interval [0, 1] is split into \(B\)
equal-width bins. The package uses left-closed bins,
\(I_b = \{i: (b - 1)/B \le p_i < b/B\}\)
for \(b < B\), and
\(I_B = \{i: (B - 1)/B \le p_i \le 1\}\)
for the last bin. Let \(n_b = |I_b|\) and
\(n = \sum_b n_b\). For each non-empty bin,
$$\operatorname{conf}(b) = \frac{1}{n_b}\sum_{i \in I_b} p_i,$$
and
$$\operatorname{acc}(b) = \frac{1}{n_b}\sum_{i \in I_b} y_i.$$
The returned empirical ECE is
$$\operatorname{ECE} = \sum_{b: n_b > 0} \frac{n_b}{n} |\operatorname{acc}(b) - \operatorname{conf}(b)|.$$
Empty bins have zero weight. The estimate depends on bins; changing the
number of bins changes the empirical partition and can change the value. A
value of zero means equality of sample bin means for this partition, not full
population calibration.
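The binary definition above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: `ece_binary_sketch` is a hypothetical helper name, and the bin index is computed as `floor(p * bins) + 1`, which can differ from the exact inequalities at bin edges because of floating-point rounding.

```r
# Hypothetical helper for illustration; not the package's internal code.
ece_binary_sketch <- function(p, y, bins = 10) {
  stopifnot(length(p) == length(y), all(p >= 0 & p <= 1))
  # Left-closed bin index in 1..bins; p == 1 is folded into the last bin.
  b <- pmin(floor(p * bins) + 1, bins)
  n_b  <- tapply(p, b, length)  # counts; only non-empty bins appear
  conf <- tapply(p, b, mean)    # mean predicted probability per bin
  acc  <- tapply(y, b, mean)    # empirical event frequency per bin
  sum(n_b / length(p) * abs(acc - conf))
}

p <- c(0.10, 0.20, 0.80, 0.90)
y <- c(0, 0, 1, 1)
ece_binary_sketch(p, y, bins = 2)
#> [1] 0.15

# With a single bin the overall means coincide (both 0.5), so the value is
# zero up to floating-point rounding, although the two-bin value is not.
ece_binary_sketch(p, y, bins = 1)
```

The second call illustrates the dependence on bins: the same data give zero on a one-bin partition and 0.15 on a two-bin partition.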
For a probability matrix, type = "classwise" computes the binary ECE for
each one-vs-rest column \(p_{\cdot k}\) against
\(\mathbf{1}\{y_i = k\}\) and returns their
arithmetic mean,
$$\operatorname{ECE}_{\mathrm{cw}} = \frac{1}{K}\sum_{k = 1}^K \operatorname{ECE}(p_{\cdot k}, \mathbf{1}\{y_i = k\}).$$
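As a rough sketch of the classwise average, the binary computation can be applied to each column against its one-vs-rest indicator. The helper names `ece_binary_sketch` and `ece_classwise_sketch` are hypothetical, and the code assumes labels are integer codes in 1:K or a factor whose level order matches the columns; it is not the package's own implementation.

```r
# Hypothetical helpers for illustration; not the package's internal code.
ece_binary_sketch <- function(p, y, bins = 10) {
  # Left-closed bin index in 1..bins; p == 1 is folded into the last bin.
  b <- pmin(floor(p * bins) + 1, bins)
  n_b <- tapply(p, b, length)
  sum(n_b / length(p) * abs(tapply(y, b, mean) - tapply(p, b, mean)))
}

ece_classwise_sketch <- function(p, y, bins = 10) {
  y <- as.integer(y)  # factor -> level codes; integer codes are unchanged
  per_class <- vapply(seq_len(ncol(p)), function(k) {
    # Column k against the indicator 1{y_i = k}.
    ece_binary_sketch(p[, k], as.numeric(y == k), bins)
  }, numeric(1))
  mean(per_class)  # arithmetic mean over the K columns
}

prob <- rbind(c(0.8, 0.2),
              c(0.2, 0.8))
ece_classwise_sketch(prob, c(1, 2), bins = 2)
#> [1] 0.2
```

In this two-class example each column has one observation per bin with a gap of 0.2, so both per-column values and their mean equal 0.2.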
type = "confidence" uses the top-label rule
\(\hat y_i = \min\{k: p_{ik} = \max_\ell p_{i\ell}\}\),
the confidence \(r_i = p_{i\hat y_i}\), and the
correctness indicator
\(c_i = \mathbf{1}\{\hat y_i = y_i\}\), then
applies the binary definition to \((r_i, c_i)\):
\(\operatorname{ECE}_{\mathrm{conf}} = \operatorname{ECE}(r, c)\).
For matrix inputs, column \(k\) corresponds to integer class code \(k\);
if y is a factor, column \(k\) corresponds to levels(y)[k].
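The top-label form, together with the column-to-label mapping, might be sketched as follows. The helper names are hypothetical and the code is only an illustration under the stated definitions, not the package's implementation. It uses `which.max()`, which returns the first index of the maximum and so matches the smallest-index tie rule for exact ties, and `as.integer()` to map a factor to its level codes.

```r
# Hypothetical helpers for illustration; not the package's internal code.
ece_binary_sketch <- function(p, y, bins = 10) {
  # Left-closed bin index in 1..bins; p == 1 is folded into the last bin.
  b <- pmin(floor(p * bins) + 1, bins)
  n_b <- tapply(p, b, length)
  sum(n_b / length(p) * abs(tapply(y, b, mean) - tapply(p, b, mean)))
}

ece_confidence_sketch <- function(p, y, bins = 10) {
  y <- as.integer(y)                      # factor level k -> column k
  y_hat <- unname(apply(p, 1, which.max)) # smallest index among exact ties
  r <- p[cbind(seq_len(nrow(p)), y_hat)]  # top-label confidence r_i
  correct <- as.numeric(y_hat == y)       # correctness indicator c_i
  ece_binary_sketch(r, correct, bins)
}

prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.6, 0.3, 0.1),
              c(0.2, 0.7, 0.1),
              c(0.1, 0.3, 0.6))
labels <- factor(c("a", "b", "b", "c"), levels = c("a", "b", "c"))
ece_confidence_sketch(prob, labels, bins = 2)
#> [1] 0.1
```

Here all four confidences fall in the upper bin, with mean confidence 0.65 and accuracy 0.75, which gives a gap of 0.1.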
Here "calibrated" refers to the output of a fitted calibration map. It does not imply population calibration. Binary population calibration can be stated as \(E[Y \mid Q] = Q\) for the predicted probability random variable \(Q\). For top-label confidence \(R\), the analogous condition is \(E[\mathbf{1}\{\hat Y = Y\} \mid R] = R\).
References
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. Proceedings of the 34th International Conference on Machine Learning.
Examples
predictions <- data.frame(
  p = c(0.10, 0.20, 0.80, 0.90),
  y = c(0, 0, 1, 1)
)
predictions |>
  dplyr::summarise(ece = ece(p, y, bins = 2))
#>    ece
#> 1 0.15
# Multiclass classwise ECE from a probability matrix.
set.seed(30)
prob <- matrix(stats::runif(150 * 3), ncol = 3)
prob <- prob / rowSums(prob)
labels <- max.col(prob)
ece(prob, labels, bins = 10, type = "classwise")
#> [1] 0.2264214
