This technique adjusts the probability output by your model to match more closely the empirical frequency of the label occurring at that predicted probability. It adjusts the probability, not the class prediction itself.
Mathematically, a calibrated model is one such that $P(Y = 1 \mid \hat{p}(X) = p) = p$ for all $p \in [0, 1]$.
Model Calibration
One of the most important tests of a forecast—I would argue that it is the single most important one—is called calibration.
Nate Silver, The Signal and the Noise
Why calibrate models?
Fundamentally, we want to calibrate models to make their probabilities interpretable, i.e., grounded in the true frequency of occurrence of an event.
Doing so unlocks a few use cases:
- Probabilities can be used as confidence scores. This allows for user interpretability of the scores, as well as for model evaluation.
- It unlocks business use cases. For example, in ad click prediction, we can decide whether to serve a particular ad based on whether its expected revenue exceeds its cost.
- It keeps model changes modular, particularly in ML systems where one model's prediction is a feature for a downstream model. In such cases, we want a given model's prediction to reflect the true frequency of occurrence of an event.
Theory
A model can be calibrated but still useless.
For example, consider an email spam prediction problem where 10% of all emails are spam. A model that predicts 0.1 for every email is calibrated, but its performance is dismal. Rather, a model that conditions its predictions on informative covariates can be calibrated and also useful.
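As a minimal sketch of this point (toy data generated here, not from the original), a constant predictor at the base rate is calibrated but useless for ranking:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy dataset: 10% of emails are spam.
y = (rng.random(10_000) < 0.10).astype(int)

# A model that predicts 0.1 for every email.
p_constant = np.full(y.shape, 0.1)

# Calibrated: among emails scored 0.1 (i.e., all of them), roughly 10% are spam.
print(y.mean())                      # ~0.10, matching the constant prediction
# Useless: it cannot rank spam above non-spam, so AUC is 0.5.
print(roc_auc_score(y, p_constant))  # 0.5
```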
Mathematical derivation and interpretation
An expected quadratic loss function can be decomposed as follows:[^1]

$$
\mathbb{E}\big[(Y - \hat{p}(X))^2\big] = \mathbb{E}\big[\big(\hat{p}(X) - \mathbb{E}[Y \mid \hat{p}(X)]\big)^2\big] - \mathbb{E}\big[\big(\mathbb{E}[Y \mid \hat{p}(X)] - \mathbb{E}[Y]\big)^2\big] + \mathrm{Var}(Y)
$$
The decomposed terms represent the following:
| Terms | Description |
|---|---|
| $\mathbb{E}\big[\big(\hat{p}(X) - \mathbb{E}[Y \mid \hat{p}(X)]\big)^2\big]$ | Calibration. This is 0 for a perfectly calibrated model. |
| $\mathbb{E}\big[\big(\mathbb{E}[Y \mid \hat{p}(X)] - \mathbb{E}[Y]\big)^2\big]$ | Sharpness. The further away predictions are from the global average, the lower the overall loss. In some ways, this can be thought of as the variance term. |
| $\mathrm{Var}(Y)$ | Irreducible loss due to uncertainty. |
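As a rough sketch (the function name and the equal-width binning scheme are my own, not from the original), these terms can be estimated empirically by grouping predictions into bins and comparing each bin's average prediction with its empirical label frequency:

```python
import numpy as np

def quadratic_loss_decomposition(y_true, y_prob, n_bins=10):
    """Estimate the calibration, sharpness, and uncertainty terms by binning predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)

    y_bar = y_true.mean()                    # global average, E[Y]
    calibration = sharpness = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                 # fraction of samples in this bin
        avg_pred = y_prob[mask].mean()       # average predicted probability
        emp_freq = y_true[mask].mean()       # empirical frequency of the label
        calibration += weight * (avg_pred - emp_freq) ** 2
        sharpness += weight * (emp_freq - y_bar) ** 2

    uncertainty = y_bar * (1 - y_bar)        # Var(Y) for a binary label
    return calibration, sharpness, uncertainty
```

Under this decomposition, the expected quadratic loss is approximately `calibration - sharpness + uncertainty`.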
Calibration vs Sharpness
| Scenario | Description | Relation between calibration vs sharpness | Effect on quadratic loss |
|---|---|---|---|
| High variance model | Predictions are very far from the global average | Tradeoff. Calibration reduces the sharpness by pulling the scores closer towards the global average | May increase or decrease the quadratic loss |
| High bias model | Predictions are close to the global average | Decoupled. You can increase sharpness while maintaining calibration | Loss decreases |
Effect on metrics
Calibration does not affect ranking metrics like AUC, since these depend only on the ordering of predictions rather than the probability values themselves. Threshold-based metrics like accuracy can shift, however, because remapping the scores may move some of them across a fixed decision threshold.
Measuring calibration
To know whether a model is calibrated, you can simply plot a calibration curve.[^2]
Logistic regression models tend to be well calibrated when the classes are balanced, because their log-loss objective directly rewards calibrated probabilities.
That said, there are many factors[^3] that can cause a model to be miscalibrated, so it's best to assume that all models are miscalibrated to begin with.
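A sketch of how to plot calibration curves with sklearn's `CalibrationDisplay` (the synthetic dataset and the three classifiers are assumptions, chosen to mirror the table below):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(),
    # probability=True enables predict_proba via internal Platt scaling,
    # so this curve may look less under-confident than raw SVC scores.
    "SVC": SVC(probability=True),
    "Naive Bayes": GaussianNB(),
}

fig, ax = plt.subplots()
for name, model in models.items():
    model.fit(X_train, y_train)
    # Plot the fraction of positives against the mean predicted probability per bin.
    CalibrationDisplay.from_estimator(model, X_test, y_test, n_bins=10, name=name, ax=ax)
plt.show()
```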
Types of models
| Type | How to identify? | Model in example above |
|---|---|---|
| Perfectly calibrated | The diagonal line in the chart | Logistic regression |
| Under-confident model | S-shaped / sigmoid-shaped curve | SVC |
| Over-confident model | N-shaped / tangent-shaped curve | Naive Bayes |
How to calibrate a model
Use `CalibratedClassifierCV` in sklearn.
There are three factors to consider when using it:
Ensemble?
If `ensemble=True`, a clone of the estimator is fitted on each CV fold and calibrated on the held-out set. During prediction, the predictions from each fold's estimator are averaged together.
If `ensemble=False`, cross-validation is used to obtain held-out predictions, and a calibrator is trained on them. The result is just one estimator-calibrator pair.
| Factors | `ensemble=True` | `ensemble=False` |
|---|---|---|
| Speed | Slower | Faster |
| Accuracy | Higher | Lower |
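A sketch of both modes (synthetic data; `LinearSVC` is an arbitrary choice of base estimator whose raw scores are not probabilities):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5_000, random_state=0)

# ensemble=True: one (estimator, calibrator) pair per CV fold;
# predict_proba averages the per-fold calibrated predictions.
ensembled = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5, ensemble=True)
ensembled.fit(X, y)

# ensemble=False: out-of-fold predictions train a single calibrator,
# which sits on top of one estimator refit on all of the data.
single = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5, ensemble=False)
single.fit(X, y)

print(ensembled.predict_proba(X[:3]))
print(single.predict_proba(X[:3]))
```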
`cv=prefit`?
If `cv="prefit"`, no fitting is done; only calibration. In such a case, one needs to supply a held-out set for calibration.
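A sketch of this pattern (the data split and base estimator are assumptions; note that recent sklearn versions deprecate `cv="prefit"` in favor of wrapping the fitted model in `FrozenEstimator`, so check the docs for your version):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

# Fit the base model on the training split only.
base = LinearSVC().fit(X_train, y_train)

# Calibrate on a separate held-out split; the base model is not refit.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
calibrated.fit(X_calib, y_calib)
```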
Calibration method
Two options are available:
| Factors | `method="sigmoid"` | `method="isotonic"` |
|---|---|---|
| Parametricity | Parametric | Non-parametric |
| Dataset size | Small | Large (~1000s) |
| Limitations | Only works on under-confident models; requires symmetrical calibration errors; doesn't work on highly imbalanced datasets | Tends to overfit on smaller datasets |
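A sketch comparing the two methods empirically via the Brier score (i.e., the mean quadratic loss from earlier); the dataset and base estimator are assumptions:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=20_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):
    model = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    # Lower Brier score = better calibrated (and/or sharper) probabilities.
    print(f"{method}: {brier_score_loss(y_test, prob):.4f}")
```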
Comparison against threshold calibration
In threshold calibration, we adjust the threshold to achieve a certain precision-recall tradeoff.
This affects class predictions, but likely isn’t helpful if we care about probability interpretability.
Rescaling the model's scores after threshold calibration so that p > 0.5 corresponds to the positive label may further miscalibrate the probabilities.