This technique adjusts the predicted probability of your model to match more closely the empirical frequency of the label occurring at that predicted value, $\hat{p}$. It adjusts the probability, not the class prediction itself.

Mathematically, a calibrated model is one such that $P(Y = 1 \mid \hat{p}(X) = p) = p$ for all $p \in [0, 1]$.

Model Calibration

One of the most important tests of a forecast—I would argue that it is the single most important one—is called calibration.

Nate Silver, The Signal and the Noise

Why calibrate models?

Fundamentally, we want to calibrate models so that their probabilities are interpretable, i.e. grounded in the true frequency of occurrence of an event.

Doing so unlocks a few use cases:

  1. Probabilities can be used as confidence scores. This allows for user interpretability of the scores, as well as for model evaluation.
  2. It unlocks business use cases. For example, in ad click prediction, we can decide whether to show a particular ad by checking whether its expected revenue exceeds its cost.
  3. It keeps model changes modular, particularly in ML systems where one model’s prediction is a feature for a downstream model. In such cases, we want a given model’s prediction to reflect the true frequency of occurrence of an event.

Theory

A model can be calibrated but still useless.

For example, consider an email spam prediction problem where 10% of all emails are spam. A model that predicts $\hat{p} = 0.1$ for every email is calibrated, but its performance is dismal. Rather, a model whose predictions are conditioned on covariates of the email (e.g. its sender or contents) can be calibrated and also useful.
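As a quick illustration (toy data of my own, not from the original), a constant predictor at the base rate is calibrated yet has zero ranking power:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Toy spam dataset: 10% of emails are spam.
y = (rng.random(10_000) < 0.10).astype(int)

# A model that predicts 0.1 for every email is calibrated: its predicted
# probability matches the empirical spam rate...
p_constant = np.full(len(y), 0.10)
print("empirical spam rate:", y.mean())
print("constant prediction:", p_constant.mean())

# ...but it is useless for telling spam apart from non-spam: with all scores
# tied, AUC is 0.5, no better than random guessing.
print("AUC:", roc_auc_score(y, p_constant))
```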

Mathematical derivation and interpretation

An expected quadratic loss function can be decomposed as follows:1

$$\mathbb{E}\big[(y - \hat{p})^2\big] = \underbrace{\mathbb{E}\big[(\hat{p} - \mathbb{E}[y \mid \hat{p}])^2\big]}_{\text{calibration}} - \underbrace{\mathbb{E}\big[(\mathbb{E}[y \mid \hat{p}] - \bar{y})^2\big]}_{\text{sharpness}} + \underbrace{\mathrm{Var}(y)}_{\text{irreducible}}$$

where $\hat{p}$ is the predicted probability and $\bar{y} = \mathbb{E}[y]$ is the global average of the label.

The decomposed terms represent the following:

| Term | Description |
| --- | --- |
| $\mathbb{E}\big[(\hat{p} - \mathbb{E}[y \mid \hat{p}])^2\big]$ | Calibration. This is 0 for a perfectly calibrated model. |
| $\mathbb{E}\big[(\mathbb{E}[y \mid \hat{p}] - \bar{y})^2\big]$ | Sharpness. The further away predictions are from the global average, the lower the overall loss. In some ways, this can be thought of as the variance term. |
| $\mathrm{Var}(y)$ | Irreducible loss due to uncertainty. |
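As a rough empirical check on the decomposition (the binning scheme and synthetic data below are my own assumptions, not from the original), the three terms can be estimated by grouping predictions into bins and comparing each bin's mean prediction with its observed label frequency:

```python
import numpy as np

def quadratic_loss_decomposition(y, p, n_bins=10):
    """Estimate the calibration, sharpness and irreducible terms by binning
    predictions (mirroring the classic Brier score decomposition)."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    bin_ids = np.clip((p * n_bins).astype(int), 0, n_bins - 1)

    y_bar = y.mean()
    calibration = sharpness = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        weight = mask.mean()                            # fraction of samples in this bin
        p_bin, y_bin = p[mask].mean(), y[mask].mean()
        calibration += weight * (p_bin - y_bin) ** 2    # E[(p - E[y | p])^2]
        sharpness += weight * (y_bin - y_bar) ** 2      # E[(E[y | p] - y_bar)^2]
    irreducible = y_bar * (1 - y_bar)                   # Var(y) for a binary label
    return calibration, sharpness, irreducible

# Sanity check: calibration - sharpness + irreducible ≈ mean squared error.
rng = np.random.default_rng(0)
p = rng.random(5_000)
y = (rng.random(5_000) < p).astype(float)   # labels drawn from p => well calibrated
cal, sharp, irr = quadratic_loss_decomposition(y, p)
print(cal - sharp + irr, np.mean((p - y) ** 2))
```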

Calibration vs Sharpness

| Scenario | Description | Relation between calibration and sharpness | Effect on quadratic loss |
| --- | --- | --- | --- |
| High variance model | Predictions are very far from the global average | Tradeoff. Calibration reduces sharpness by pulling the scores closer towards the global average. | May increase or decrease the quadratic loss |
| High bias model | Predictions are close to the global average | Decoupled. You can increase sharpness while maintaining calibration. | Loss decreases |

Effect on metrics

Calibration does not affect ranking metrics like AUC, because the calibration mapping is monotonic and such metrics only care about the ordering of the probabilities, not their values. Threshold-based metrics like accuracy can shift, though, since probabilities may move across the decision threshold.
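A quick sanity check (toy scores of my own making): any strictly monotonic transform changes the probability values but not their ordering, so AUC is unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1_000)
scores = 0.5 * rng.random(1_000) + 0.2 * y   # noisy scores correlated with the label

# A strictly monotonic squashing of the scores preserves their ordering,
# so the ranking metric is identical before and after.
squashed = 1 / (1 + np.exp(-10 * (scores - 0.5)))
print(roc_auc_score(y, scores), roc_auc_score(y, squashed))   # same value twice
```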


Measuring calibration

To know whether a model is calibrated, you can simply plot a calibration curve.2

Logistic regression models tend to be well calibrated out of the box, provided the classes are balanced, because their log loss objective directly rewards calibrated probabilities.

That said, there are many factors3 that can make a model miscalibrated, so it’s best to assume that all models are miscalibrated to begin with.
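A sketch of how such a calibration-curve chart might be produced (the dataset, models, and the min-max scaling of the SVC scores are my own choices, not necessarily those behind the original figure):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),            # no predict_proba; use a scaled decision_function
    "Naive Bayes": GaussianNB(),
}

plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
for name, model in models.items():
    model.fit(X_train, y_train)
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        # Min-max scale the SVC decision function into [0, 1] as a crude "probability".
        d = model.decision_function(X_test)
        scores = (d - d.min()) / (d.max() - d.min())
    prob_true, prob_pred = calibration_curve(y_test, scores, n_bins=10)
    plt.plot(prob_pred, prob_true, marker="o", label=name)

plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency of positives")
plt.legend()
plt.show()
```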

Types of models

| Type | How to identify? | Model in example above |
| --- | --- | --- |
| Perfectly calibrated | The diagonal line in the chart | Logistic regression |
| Under-confident model | S-shaped / sigmoid-shaped curve | SVC |
| Over-confident model | N-shaped / tangent-shaped curve | Naive Bayes |

How to calibrate a model

Use CalibratedClassifierCV in sklearn.

There are three factors one needs to consider when using it:

Ensemble?

If ensemble=True, a clone of the estimator is fitted on the training portion of each cv split and calibrated on the corresponding held-out portion. During prediction, the calibrated predictions from each split’s estimator are averaged together.

If ensemble=False, cross-validation is used to obtain held-out predictions for every training sample, and a single calibrator is trained on these. The result is just one estimator-calibrator pair, with the base estimator refit on the full training set.

| Factor | ensemble=True | ensemble=False |
| --- | --- | --- |
| Speed | Slower | Faster |
| Accuracy | Higher | Lower |
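A minimal sketch of both settings (the synthetic data and the LinearSVC base estimator are my own choices):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ensemble=True: one (estimator, calibrator) pair per CV split; predictions
# are averaged over the 5 pairs at inference time.
calibrated_ensemble = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=True)
calibrated_ensemble.fit(X_train, y_train)

# ensemble=False: held-out predictions from cross-validation train a single
# calibrator, paired with one base estimator refit on all of X_train.
calibrated_single = CalibratedClassifierCV(LinearSVC(), cv=5, ensemble=False)
calibrated_single.fit(X_train, y_train)

print(calibrated_ensemble.predict_proba(X_test)[:5, 1])
print(calibrated_single.predict_proba(X_test)[:5, 1])
```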

cv=prefit?

If cv=prefit, no fitting of the base estimator is done, only calibration. In such a case, one needs to supply a held-out set (not seen during training) for calibration.
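A sketch of the prefit workflow, assuming an explicit three-way split (the data and base estimator are my own choices; note that recent scikit-learn releases deprecate cv="prefit" in favour of wrapping the already-fitted model in a FrozenEstimator, but the idea is the same):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10_000, random_state=0)
# Three-way split: train the estimator, calibrate on a separate hold-out set,
# and keep a test set for evaluation.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

base = LinearSVC().fit(X_train, y_train)          # already fitted

# cv="prefit": the base estimator is left untouched; only the calibrator is
# fit, on data the estimator has never seen.
calibrated = CalibratedClassifierCV(base, cv="prefit", method="sigmoid")
calibrated.fit(X_calib, y_calib)

print(calibrated.predict_proba(X_test)[:5, 1])
```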

Calibration method

Two options are available:

| Factor | method=sigmoid | method=isotonic |
| --- | --- | --- |
| Parametricity | Parametric (Platt scaling) | Non-parametric |
| Dataset size | Works with small calibration sets | Needs larger calibration sets (~1000s of samples) |
| Limitations | - Only corrects under-confident (sigmoid-shaped) miscalibration<br>- Requires symmetrical calibration errors<br>- Doesn’t work well on highly imbalanced datasets | - Tends to overfit on smaller datasets |
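A small comparison of the two methods (toy data and a deliberately overconfident Naive Bayes base model, both my own choices), scored with the Brier score, a proper scoring rule that rewards calibrated probabilities:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):
    model = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Lower Brier score indicates better-calibrated (and/or sharper) probabilities.
    print(method, brier_score_loss(y_test, proba))
```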

Comparison against threshold calibration

In threshold calibration, we adjust the decision threshold to achieve a particular precision-recall tradeoff.
This changes the class predictions, but it isn’t helpful if we care about the interpretability of the probabilities themselves.

Rescaling the scores after threshold tuning, so that the chosen threshold sits at 0.5 and p > 0.5 maps to the positive label, may further miscalibrate the model.
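To make the distinction concrete, a small sketch (imbalanced toy data, thresholds picked arbitrarily): moving the decision threshold changes the class predictions while leaving the predicted probabilities, and hence their calibration, untouched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Threshold calibration: trade precision for recall by lowering the cutoff.
# The class predictions change, but the probabilities themselves do not,
# so their interpretation as event frequencies is untouched (unless we
# rescale them around the new threshold, which can distort them).
preds_default = (proba >= 0.5).astype(int)
preds_lowered = (proba >= 0.3).astype(int)
print("positives at t=0.5:", preds_default.sum())
print("positives at t=0.3:", preds_lowered.sum())
```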

Footnotes

  1. Not sure how this is derived, though it appears to follow from conditioning the squared error on $\hat{p}$ and applying the law of total variance.

  2. a.k.a. reliability curve

  3. model architecture being miscalibrated out-of-the-box, curse of dimensionality, regularization, re-weighting and boosting