A model can be tested using the following types of tests:

Perturbation tests
Invariance tests
Directional expectation tests
Model calibration
Confidence measurement
Slice-based evaluation

Perturbation tests

Adjust the input features slightly and ensure that the model’s prediction remains unchanged.

One way to measure or log this is to count the predictions that change when the inputs are perturbed by an amount $\epsilon$.
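A minimal sketch of that count, expressed as a fraction of rows. The names `perturbation_change_rate` and `toy_predict` are hypothetical, the noise is assumed to be uniform in [-epsilon, epsilon] on every numeric feature, and the toy threshold model only stands in for a real one.

```python
import random

def perturbation_change_rate(predict, rows, epsilon, seed=0):
    """Fraction of rows whose predicted label changes under small uniform noise."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    changed = 0
    for row in rows:
        # Add independent noise in [-epsilon, epsilon] to every feature.
        noisy = [x + rng.uniform(-epsilon, epsilon) for x in row]
        if predict(noisy) != predict(row):
            changed += 1
    return changed / len(rows)

# Hypothetical stand-in model: predicts 1 when the feature sum exceeds 1.0.
def toy_predict(row):
    return int(sum(row) > 1.0)

rows = [[0.1, 0.2], [0.9, 0.9], [0.5, 0.49], [0.3, 0.1]]
rate_zero = perturbation_change_rate(toy_predict, rows, epsilon=0.0)
rate_small = perturbation_change_rate(toy_predict, rows, epsilon=0.001)
rate_large = perturbation_change_rate(toy_predict, rows, epsilon=0.5)
```

Logging this rate for a few values of epsilon gives a rough picture of how close the data sits to the decision boundary. What counts as an acceptable rate depends on the use case.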

Important

Related to this idea, and perhaps more important, is ensuring that the training and inference datasets are drawn from the same distribution (the i.i.d. assumption).

Todo

What techniques/libraries are there for different data modalities?

Invariance tests

Identify invariances that the model is expected to have, given the problem domain. Ensure the model gives the same output when only those aspects of the input change.

If possible, it’s better to remove features that shouldn’t affect the model’s prediction in the first place. This applies to both PII fields and uninformative features, lest they inform the model in any way.
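One way to sketch an invariance test: swap a field that should not matter and collect the rows where the prediction changes. The helper `invariance_violations`, the `name` and `income` fields, and the toy approval rule are all assumptions made for illustration.

```python
def invariance_violations(predict, rows, field, alternatives):
    """Return (row, value) pairs where swapping `field` changes the prediction."""
    violations = []
    for row in rows:
        baseline = predict(row)
        for value in alternatives:
            variant = {**row, field: value}  # copy of the row with one field swapped
            if predict(variant) != baseline:
                violations.append((row, value))
                break  # one counterexample per row is enough
    return violations

# Hypothetical model that (correctly) ignores the applicant's name.
def toy_predict(row):
    return "approve" if row["income"] >= 50_000 else "reject"

rows = [
    {"name": "Alice", "income": 72_000},
    {"name": "Bob", "income": 31_000},
]
violations = invariance_violations(toy_predict, rows, "name", ["Carol", "Dinesh"])
```

An empty `violations` list means the invariance held on these rows. It does not prove that the model ignores the field everywhere.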

Directional expectation tests

Certain fields, when changed, should lead to a logical change in the model’s prediction.

For example, increasing the square footage of a house in a house price prediction model should yield a higher predicted price.
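The house-price example could be checked roughly as follows. The linear `toy_price_model`, its coefficients, and the field names `sqft` and `bedrooms` are made up for illustration. Only the direction of the change is tested, not its size.

```python
def directional_violations(predict, rows, field, delta):
    """Return rows where increasing `field` by `delta` does not raise the prediction."""
    bad = []
    for row in rows:
        bumped = {**row, field: row[field] + delta}
        if predict(bumped) <= predict(row):
            bad.append(row)
    return bad

# Hypothetical price model with invented coefficients.
def toy_price_model(row):
    return 50_000 + 150 * row["sqft"] + 10_000 * row["bedrooms"]

houses = [
    {"sqft": 900, "bedrooms": 2},
    {"sqft": 2100, "bedrooms": 4},
]
violations = directional_violations(toy_price_model, houses, "sqft", delta=100)
```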

Model Calibration

This is relevant if your use case cares about the model’s predictions aligning with some business interpretation.

For example, if a Netflix user watches action and romantic shows 80% and 20% of the time respectively, the model should predict action for that user 80% of the time, not 100% of the time.
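One common way to quantify this is a binned calibration error: group predictions by predicted probability and compare each group's average prediction with the observed rate. The sketch below assumes a binary outcome (1 = the user picked action) and equal-width bins. The function name and the ten-session data are illustrative, not taken from a real system.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |observed positive rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(positive_rate - mean_prob)
    return ece

# Ten sessions for one user: action was chosen in 8 of them.
labels = [1] * 8 + [0] * 2
calibrated = expected_calibration_error([0.8] * 10, labels)     # close to 0.0
overconfident = expected_calibration_error([1.0] * 10, labels)  # close to 0.2
```

A model that always outputs 1.0 for action still picks the majority genre, but its probabilities overstate the observed rate by about 0.2.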

Confidence measurement

This is measured at the per-sample level, and is closely related to Model Calibration.

Does your ML use case have special handling for unconfident predictions?
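One possible form of special handling is to route low-confidence predictions to a fallback, such as manual review. The sketch below assumes the model returns a dict of class probabilities. The threshold of 0.7, the `needs_review` label, and the function name `route_prediction` are arbitrary choices for illustration.

```python
def route_prediction(class_probs, threshold=0.7):
    """Return (label, confidence), or ("needs_review", confidence) when below threshold."""
    label = max(class_probs, key=class_probs.get)  # most likely class
    confidence = class_probs[label]
    if confidence < threshold:
        return "needs_review", confidence
    return label, confidence

confident = route_prediction({"action": 0.92, "romance": 0.08})
unsure = route_prediction({"action": 0.55, "romance": 0.45})
```

A threshold like this is only meaningful if the probabilities are reasonably calibrated, which is why the two topics are linked.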

Slice-based evaluation

Why evaluate performance on population slices?

Global population metrics can be misleading.¹
Model performance may be more important for certain sub-populations.
It helps to identify (non-ML) issues unique to certain sub-populations.

How to form sub-populations?

Use heuristics
Observations from performing error analysis
Data-driven solutions²
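A small sketch of slice-based evaluation using a heuristic slice: compute accuracy per value of one field and compare it with the global figure. The `region` field, the record layout, and the data are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Map each value of `slice_key` to the accuracy on that slice."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in records:
        s = r[slice_key]
        counts[s] += 1
        hits[s] += int(r["label"] == r["prediction"])
    return {s: hits[s] / counts[s] for s in counts}

# Hypothetical evaluation records: 4 urban (all correct), 2 rural (1 correct).
records = [
    {"region": "urban", "label": 1, "prediction": 1},
    {"region": "urban", "label": 0, "prediction": 0},
    {"region": "urban", "label": 1, "prediction": 1},
    {"region": "urban", "label": 0, "prediction": 0},
    {"region": "rural", "label": 1, "prediction": 1},
    {"region": "rural", "label": 1, "prediction": 0},
]
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "region")
```

Here the overall accuracy is 5 out of 6, while the rural slice is right only half the time. This is the kind of gap a global metric can hide.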

¹ Look up Simpson’s paradox.
² Check out Slice Finder.
