A model can be tested using the following types of tests:

  1. Perturbation tests
  2. Invariance tests
  3. Directional expectation tests
  4. Model calibration
  5. Confidence measurement
  6. Slice-based evaluation

Perturbation tests

Adjust the input features slightly and ensure that the model’s prediction remains unchanged.

One way to measure or log this is to count the predictions that change under a given amount of perturbation.
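A minimal sketch of that measurement follows. The `predict` function is a hypothetical stand-in for a real model, and Gaussian noise is only one possible choice of perturbation; both are assumptions made for illustration.

```python
import random

def predict(features):
    """Hypothetical stand-in model: thresholds a weighted sum of the features."""
    weights = [0.5, -0.25, 1.0]
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def perturbation_flip_rate(rows, noise_scale, seed=0):
    """Fraction of rows whose prediction changes after adding Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the test is repeatable
    flipped = 0
    for row in rows:
        noisy = [x + rng.gauss(0.0, noise_scale) for x in row]
        if predict(noisy) != predict(row):
            flipped += 1
    return flipped / len(rows)

rows = [[2.0, 1.0, 3.0], [-4.0, 2.0, -1.0], [0.01, 0.0, 0.0]]
for scale in (0.0, 0.01, 1.0):
    # Log the flip rate at each perturbation size.
    print(scale, perturbation_flip_rate(rows, scale))
```

Logging the flip rate at several noise scales gives a rough picture of how quickly predictions become unstable as the perturbation grows.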

Important

Related to this idea, and perhaps more important, is ensuring that the training and inference datasets are i.i.d., that is, drawn from the same underlying distribution.
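One possible way to check this for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. This technique is not named in the notes above; the sketch below is a plain-Python illustration with made-up sample values, and any alerting threshold would be an assumption to tune per feature.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in points
    )

# Hypothetical values of one feature at training time and at inference time.
train_values = [1.0, 1.2, 1.9, 2.1, 2.5, 3.0]
inference_values = [2.9, 3.1, 3.4, 3.8, 4.0, 4.2]
print(ks_statistic(train_values, inference_values))
```

A value near 0 suggests similar distributions, while a value near 1 suggests the feature has shifted between training and inference.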

Todo

What techniques/libraries are there for different data modalities?

Invariance tests

Identify invariances that the model is expected to have, given the problem domain. Ensure the model gives the same output when only the inputs it should be invariant to are changed.

If possible, it’s better to remove features that shouldn’t affect the model’s prediction in the first place. This applies to both PII fields and uninformative features, lest they inform the model in any way.
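An invariance test can be sketched as follows. The `predict` function, the loan-style record, and its field names are all hypothetical; the idea is simply to swap one field's value and confirm the prediction does not move.

```python
def predict(record):
    """Hypothetical model: decides from income and debt only."""
    return 1 if record["income"] - 2 * record["debt"] > 10_000 else 0

def check_invariance(record, field, alternatives):
    """Return the alternative values of `field` that change the prediction."""
    baseline = predict(record)
    violations = []
    for value in alternatives:
        variant = {**record, field: value}  # copy with one field replaced
        if predict(variant) != baseline:
            violations.append(value)
    return violations

applicant = {"name": "Alice", "income": 50_000, "debt": 5_000}
violations = check_invariance(applicant, "name", ["Bob", "Chandra", "Dmitri"])
assert violations == [], f"prediction depends on name: {violations}"
```

Returning the offending values, rather than a bare pass/fail, makes it easier to see which substitutions broke the invariance.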

Directional expectation tests

Certain fields, when changed, should lead to a predictable change in the model’s prediction.

For example, increasing the square footage of a house in a house price prediction model should yield a higher predicted price.
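The square footage example can be turned into a test along these lines. `predict_price` is a hypothetical fixed formula standing in for a trained model; the check holds every other feature constant and asserts that the price rises with square footage.

```python
def predict_price(sqft, bedrooms):
    """Hypothetical house price model (a fixed linear formula for illustration)."""
    return 50_000 + 120 * sqft + 10_000 * bedrooms

def is_increasing_in_sqft(sqft_values, bedrooms):
    """True if the predicted price strictly increases as square footage grows."""
    prices = [predict_price(s, bedrooms) for s in sorted(sqft_values)]
    return all(lower < higher for lower, higher in zip(prices, prices[1:]))

assert is_increasing_in_sqft([800, 1200, 1600, 2400], bedrooms=3)
```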

Model calibration

This is relevant if your use case cares about the model’s predictions aligning with some business interpretation.

For example, if a Netflix user watches action and romantic shows 80% and 20% of the time respectively, the model should predict action for that user 80% of the time, and not 100% of the time.
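One common way to quantify this, assuming the model outputs probabilities, is expected calibration error (ECE): bin predictions by predicted probability and compare each bin's mean prediction with the observed positive rate. ECE is not named in the notes above; the sketch below is one simple formulation with equal-width bins, applied to made-up data mirroring the 80/20 example.

```python
def expected_calibration_error(probs, labels, n_bins=5):
    """Size-weighted average gap between mean predicted probability and
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_prob - positive_rate)
    return ece

# Ten titles watched by one user: 8 action (label 1), 2 romance (label 0).
labels = [1] * 8 + [0] * 2
calibrated = [0.8] * 10     # predicts "action" with probability 0.8
overconfident = [1.0] * 10  # predicts "action" with certainty

print(expected_calibration_error(calibrated, labels))     # approximately 0.0
print(expected_calibration_error(overconfident, labels))  # approximately 0.2
```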

Confidence measurement

This is measured at the per-sample level, and is closely related to model calibration.

Does your ML use case have special handling for unconfident predictions?
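Special handling could look like the sketch below, which routes low-confidence predictions to a fallback. The function name, the dictionary of class probabilities, and the 0.7 threshold are all assumptions chosen for illustration; a real threshold would depend on the use case and on how well calibrated the model is.

```python
def route_prediction(class_probs, threshold=0.7):
    """Return (label, confidence), with label None when confidence is too low."""
    label, confidence = max(class_probs.items(), key=lambda item: item[1])
    if confidence < threshold:
        return None, confidence  # caller should use a fallback, e.g. human review
    return label, confidence

print(route_prediction({"action": 0.92, "romance": 0.08}))
print(route_prediction({"action": 0.55, "romance": 0.45}))
```

The first call returns the label because 0.92 clears the threshold; the second returns `None` so that downstream code can apply its fallback.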

Slice-based evaluation

Why evaluate performance on population slices?

  1. Global population metrics can be misleading. [1]
  2. Model performance may be more important for certain sub-populations.
  3. It helps to identify (non-ML) issues unique to certain sub-populations.
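The first point can be illustrated with a small made-up example in which a healthy global accuracy hides a weak slice. The slice names and counts below are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice name, true label, predicted label).
records = (
    [("desktop", 1, 1)] * 90   # 90 correct desktop predictions
    + [("desktop", 1, 0)] * 5  # 5 wrong desktop predictions
    + [("mobile", 1, 1)] * 2   # 2 correct mobile predictions
    + [("mobile", 1, 0)] * 3   # 3 wrong mobile predictions
)

def accuracy_by_slice(records):
    """Accuracy computed separately for each slice name."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / total[name] for name in total}

overall = sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)
per_slice = accuracy_by_slice(records)
print(overall)    # 0.92
print(per_slice)  # desktop is about 0.95, mobile is 0.4
```

Here the overall accuracy of 0.92 is dominated by the large desktop slice, while the small mobile slice is correct only 40% of the time.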

How to form sub-populations?

  1. Use heuristics
  2. Observations from performing error analysis
  3. Data-driven solutions [2]
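The heuristic approach can be sketched by writing each candidate slice as a named predicate and ranking slices by error rate, which also gives a starting point for error analysis. The rows, field names, and slice definitions below are hypothetical, and the minimum slice size is an assumed guard against noisy estimates on tiny slices.

```python
# Hypothetical evaluation rows; the field names are illustrative only.
rows = [
    {"country": "US", "age": 34, "label": 1, "pred": 1},
    {"country": "US", "age": 71, "label": 0, "pred": 1},
    {"country": "SG", "age": 22, "label": 1, "pred": 1},
    {"country": "SG", "age": 68, "label": 1, "pred": 0},
    {"country": "SG", "age": 45, "label": 0, "pred": 0},
]

# Heuristic slices expressed as named predicates; slices may overlap.
slices = {
    "all": lambda r: True,
    "age_65_plus": lambda r: r["age"] >= 65,
    "country_SG": lambda r: r["country"] == "SG",
}

def error_rate(members):
    """Fraction of rows whose prediction differs from the label."""
    return sum(r["label"] != r["pred"] for r in members) / len(members)

def rank_slices(rows, slices, min_size=2):
    """Error rate per slice, worst first, skipping slices that are too small."""
    report = []
    for name, belongs in slices.items():
        members = [r for r in rows if belongs(r)]
        if len(members) >= min_size:
            report.append((name, len(members), error_rate(members)))
    return sorted(report, key=lambda item: item[2], reverse=True)

for name, size, err in rank_slices(rows, slices):
    print(f"{name}: n={size}, error_rate={err:.2f}")
```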

Footnotes

  1. Look up Simpson’s paradox.

  2. Check out Slice Finder.