Four general requirements of a generic ML system:

  1. Reliability
  2. Scalability
  3. Maintainability
  4. Adaptability

Reliability

Type | Description
Engineering reliability | Logical correctness in ML and business logic (i.e. the proper functions are being called). Also related to metrics like uptime and the DORA stability metrics.
ML reliability | Predictions are correct, and the system does not fail silently (e.g. by returning incorrect predictions without raising an error).
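
One way to reduce silent failures is to wrap the model call in explicit sanity checks. The sketch below is illustrative only: `guarded_predict`, `PredictionError`, and the assumption that the model returns a probability in [0, 1] are hypothetical and not part of the notes above.

```python
import math
from typing import Callable, Sequence


class PredictionError(Exception):
    """Raised when an input or a prediction fails a sanity check."""


def guarded_predict(
    model: Callable[[Sequence[float]], float],
    features: Sequence[float],
    expected_len: int,
) -> float:
    # Engineering-side check: the caller passed the shape the model expects.
    if len(features) != expected_len:
        raise PredictionError(
            f"expected {expected_len} features, got {len(features)}"
        )
    # ML-side check: NaN/inf inputs can produce plausible-looking garbage.
    if any(not math.isfinite(x) for x in features):
        raise PredictionError("non-finite feature value")
    score = model(features)
    # Assumption: the model outputs a probability, so it must lie in [0, 1].
    if not (math.isfinite(score) and 0.0 <= score <= 1.0):
        raise PredictionError(f"score out of range: {score}")
    return score
```

Failing loudly here turns a would-be silent ML failure into an engineering failure that shows up in error rates and alerts.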

Scalability

ML applications can scale in a few aspects:

  • model complexity
  • model counts [1]
  • traffic

As a system scales along these axes, attention has to be given to ML Resource Management and ML Artifact Management.
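
As an illustration of resource management under growing model counts, here is a minimal sketch of a bounded in-memory cache that loads one model per tenant on demand and evicts the least recently used one. `ModelCache`, `tenant_id`, and the `loader` callback are hypothetical names; a real system would load from its own artifact store.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

M = TypeVar("M")


class ModelCache(Generic[M]):
    """Keep at most `capacity` models in memory (least recently used is evicted)."""

    def __init__(self, loader: Callable[[str], M], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self._loader = loader
        self._capacity = capacity
        self._models: "OrderedDict[str, M]" = OrderedDict()

    def get(self, tenant_id: str) -> M:
        if tenant_id in self._models:
            self._models.move_to_end(tenant_id)  # mark as most recently used
            return self._models[tenant_id]
        model = self._loader(tenant_id)  # e.g. fetch from an artifact store
        self._models[tenant_id] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model

    def loaded(self) -> List[str]:
        # Tenant ids currently in memory, least recently used first.
        return list(self._models)
```

The capacity bound keeps memory use flat as the number of tenants grows, at the cost of reload latency for rarely used models.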

Maintainability

Aspect | Description
Reproducibility | Requires code, data, artifacts, and context (how the different pieces work and are strung together)
Effective collaboration | Between teams, or between team members
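
The four reproducibility ingredients can be captured together in a single record per training run. This is a minimal sketch, assuming content hashes are an acceptable way to identify data and artifacts; `RunManifest` and its field names are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_bytes(payload: bytes) -> str:
    """Content fingerprint used to identify data and artifacts."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class RunManifest:
    code_version: str     # code: e.g. a git commit hash
    data_sha256: str      # data: fingerprint of the training set
    artifact_sha256: str  # artifacts: fingerprint of the trained model file
    context: dict         # context: config, pipeline steps, library versions

    def to_json(self) -> str:
        # sort_keys makes the serialized manifest deterministic.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # Two runs with the same fingerprint used the same recorded inputs.
        return sha256_bytes(self.to_json().encode("utf-8"))
```

Storing the manifest next to the model also helps collaboration: a teammate can see exactly which code, data, and configuration produced a given artifact.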

Adaptability

A system should easily accept improvements without service interruptions.

Improvement type | Description
Data | More, more recent, or higher-quality data points
Model | New architecture; more features

Deploying these changes should be easy and fast, as measured by DORA speed metrics such as deployment frequency and lead time for changes.
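
Accepting a new model without a service interruption can be sketched as an in-process hot swap: the replacement is smoke-tested first, then switched in atomically while the old model keeps serving. `ModelServer`, `swap`, and the smoke-test input are hypothetical; production systems often do this at the infrastructure level instead (for example with rolling or blue-green deployments).

```python
import threading
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


class ModelServer:
    """Serve predictions while allowing the model to be replaced at runtime."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, features: Sequence[float]) -> Tuple[float, str]:
        # Take a consistent (model, version) snapshot, then predict
        # outside the lock so a slow model does not block a swap.
        with self._lock:
            model, version = self._model, self._version
        return model(features), version

    def swap(self, new_model: Model, new_version: str,
             smoke_input: Sequence[float]) -> None:
        # If the new model raises here, the old one simply keeps serving.
        new_model(smoke_input)
        with self._lock:
            self._model, self._version = new_model, new_version
```

Returning the version with each prediction makes it possible to attribute any change in behaviour to a specific deployment.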

Footnotes

  1. More models serving more use cases or customers; common in multi-tenant and SaaS applications.