Four general requirements of a generic ML system:
Reliability
Type | Description |
---|---|
Engineering reliability | The ML and business logic is logically correct (i.e., the right functions are called). Also covers metrics such as uptime and the DORA stability metrics |
ML reliability | Predictions are correct, with no silent failures (the system runs without raising errors but returns wrong predictions) |
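A silent failure is easiest to picture as a prediction that looks like a normal response but is not plausible. The sketch below is one hypothetical way to surface such cases, assuming the model returns a probability-like score in a known range; the names `check_prediction` and `constant_output_ratio` and the `[0, 1]` bounds are illustrative assumptions, not part of any particular library.

```python
import math
from dataclasses import dataclass


@dataclass
class PredictionCheck:
    ok: bool
    reason: str


def check_prediction(score: float, low: float = 0.0, high: float = 1.0) -> PredictionCheck:
    """Flag a single output that would otherwise pass through unnoticed."""
    # NaN or infinity usually means a numerical problem upstream.
    if math.isnan(score) or math.isinf(score):
        return PredictionCheck(False, "non-finite score")
    # Assumed valid range; adjust to whatever the model is supposed to emit.
    if not (low <= score <= high):
        return PredictionCheck(False, f"score {score} outside [{low}, {high}]")
    return PredictionCheck(True, "ok")


def constant_output_ratio(scores: list[float]) -> float:
    """Share of predictions equal to the most common value.

    A ratio close to 1.0 over a large batch can indicate a stuck model
    (for example, one that always returns its default output).
    """
    if not scores:
        return 0.0
    most_common = max(set(scores), key=scores.count)
    return scores.count(most_common) / len(scores)
```

Checks like these do not prove that predictions are correct; they only turn some otherwise silent failures into visible ones that can be logged or alerted on.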
Scalability
ML applications can scale in a few aspects:
- model complexity
- model count[^1]
- traffic
As any of these grow, attention has to be given to ML Resource Management and ML Artifact Management.
Maintainability
Aspect | Description |
---|---|
Reproducibility | Requires code, data, artifacts, and context (how the different pieces work and are strung together) |
Effective collaboration | Between teams, or team members |
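The four reproducibility ingredients in the table can be tied together in a single record. The following is a minimal sketch, assuming the code is identified by a version string (such as a commit hash) and that data and artifacts are available as bytes; `build_manifest` and `manifest_id` are hypothetical helper names used only for illustration.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Content hash used to detect any change in data or artifacts."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code_version: str, data: bytes, artifact: bytes, context: dict) -> dict:
    """Record the four ingredients of a reproducible run in one place."""
    return {
        "code": code_version,                    # e.g. a commit hash
        "data_sha256": fingerprint(data),        # training data snapshot
        "artifact_sha256": fingerprint(artifact),  # e.g. serialized model weights
        "context": context,                      # how the pieces are wired together
    }


def manifest_id(manifest: dict) -> str:
    """Stable identifier: identical inputs always yield the same id."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return fingerprint(canonical)
```

Two runs with the same manifest id used the same code, data, artifacts, and context, which also gives team members a shared handle for discussing a specific run.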
Adaptability
A system should easily accept improvements without service interruptions.
Improvement Type | Description |
---|---|
Data | More / recent / higher-quality data points |
Model | New architecture; more features |
Deploying these changes should be easy and fast, as measured by the DORA speed metrics (deployment frequency and lead time for changes).
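One way to accept a new model without interrupting service is to swap it in atomically while requests continue to be served. This is a minimal in-process sketch under simplifying assumptions: a model is just a Python callable, and `ModelServer`, `swap`, and the smoke-test input are hypothetical names, not an established serving API.

```python
import threading
from typing import Callable

Model = Callable[[float], float]


class ModelServer:
    """Serves predictions and allows replacing the model while running."""

    def __init__(self, model: Model, version: str) -> None:
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, x: float) -> tuple[str, float]:
        # Take a consistent (model, version) pair, then run inference
        # outside the lock so slow predictions do not block a swap.
        with self._lock:
            model, version = self._model, self._version
        return version, model(x)

    def swap(self, new_model: Model, new_version: str, smoke_input: float = 0.0) -> None:
        # Smoke-test first: if the new model raises, the old one keeps serving.
        new_model(smoke_input)
        with self._lock:
            self._model, self._version = new_model, new_version
```

For example, `ModelServer(old_model, "v1").swap(new_model, "v2")` replaces the model between requests; in-flight calls finish on the version they started with. Production systems typically achieve the same effect across processes with rolling or blue-green deployments.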
Footnotes
[^1]: More models for more use cases or customers; common in multi-tenant and SaaS applications.