Types of Data Drift
Three types:
Covariate/Feature Drift
When the input distribution P(X) changes while the conditional P(Y|X) stays the same.
This can stem from several causes, including:
- Changes in the real world
- Featurization issues
- A training distribution that differs from the inference distribution
Importance Weighting
If the training and inference distributions should be the same but differ, and you know how they differ, you can weight the training samples by the ratio of their densities under the two distributions.
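A minimal sketch of the idea, assuming we already have density estimates for each training sample under both distributions (in practice the ratio is often estimated indirectly, e.g. with a classifier trained to separate the two sets):

```python
import numpy as np

def importance_weights(p_inference: np.ndarray, p_train: np.ndarray,
                       eps: float = 1e-12) -> np.ndarray:
    """Weight each training sample by its density under the inference
    distribution divided by its density under the training distribution."""
    return p_inference / (p_train + eps)

# Hypothetical densities for four training samples under each distribution.
p_train = np.array([0.40, 0.30, 0.20, 0.10])
p_inference = np.array([0.10, 0.20, 0.30, 0.40])

weights = importance_weights(p_inference, p_train)
print(weights)  # samples over-represented in training get down-weighted

# These can then be passed as sample weights to a weighted loss, e.g.
# model.fit(X, y, sample_weight=weights) in scikit-learn-style APIs.
```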
Label Drift
When the label distribution P(Y) changes while the conditional P(X|Y) stays the same.
Concept Drift
When the conditional P(Y|X) changes while the input distribution P(X) stays the same: the same inputs now map to different outputs.
This can be thought of as a kind of feature drift if we are able to model the change as an extraneous, latent feature. Such a feature may change only rarely, which is why it shows up as a change in the concept learnt by the model.
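These definitions correspond to the two factorisations of the joint distribution:

$$
P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)
$$

Covariate and concept drift each change one factor of the first factorisation while the other stays fixed; label drift does the same for the second.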
Detecting drifts
Three ways:
Tip
If a statistically significant difference is detectable from a small number of samples, it's likely to be a serious difference.
Two-sample tests
These are hypothesis tests that check whether two sets of samples come from the same distribution.
A very popular one is the Kolmogorov-Smirnov (KS) test. More are available in Alibi Detect.
One general issue with such tests is that they work better on 1-D data than on high-dimensional data. They also tend to produce a lot of false positives.
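A minimal sketch on a single feature using SciPy's two-sample KS test (Alibi Detect's `KSDrift` detector wraps the same test for multi-feature inputs):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference (training-time) values of one feature vs. recent inference values.
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
current = rng.normal(loc=0.3, scale=1.0, size=1000)  # slightly shifted

statistic, p_value = ks_2samp(reference, current)
if p_value < 0.05:
    print(f"Drift suspected (KS={statistic:.3f}, p={p_value:.4f})")
else:
    print(f"No significant drift (KS={statistic:.3f}, p={p_value:.4f})")
```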
Compare distributions across timescales
By lengthening the timescale (and collecting more data), one can observe trends better.
Also, prefer sliding-window statistics over cumulative ones, as the former are more sensitive to changes in performance metrics.
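A minimal illustration of that sensitivity, assuming a hypothetical stream of daily accuracy values that drops partway through:

```python
import numpy as np

# Hypothetical daily accuracy: stable at 0.90 for 60 days, then 0.75 for 10.
accuracy = np.concatenate([np.full(60, 0.90), np.full(10, 0.75)])

cumulative_mean = np.cumsum(accuracy) / np.arange(1, len(accuracy) + 1)

window = 7  # sliding window over the last 7 days
sliding_mean = np.convolve(accuracy, np.ones(window) / window, mode="valid")

print(f"Latest cumulative mean: {cumulative_mean[-1]:.3f}")  # ~0.879, drop diluted
print(f"Latest 7-day mean:      {sliding_mean[-1]:.3f}")     # 0.750, drop obvious
```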
Reduce dimensionality and visually compare
This is my own idea. For high-dimensional datasets where the features may not be independent, reducing the dimensionality of the features with algorithms like PCA or t-SNE, then visually comparing the training and inference embeddings, can help distinguish feature drift from label drift.
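A minimal sketch with scikit-learn's PCA, fitting a single projection on the combined data and colouring points by split (the shift in the synthetic inference set is an assumption for illustration; colouring by label instead would help surface label drift):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical 50-dimensional features; the inference set is shifted.
X_train = rng.normal(0.0, 1.0, size=(500, 50))
X_infer = rng.normal(0.5, 1.0, size=(500, 50))

# Fit PCA on the combined data so both sets share one projection.
pca = PCA(n_components=2)
Z = pca.fit_transform(np.vstack([X_train, X_infer]))

plt.scatter(Z[:500, 0], Z[:500, 1], s=5, alpha=0.5, label="train")
plt.scatter(Z[500:, 0], Z[500:, 1], s=5, alpha=0.5, label="inference")
plt.legend()
plt.title("Separated clusters suggest feature drift")
plt.show()
```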
Addressing Drifts
Train on massive, varied datasets
The larger and more varied the dataset, the higher the odds that the inference distribution is covered by the training distribution.
Fine-tuning
Fine-tune the model on the kinds of specific examples it misses in production.
- Could Cleanlab come into play here?
Preventing / Mitigating Data Drift
- Consider feature staleness
  - Good features that go stale quickly need more frequent retraining
- Use different models for different sub-populations if drift happens at different rates
  - If a sub-population's data changes much faster than the others, it may be better to serve it with a separate model rather than retraining all models at the same rate
- Worth monitoring each feature's drift over time (see the per-feature sketch below)
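A minimal per-feature monitoring sketch, running a KS test per column with a Bonferroni correction (the feature names and data are hypothetical):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

feature_names = ["age", "income", "session_length"]  # hypothetical features
X_ref = rng.normal(0.0, 1.0, size=(1000, 3))  # training-time snapshot
X_cur = rng.normal(0.0, 1.0, size=(1000, 3))  # recent inference data
X_cur[:, 1] += 0.5                            # drift only the second feature

alpha = 0.05 / len(feature_names)  # Bonferroni: one test per feature
for i, name in enumerate(feature_names):
    stat, p = ks_2samp(X_ref[:, i], X_cur[:, i])
    flag = "DRIFT" if p < alpha else "ok"
    print(f"{name:15s} KS={stat:.3f} p={p:.2e} -> {flag}")
```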