Types of Data Drift

Three types:

  1. Covariate/Feature Drift
  2. Label Drift
  3. Concept Drift

Covariate/Feature Drift

This is when the distribution of the input features, P(X), changes while the relationship between inputs and labels, P(Y|X), stays the same.

This can stem from various causes, including:

  • Changes in the real world
  • Featurization issues
  • Different training distribution compared to inference
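
To make the definition above concrete, here is a tiny illustration with made-up, placeholder data: the input distribution P(X) shifts between training and inference, but the labelling rule P(Y|X) does not.

```python
# Tiny illustration with placeholder data: P(X) shifts, but the labelling
# rule P(Y|X) -- here a fixed function of x -- stays the same.
import numpy as np

rng = np.random.default_rng(0)
label = lambda x: (x > 1.0).astype(int)  # same rule at training and inference time

x_train = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time P(X)
x_prod = rng.normal(loc=1.5, scale=1.0, size=1000)   # inference-time P(X) has shifted

# P(Y|X) is unchanged, but the label rate differs because P(X) differs.
print(label(x_train).mean(), label(x_prod).mean())
```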

Importance Weighting

If the training and inference distributions should be the same but differ, and you know how they differ, you can reweight the training samples by the ratio of the two probability densities.
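
A minimal sketch of one common way to do this, assuming the density ratios are not known in closed form and are instead estimated with a probabilistic classifier that separates training from production samples; the names `X_train`, `y_train`, `X_prod` and the `model` object are placeholders, not from these notes.

```python
# Minimal sketch: estimate density ratios with a probabilistic classifier,
# then use them as sample weights when retraining.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_train, X_prod):
    # Label training samples 0 and production (inference-time) samples 1,
    # then train a classifier to tell them apart.
    X = np.vstack([X_train, X_prod])
    z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    # p(prod | x) / p(train | x) approximates p_prod(x) / p_train(x)
    # (up to a constant factor when the two sets have different sizes).
    p = clf.predict_proba(X_train)[:, 1]
    return p / (1.0 - p)

# Usage: weight each training sample's contribution by its density ratio.
# weights = density_ratio_weights(X_train, X_prod)
# model.fit(X_train, y_train, sample_weight=weights)
```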

Label Drift

This is when the distribution of the labels, P(Y), changes while P(X|Y) stays the same.

Concept Drift

This is when the relationship between inputs and labels, P(Y|X), changes while P(X) stays the same: the same input now maps to a different output.

This can be thought of as a form of feature drift, if we are able to model the extraneous, latent feature whose change drives it. Such a feature may not change very often, which is why it shows up as a change in the concept learnt by the model rather than as an observable feature drift.


Detecting Drifts

Three ways:

  1. Two-sample tests
  2. Compare distributions across timescales
  3. Reduce dimensionality and visually compare

Tip

If a statistically significant difference is detectable from a small number of samples, it’s likely to be a serious difference.

Two-sample tests

There are many such tests, which use hypothesis testing to check whether two sets of samples come from the same distribution.

A very popular one is the Kolmogorov-Smirnov (KS) test. There are more available in Alibi Detect.

One issue with such tests in general is that they work better on one-dimensional data than on high-dimensional data. They also tend to produce a lot of false positives.
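
Since the KS test is one-dimensional, a common workaround is to run it per feature and correct for multiple comparisons. A minimal sketch using `scipy.stats.ks_2samp`, with placeholder reference and production arrays:

```python
# Minimal sketch: run a two-sample KS test per feature between a reference
# (training) window and a production window. X_ref / X_prod are placeholders.
from scipy.stats import ks_2samp

def ks_drift_report(X_ref, X_prod, alpha=0.05):
    n_features = X_ref.shape[1]
    # Bonferroni correction, since we run one test per feature.
    threshold = alpha / n_features
    drifted = []
    for j in range(n_features):
        stat, p_value = ks_2samp(X_ref[:, j], X_prod[:, j])
        if p_value < threshold:
            drifted.append((j, stat, p_value))
    return drifted  # list of (feature index, KS statistic, p-value)
```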

Compare distributions across timescales

By lengthening the timescale (and collecting more data), one can observe trends better.

Also, prefer sliding-window statistics over cumulative ones, as the former are more sensitive to changes in performance metrics.
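
A minimal sketch of the difference, assuming a placeholder pandas Series of a daily performance metric: the sliding-window mean reacts to recent changes, while the cumulative mean smooths them away.

```python
# Minimal sketch: sliding-window vs cumulative statistics of a daily metric.
# `accuracy_by_day` is a placeholder pandas Series indexed by date.
import pandas as pd

def windowed_views(accuracy_by_day: pd.Series, window: int = 14) -> pd.DataFrame:
    return pd.DataFrame({
        "sliding_mean": accuracy_by_day.rolling(window).mean(),  # reacts to recent change
        "cumulative_mean": accuracy_by_day.expanding().mean(),   # smooths change away
    })
```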

Reduce dimensionality and visually compare

This is my own idea. For high-dimensional datasets where the features may not be independent, reducing the dimensionality of the features using algorithms like PCA or t-SNE can help distinguish between feature and label drifts.
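
A minimal sketch of the idea using scikit-learn’s PCA and matplotlib, with placeholder feature arrays: fit the projection on the training (reference) data only, then overlay both sets.

```python
# Minimal sketch: project training and production features into 2-D with PCA
# and compare them visually. X_train / X_prod are placeholder arrays.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca_comparison(X_train, X_prod):
    pca = PCA(n_components=2).fit(X_train)  # fit on the reference data only
    ref_2d = pca.transform(X_train)
    prod_2d = pca.transform(X_prod)

    plt.scatter(ref_2d[:, 0], ref_2d[:, 1], s=5, alpha=0.3, label="training")
    plt.scatter(prod_2d[:, 0], prod_2d[:, 1], s=5, alpha=0.3, label="production")
    plt.legend()
    plt.show()
```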


Addressing Drifts

  1. Train on massive, varied datasets
  2. Fine-tuning

Train on massive, varied datasets

The larger and more varied the dataset, the higher the odds that the inference-time distribution is already covered by the training data.

Fine-tuning

Fine-tune the model on the more specific examples it misses in production.

todo Could Cleanlab come into play here?


Preventing / Mitigating Data Drift

  • Consider feature staleness
    • Good features that go stale quickly need more frequent retraining
  • Use different models for different sub-populations if drift happens at different rates
    • If a sub-population’s data changes much more quickly than the others, it might be better to use a separate model for it instead of retraining all models at the same rate
    • Worth monitoring each feature’s drift over time