Types of Data Drift

Three types:

  1. Covariate/Feature Drift
  2. Label Drift
  3. Concept Drift

Covariate/Feature Drift

This is when the distribution of the input features, P(X), changes while the relationship between inputs and labels, P(Y|X), stays the same.

This can stem from various causes, including:

  • Changes in the real world
  • Featurization issues
  • Different training distribution compared to inference
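
To make the definition above concrete, here is a tiny illustration with made-up, placeholder data: the input distribution P(X) shifts between training and inference, but the labelling rule P(Y|X) does not.

```python
# Tiny illustration with placeholder data: P(X) shifts, but the labelling
# rule P(Y|X) -- here a fixed function of x -- stays the same.
import numpy as np

rng = np.random.default_rng(0)
label = lambda x: (x > 1.0).astype(int)  # same rule at training and inference time

x_train = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time P(X)
x_prod = rng.normal(loc=1.5, scale=1.0, size=1000)   # inference-time P(X) has shifted

# P(Y|X) is unchanged, but the label rate differs because P(X) differs.
print(label(x_train).mean(), label(x_prod).mean())
```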

Importance Weighting

If the training and inference distributions should be the same but differ, and you know how they differ, you can reweight the training samples by the ratio of the two probability densities.
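
A minimal sketch of one common way to do this, assuming the density ratios are not known in closed form and are instead estimated with a probabilistic classifier that separates training from production samples; the names `X_train`, `y_train`, `X_prod` and the `model` object are placeholders, not from these notes.

```python
# Minimal sketch: estimate density ratios with a probabilistic classifier,
# then use them as sample weights when retraining.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_train, X_prod):
    # Label training samples 0 and production (inference-time) samples 1,
    # then train a classifier to tell them apart.
    X = np.vstack([X_train, X_prod])
    z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    # p(prod | x) / p(train | x) approximates p_prod(x) / p_train(x)
    # (up to a constant factor when the two sets have different sizes).
    p = clf.predict_proba(X_train)[:, 1]
    return p / (1.0 - p)

# Usage: weight each training sample's contribution by its density ratio.
# weights = density_ratio_weights(X_train, X_prod)
# model.fit(X_train, y_train, sample_weight=weights)
```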

Label Drift

This is when the distribution of the labels, P(Y), changes while P(X|Y) stays the same.

Concept Drift

This is when the relationship between inputs and labels, P(Y|X), changes while P(X) stays the same: the same input now maps to a different output.

This can be thought of as a form of feature drift, if we are able to model the extraneous, latent feature whose change drives it. Such a feature may not change very often, which is why it shows up as a change in the concept learnt by the model rather than as an observable feature drift.


Detecting Drifts

Three ways:

  1. Two-sample tests
  2. Compare distributions across timescales
  3. Reduce dimensionality and visually compare

Tip

If a statistically significant difference is detectable from a small number of samples, it’s likely to be a serious difference.

Two-sample tests

There are many such tests, which use hypothesis testing to check whether two sets of samples come from the same distribution.

A very popular one is the Kolmogorov-Smirnov (KS) test. There are more available in Alibi Detect.

One issue with such tests in general is that they work better on one-dimensional data than on high-dimensional data. They also tend to produce a lot of false positives.
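
Since the KS test is one-dimensional, a common workaround is to run it per feature and correct for multiple comparisons. A minimal sketch using `scipy.stats.ks_2samp`, with placeholder reference and production arrays:

```python
# Minimal sketch: run a two-sample KS test per feature between a reference
# (training) window and a production window. X_ref / X_prod are placeholders.
from scipy.stats import ks_2samp

def ks_drift_report(X_ref, X_prod, alpha=0.05):
    n_features = X_ref.shape[1]
    # Bonferroni correction, since we run one test per feature.
    threshold = alpha / n_features
    drifted = []
    for j in range(n_features):
        stat, p_value = ks_2samp(X_ref[:, j], X_prod[:, j])
        if p_value < threshold:
            drifted.append((j, stat, p_value))
    return drifted  # list of (feature index, KS statistic, p-value)
```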

Compare distributions across timescales

By lengthening the timescale (and collecting more data), one can observe trends better.

Also, prefer sliding-window statistics over cumulative ones, as the former are more sensitive to changes in performance metrics.
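
A minimal sketch of the difference, assuming a placeholder pandas Series of a daily performance metric: the sliding-window mean reacts to recent changes, while the cumulative mean smooths them away.

```python
# Minimal sketch: sliding-window vs cumulative statistics of a daily metric.
# `accuracy_by_day` is a placeholder pandas Series indexed by date.
import pandas as pd

def windowed_views(accuracy_by_day: pd.Series, window: int = 14) -> pd.DataFrame:
    return pd.DataFrame({
        "sliding_mean": accuracy_by_day.rolling(window).mean(),  # reacts to recent change
        "cumulative_mean": accuracy_by_day.expanding().mean(),   # smooths change away
    })
```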

Reduce dimensionality and visually compare

This is my own idea. For high-dimensional datasets where the features may not be independent, reducing the dimensionality of the features using algorithms like PCA or t-SNE can help distinguish between feature and label drifts.
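
A minimal sketch of the idea using scikit-learn’s PCA and matplotlib, with placeholder feature arrays: fit the projection on the training (reference) data only, then overlay both sets.

```python
# Minimal sketch: project training and production features into 2-D with PCA
# and compare them visually. X_train / X_prod are placeholder arrays.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca_comparison(X_train, X_prod):
    pca = PCA(n_components=2).fit(X_train)  # fit on the reference data only
    ref_2d = pca.transform(X_train)
    prod_2d = pca.transform(X_prod)

    plt.scatter(ref_2d[:, 0], ref_2d[:, 1], s=5, alpha=0.3, label="training")
    plt.scatter(prod_2d[:, 0], prod_2d[:, 1], s=5, alpha=0.3, label="production")
    plt.legend()
    plt.show()
```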


Addressing Drifts

  1. Train on massive, varied datasets
  2. Fine-tuning

Train on massive, varied datasets

The larger and more varied the dataset, the higher the odds that the inference-time distribution is already covered by the training data.

Fine-tuning

Fine-tune the model on the more specific examples it misses in production.

todo Could Cleanlab come into play here?


Preventing / Mitigating Data Drift

  • Consider feature staleness
    • Good features that go stale quickly need more frequent retraining
  • Use different models for different sub-populations if drift happens at different rates
    • If a sub-population’s data changes much more quickly than the others, it might be better to use a separate model for it instead of retraining all models at the same rate
    • Worth monitoring each feature’s drift over time