Data leakage detection
Question
How do you know if there’s leakage in training in the first place?
- Check modelling and featurization steps for leakage
    - Ensure that features do not use held-out test statistics
    - Ensure that the test set isn't used to inform hyperparameter optimization
- Check the feature importance plot of the model; investigate features that score suspiciously highly
- Run through the failure modes below.
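A cheap proxy for the feature-importance check is to scan each feature's correlation with the label before training anything: a near-perfect score is a strong leakage signal. A minimal stdlib-only sketch (the 0.9 threshold and the `honest`/`leaky` feature names are illustrative assumptions, not from the original notes):

```python
import random

def label_correlation(xs, ys):
    """Pearson correlation between one feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs) ** 0.5
    ssy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (ssx * ssy)

random.seed(0)
y = [random.random() for _ in range(200)]
honest = [random.random() for _ in range(200)]   # unrelated to the label
leaky = [v + random.gauss(0, 0.01) for v in y]   # label leaked into feature

scores = {"honest": abs(label_correlation(honest, y)),
          "leaky": abs(label_correlation(leaky, y))}
suspects = [name for name, s in scores.items() if s > 0.9]
```

This flags only the leaky feature; anything it surfaces still needs a manual look, since a legitimately predictive feature can also score highly.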
Failure modes
| Failure mode | Description | Mitigation |
|---|---|---|
| Splitting time-correlated data randomly | A random split puts future rows in the training set, leaking information that would not be available at inference time. | - Understand whether the data has time correlation - Respect time order when splitting (train on the past, test on the future) |
| Feature engineering before splitting | Statistics about the inference set (e.g. means used for scaling) are leaked into training. | - Best practice: only perform featurization after splitting - Use the `fit`/`transform` paradigm: `fit` on the training split only |
| Incorrect use of time-evolving features | The "snapshot" of a feature is taken past the point of inference, when the label is already known. | - Check whether data fields change over time for a given entity - Do point-in-time modelling of the data - Sync data more frequently - Use the data "snapshot" at the point of inference - Data distribution analysis |
| Data duplication in different splits | Training data duplicated in the validation or test data artificially boosts measured performance. | - Check for data duplication before and after splitting |
| Group leakage | Semantically duplicated data points appear in the validation and test sets, e.g. near-identical photos of the same subject taken milliseconds apart. | - No easy way to check this except going through the data manually - t-SNE may help surface clusters of near-duplicates |
| Leakage from data generation | The method or hardware that generated the data leaks information about the label; common when subsets of the dataset come from different sources. | - Domain knowledge - Data normalization by source - Tracking data lineage |
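The first row's mitigation can be sketched in a few lines: sort by timestamp and cut, instead of shuffling. A minimal version, assuming records carry a `ts` field (the key name and `test_frac` default are illustrative):

```python
def time_split(records, test_frac=0.2, ts_key="ts"):
    """Chronological split: train on the past, test on the future.

    Avoids the leakage caused by randomly shuffling time-correlated rows.
    """
    ordered = sorted(records, key=lambda r: r[ts_key])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]

rows = [{"ts": t, "x": t * 2} for t in (5, 1, 4, 2, 3)]
train, test = time_split(rows)
# every training timestamp now precedes every test timestamp
```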
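The `fit`/`transform` paradigm from the "feature engineering before splitting" row can be illustrated with standardization: statistics are fitted on the training split only and merely reused on the test split. A plain-Python sketch of the sklearn-style pattern (the `Standardizer` class is a hypothetical stand-in for `StandardScaler`):

```python
class Standardizer:
    """Fit mean/std on training data only, then reuse them on any split."""

    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        self.std = (sum((x - self.mean) ** 2 for x in xs) / n) ** 0.5
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

    def fit_transform(self, xs):
        return self.fit(xs).transform(xs)

train, test = [1.0, 2.0, 3.0], [10.0]
scaler = Standardizer()
train_z = scaler.fit_transform(train)  # statistics come from train only...
test_z = scaler.transform(test)        # ...and are reused here, never refit
```

Calling `fit_transform` on the full dataset before splitting would bake test-set statistics into every feature, which is exactly the leak the table warns about.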
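The exact-duplication check from the table is mechanical: build a hashable representation of each row and intersect across splits. A sketch assuming rows are already tuples (for dataframes you would hash a canonical row serialization instead):

```python
def cross_split_duplicates(train, test):
    """Return rows that appear in both splits (exact matches only)."""
    return set(map(tuple, train)) & set(map(tuple, test))

train = [(0.1, 1), (0.2, 0), (0.3, 1)]
test = [(0.3, 1), (0.4, 0)]
dupes = cross_split_duplicates(train, test)
```

Note this only catches exact duplicates; the group-leakage row covers the harder case of semantic near-duplicates, which this check will miss.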