Data leakage detection
Question
How do you know if there’s leakage in training in the first place?
- Check modelling and featurization steps for leakage
    - Ensure that features do not use held-out test statistics
    - Ensure that the test set isn't used to inform hyperparameter optimization
- Check the feature importance plot of the model; investigate features that score suspiciously highly
- Run through the failure modes below.
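A cheap proxy for the feature-importance check is to scan each feature's correlation with the label before training anything: a near-perfect score is a strong leakage signal. A minimal stdlib-only sketch (the 0.9 threshold and the `honest`/`leaky` feature names are illustrative assumptions, not from the original notes):

```python
import random

def label_correlation(xs, ys):
    """Pearson correlation between one feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs) ** 0.5
    ssy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (ssx * ssy)

random.seed(0)
y = [random.random() for _ in range(200)]
honest = [random.random() for _ in range(200)]   # unrelated to the label
leaky = [v + random.gauss(0, 0.01) for v in y]   # label leaked into feature

scores = {"honest": abs(label_correlation(honest, y)),
          "leaky": abs(label_correlation(leaky, y))}
suspects = [name for name, s in scores.items() if s > 0.9]
```

This flags only the leaky feature; anything it surfaces still needs a manual look, since a legitimately predictive feature can also score highly.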
Failure modes
| Failure mode | Description | Mitigation |
|---|---|---|
| Splitting time-correlated data randomly | A random split puts future rows in the training set, leaking information that would not be available at inference time. | - Understand whether the data has time correlation - Respect time order when splitting (train on the past, test on the future) |
| Feature engineering before splitting | Statistics about the inference set (e.g. means used for scaling) are leaked into training. | - Best practice: only perform featurization after splitting - Use the `fit`/`transform` paradigm: `fit` on the training split only |
| Incorrect use of time-evolving features | The "snapshot" of a feature is taken past the point of inference, when the label is already known. | - Check whether data fields change over time for a given entity - Do point-in-time modelling of the data - Sync data more frequently - Use the data "snapshot" at the point of inference - Data distribution analysis |
| Data duplication in different splits | Training data duplicated in the validation or test data artificially boosts measured performance. | - Check for data duplication before and after splitting |
| Group leakage | Semantically duplicated data points appear in the validation and test sets, e.g. near-identical photos of the same subject taken milliseconds apart. | - No easy way to check this except going through the data manually - t-SNE may help surface clusters of near-duplicates |
| Leakage from data generation | The method or hardware that generated the data leaks information about the label; common when subsets of the dataset come from different sources. | - Domain knowledge - Data normalization by source - Tracking data lineage |
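The first row's mitigation can be sketched in a few lines: sort by timestamp and cut, instead of shuffling. A minimal version, assuming records carry a `ts` field (the key name and `test_frac` default are illustrative):

```python
def time_split(records, test_frac=0.2, ts_key="ts"):
    """Chronological split: train on the past, test on the future.

    Avoids the leakage caused by randomly shuffling time-correlated rows.
    """
    ordered = sorted(records, key=lambda r: r[ts_key])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]

rows = [{"ts": t, "x": t * 2} for t in (5, 1, 4, 2, 3)]
train, test = time_split(rows)
# every training timestamp now precedes every test timestamp
```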
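The `fit`/`transform` paradigm from the "feature engineering before splitting" row can be illustrated with standardization: statistics are fitted on the training split only and merely reused on the test split. A plain-Python sketch of the sklearn-style pattern (the `Standardizer` class is a hypothetical stand-in for `StandardScaler`):

```python
class Standardizer:
    """Fit mean/std on training data only, then reuse them on any split."""

    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        self.std = (sum((x - self.mean) ** 2 for x in xs) / n) ** 0.5
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]

    def fit_transform(self, xs):
        return self.fit(xs).transform(xs)

train, test = [1.0, 2.0, 3.0], [10.0]
scaler = Standardizer()
train_z = scaler.fit_transform(train)  # statistics come from train only...
test_z = scaler.transform(test)        # ...and are reused here, never refit
```

Calling `fit_transform` on the full dataset before splitting would bake test-set statistics into every feature, which is exactly the leak the table warns about.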
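The exact-duplication check from the table is mechanical: build a hashable representation of each row and intersect across splits. A sketch assuming rows are already tuples (for dataframes you would hash a canonical row serialization instead):

```python
def cross_split_duplicates(train, test):
    """Return rows that appear in both splits (exact matches only)."""
    return set(map(tuple, train)) & set(map(tuple, test))

train = [(0.1, 1), (0.2, 0), (0.3, 1)]
test = [(0.3, 1), (0.4, 0)]
dupes = cross_split_duplicates(train, test)
```

Note this only catches exact duplicates; the group-leakage row covers the harder case of semantic near-duplicates, which this check will miss.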