Data leakage detection

Question

How do you know if there’s leakage in training in the first place?

  1. Check modelling and featurization steps for leakage
    • ensure that features do not use statistics from the held-out test set
    • ensure that the test set isn’t used to inform hyperparameter optimization
  2. Check the model’s feature importance plot; investigate any feature that scores suspiciously highly
  3. Run through the failure modes below.
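
Step 2 above can be sketched with scikit-learn (assumed available here); the dataset and the leaky column are synthetic. A feature that is effectively a copy of the label will dominate the importance ranking:

```python
# Detection sketch: a leaky feature dominates the feature importances.
# Assumes scikit-learn is installed; the "leak" column is a synthetic
# near-copy of the label, standing in for a real leaked field.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X_honest = rng.normal(size=(n, 5))          # 5 legitimate features
y = (X_honest[:, 0] + rng.normal(scale=2, size=n) > 0).astype(int)
leak = y + rng.normal(scale=0.01, size=n)   # near-copy of the label
X = np.column_stack([X_honest, leak])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_
suspect = int(np.argmax(importances))
print(f"most important feature index: {suspect}")  # the leaky column, index 5
```

In a real project you would then trace the top-ranked feature back to its source and ask whether it could be known at inference time.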

Failure modes

| Failure mode | Description | Mitigation |
| --- | --- | --- |
| Splitting time-correlated data randomly | Random splits place future data in the training set, leaking temporal information | - Understand whether the data is time-correlated<br>- Respect time correlation when splitting (e.g. split chronologically) |
| Feature engineering before splitting | Statistics about the inference set are leaked into training | - Best practice: only perform featurization after splitting<br>- Fit transforms on the training split only, then transform the other splits |
| Incorrect use of time-evolving features | The “snapshot” of a feature is taken after the point of inference, when the label is already known | - Check whether data fields change over time for a given entity<br>- Do point-in-time modelling of the data<br>- Sync data more frequently<br>- Use the data “snapshot” at the point of inference<br>- Analyze the data distribution |
| Data duplication in different splits | Training data duplicated in the validation and test sets artificially boosts performance | - Check for data duplication before and after splitting |
| Group leakage | Semantically duplicated data points appear across splits, e.g. near-identical photos of the same subject taken milliseconds apart | - No easy automated check; review the data manually<br>- t-SNE may help surface clusters of near-duplicates |
| Leakage from data generation | The method or hardware that generated the data leaks information about the label; common when subsets of the dataset come from different sources | - Domain knowledge<br>- Normalize data by source<br>- Track data lineage |
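
The first failure mode, splitting time-correlated data, is avoided by splitting chronologically rather than randomly. A minimal sketch, assuming a pandas DataFrame with a timestamp column:

```python
# Sketch: split time-correlated data chronologically instead of randomly.
# Column names here are illustrative, not from any particular dataset.
import pandas as pd

df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": range(10),
})
df = df.sort_values("ts")

cutoff = int(len(df) * 0.8)                 # 80/20 chronological split
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every training timestamp precedes every test timestamp: no future leakage.
assert train["ts"].max() < test["ts"].min()
```

For cross-validation on time series, scikit-learn's `TimeSeriesSplit` applies the same idea across multiple folds.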
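
For the "feature engineering before splitting" mode, the fix is to fit any statistic-bearing transform on the training split only and merely apply it to the others. A sketch using scikit-learn's `StandardScaler` as the example transform:

```python
# Sketch: fit featurization on the training split only, then apply to test.
# StandardScaler stands in for any transform that learns dataset statistics.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics come from train only
X_test_s = scaler.transform(X_test)         # test is transformed, never fitted

# The scaler's mean reflects the training split, not the full dataset.
print(scaler.mean_)
```

Calling `fit_transform` on the test set (or on the data before splitting) is exactly the leakage this row describes.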
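
Point-in-time modelling of time-evolving features can be sketched with `pandas.merge_asof`, which looks up the latest feature value at or before each label's timestamp. The column names below are hypothetical:

```python
# Sketch: point-in-time feature lookup, so each label row only sees the
# feature snapshot available at the point of inference, never a later one.
import pandas as pd

features = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-09"]),
    "balance": [100, 150, 50],
})
labels = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-04", "2024-01-08"]),
    "defaulted": [0, 1],
})

# direction="backward": match the most recent feature row at or before ts.
joined = pd.merge_asof(labels, features, on="ts", direction="backward")
print(joined["balance"].tolist())  # [100, 150]
```

Note that the 2024-01-08 label never sees the later balance of 50, even though that value exists in the feature table.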
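
The duplication check is straightforward when rows can be hashed; a minimal sketch using tuples as row identities:

```python
# Sketch: check for rows duplicated across train and test splits.
# Rows are represented as hashable tuples for illustration.
train = [(1, "a"), (2, "b"), (3, "c")]
test = [(3, "c"), (4, "d")]          # (3, "c") leaked into the test set

overlap = set(train) & set(test)
print(f"{len(overlap)} duplicated row(s) across splits: {overlap}")
```

Running the same check before splitting (deduplicating the raw dataset) prevents one copy of a duplicate pair landing in each split.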
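
Group leakage is hard to detect after the fact, but if the grouping key (e.g. the photo subject) is known, it can be prevented at split time by keeping each group entirely within one split. A sketch with scikit-learn's `GroupShuffleSplit`:

```python
# Sketch: keep all samples from one group (e.g. one photo subject) in the
# same split, so near-duplicate samples cannot straddle train and test.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(12).reshape(-1, 1)
groups = np.repeat([0, 1, 2, 3], 3)  # 4 subjects, 3 near-duplicate shots each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))

# No subject appears in both splits.
assert not set(groups[train_idx]) & set(groups[test_idx])
```

`GroupKFold` applies the same constraint to cross-validation.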