  1. Avoid using too many features
  2. Selecting important features
  3. Selecting features that generalize to unseen data

Why you shouldn’t use too many features

Using too many features:

  • increases the risk of data leakage
  • can cause overfitting
  • can increase memory requirements during training and inference
  • increases inference latency
  • increases technical debt:
    • an outdated feature can degrade model performance
    • when a useless feature is deprecated from the model, any features that depend on it also need to be adjusted

Selecting important features

Use plotting tools to see which features contribute most to the model's predictions, such as:

  • SHAP plots
  • Feature importance plots (if provided by the model package)
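
As a rough illustration of the second option, the sketch below ranks features by the importance scores that a model package exposes. It assumes scikit-learn and a random forest, neither of which is named above. The synthetic dataset and the feature names (`f0` to `f5`) are hypothetical, chosen so that only the first two columns carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): with shuffle=False the
# first 2 columns are informative and the remaining 4 are pure noise.
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances exposed by the model package; they sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The sorted `ranked` list can be passed to any bar-chart tool to produce the plot. Features near the bottom of the ranking are candidates for removal, though impurity-based scores are only one view of importance and are worth cross-checking against SHAP values.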

Selecting features that generalize to unseen data

| Aspect | Description |
|---|---|
| Coverage | The feature should be available for most of your data. An exception to this is if a feature has high predictive power when present, and is confirmed to not be a leaky feature. |
| Value distribution | The distribution of the feature value should be the same between training vs val/test/inference sets. [1] |
| Availability at inference time | The feature should be available at inference time. |
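
The three checks above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a prescribed method: the helper names, the example feature names, and the sample values are all hypothetical, and the two-sample Kolmogorov-Smirnov statistic is just one possible way to compare distributions.

```python
from bisect import bisect_right


def coverage(values):
    """Fraction of rows where the feature is present (not None)."""
    return sum(v is not None for v in values) / len(values)


def ks_statistic(train, other):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0 = identical samples, 1 = no overlap)."""
    a, b = sorted(train), sorted(other)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Coverage: hypothetical "age" column with 2 missing values out of 10.
age = [23, 35, None, 41, 29, 52, 38, None, 47, 31]
print("coverage:", coverage(age))

# Value distribution: compare training values against a shifted sample.
train_values = [1, 2, 3, 4]
inference_values = [3, 4, 5, 6]
print("KS statistic:", ks_statistic(train_values, inference_values))

# Availability at inference time: features the model was trained on that
# the serving system cannot provide (hypothetical feature names).
training_features = {"age", "country", "days_since_signup", "refund_issued"}
serving_features = {"age", "country", "days_since_signup"}
missing_at_inference = training_features - serving_features
print("missing at inference:", sorted(missing_at_inference))
```

What counts as acceptable coverage or an acceptable distribution gap depends on the dataset and the model, so any thresholds applied to these numbers are a judgment call rather than something this sketch defines.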

Footnotes

  1. See Data Drift.