- Avoid using too many features
- Selecting important features
- Selecting features that generalize to unseen data
## Why you shouldn't use too many features
Too many features:
- increase the possibility of data leakage
- can cause overfitting
- can increase memory requirements during training and inference
- increase inference latency
- increase technical debt:
  - an outdated feature can degrade model performance
  - when deprecating a useless feature, any features that depend on it also need to be adjusted
## Selecting important features
Use plotting tools such as the following (see the sketch after this list):
- SHAP plots
- Feature importance plots (if provided by the model package)
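A minimal sketch of both approaches, assuming a scikit-learn tree ensemble and the `shap` package; the toy dataset and `f0`…`f7` column names are placeholders for your own data.

```python
import pandas as pd
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data; substitute your own training set.
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
model = RandomForestRegressor(random_state=0).fit(X, y)

# Feature importance as reported by the model package itself
# (impurity-based for tree ensembles).
for name, score in sorted(zip(X.columns, model.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")

# SHAP summary plot: per-feature contribution to each prediction.
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X), X)
```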
## Selecting features that generalize to unseen data
| Aspect | Description |
| --- | --- |
| Coverage | The feature should be available for most of your data. An exception is a feature that has high predictive power when present and is confirmed not to be a leaky feature. |
| Value distribution | The distribution of the feature's values should be the same across the training and val/test/inference sets.[^1] |
| Availability at inference time | The feature must be available at inference time. |
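A minimal sketch of the coverage and value-distribution checks above, assuming numeric features held in pandas Series; `check_feature`, `min_coverage`, and `alpha` are hypothetical names, and the two-sample Kolmogorov-Smirnov test is one common choice of distribution comparison, not the only one.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def check_feature(train: pd.Series, other: pd.Series,
                  min_coverage: float = 0.9, alpha: float = 0.05) -> dict:
    """Flag a feature that is mostly missing or whose distribution
    shifts between the training set and a val/test/inference set."""
    coverage = train.notna().mean()
    # Two-sample KS test on the non-missing values; a small p-value
    # suggests the two distributions differ.
    stat, p_value = ks_2samp(train.dropna(), other.dropna())
    return {
        "coverage_ok": coverage >= min_coverage,
        "distribution_ok": p_value >= alpha,
        "coverage": coverage,
        "ks_p_value": p_value,
    }

# Toy usage: a shifted feature should fail the distribution check.
rng = np.random.default_rng(0)
train = pd.Series(rng.normal(0.0, 1.0, 1000))
test = pd.Series(rng.normal(0.5, 1.0, 1000))
print(check_feature(train, test))
```

A feature that fails the distribution check is a candidate for the data-drift investigation referenced in the footnote.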
[^1]: See Data Drift.