[Diagram: Data sets A to E; Feature Eng 1 to 3; Train Model 1 to 3, each followed by Score]
Modelling in silos - models built and deployed in isolation
Feature engineering is in a silo - no reuse between model builds
[Diagram: Data sets A to E; Feature Eng A to E; Feature Store (Ivory); Train Model 1 to 3, each followed by Score]
Modelling in silos - models built and deployed in isolation
Feature engineering is done once
Features are reused across model builds
Extensible dictionary
• Support for rich attribute types (e.g. structs, arrays)
• Arbitrary attribute metadata
• Specification of valid attribute values
  • e.g. ‘M’ and ‘F’ only for gender
• Improved validation
• Improved on-disk representation
• Useful for downstream applications, e.g. plots
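As a minimal sketch of how declared valid values could drive validation: the names below (`Attribute`, `validate`, and the field layout) are hypothetical and chosen for illustration only; they are not Ivory's actual dictionary format or API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Attribute:
    """Hypothetical dictionary entry; not Ivory's real schema."""
    name: str
    type: str                                       # e.g. "string", "struct", "array"
    valid_values: Optional[FrozenSet[Any]] = None   # None means unrestricted
    metadata: Dict[str, Any] = field(default_factory=dict)  # arbitrary metadata


def validate(attr: Attribute, value: Any) -> bool:
    """Return True if `value` is allowed for this attribute."""
    if attr.valid_values is None:
        return True
    return value in attr.valid_values


gender = Attribute(
    name="gender",
    type="string",
    valid_values=frozenset({"M", "F"}),
    metadata={"plot": "bar"},  # something a downstream plotting tool might read
)

print(validate(gender, "M"))
print(validate(gender, "X"))
```

Keeping the valid values next to the attribute definition means ingestion checks and downstream tools can read the same declaration.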
Lazy feature generation
• Lazily generate features derived from existing facts on extract (chord/snapshot)
• Derived “meta” features (i.e. ‘select’)
• Windowing functions (e.g. “average over last 3 months”)
• Row-level features
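The windowing idea can be sketched as a function that runs only at extract time over the stored facts. This is an illustration, not Ivory's implementation: `Fact`, `window_average`, and the sample data are hypothetical, and "3 months" is approximated as 90 days.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, NamedTuple


class Fact(NamedTuple):
    """Hypothetical stored fact: one value for one entity on one day."""
    entity: str
    attribute: str
    value: float
    day: date


def window_average(
    facts: Iterable[Fact], attribute: str, snapshot: date, days: int = 90
) -> Dict[str, float]:
    """Average `attribute` per entity over the `days` before `snapshot`.

    Nothing is precomputed: the feature is derived when the snapshot
    is extracted.
    """
    start = snapshot - timedelta(days=days)
    values: Dict[str, List[float]] = {}
    for f in facts:
        # Window is (start, snapshot]: exclusive start, inclusive end.
        if f.attribute == attribute and start < f.day <= snapshot:
            values.setdefault(f.entity, []).append(f.value)
    return {entity: sum(vs) / len(vs) for entity, vs in values.items()}


facts = [
    Fact("cust1", "spend", 10.0, date(2014, 1, 15)),  # falls outside the window
    Fact("cust1", "spend", 20.0, date(2014, 5, 1)),
    Fact("cust1", "spend", 40.0, date(2014, 6, 1)),
    Fact("cust2", "spend", 5.0, date(2014, 6, 10)),
]

result = window_average(facts, "spend", snapshot=date(2014, 6, 30))
print(result)  # {'cust1': 30.0, 'cust2': 5.0}
```

Because only raw facts are stored, changing the window length or the snapshot date needs no re-ingestion.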
[Diagram: Data sets A to E; Feature Store (Ivory) with Feature Eng (Ivory); Train Model 1 to 3, each followed by Score]
Modelling in silos - models built and deployed in isolation
Source data loaded directly
Feature engineering is integrated and lazily generated on extraction
Repository forking
• Low-cost cloning/forking of repositories:
  • “master” production repo
  • “experimental” cloned repo
• Allow a data scientist to join production features with their own without affecting production operations
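One way to picture a low-cost fork is as an overlay on the production repository: reads fall through to "master", writes stay in the fork. The sketch below models this in memory with Python's `collections.ChainMap`. It is an analogy under that assumption, not a description of how Ivory stores or clones repositories, and the feature names and `extract` helper are hypothetical.

```python
from collections import ChainMap

# Production ("master") repo: feature name -> {entity: value}
master = {
    "age": {"cust1": 34, "cust2": 51},
    "spend_90d": {"cust1": 30.0, "cust2": 5.0},
}

# Forking is cheap: new_child() adds an empty layer on top of master
# instead of copying it.
experimental = ChainMap(master).new_child()

# The data scientist adds their own feature; it lands in the fork's layer only.
experimental["churn_score"] = {"cust1": 0.2, "cust2": 0.7}


def extract(repo, features, entity):
    """Join the requested features for one entity into a single row."""
    return {name: repo[name].get(entity) for name in features}


row = extract(experimental, ["age", "spend_90d", "churn_score"], "cust1")
print(row)                      # {'age': 34, 'spend_90d': 30.0, 'churn_score': 0.2}
print("churn_score" in master)  # False: production is untouched
```

The experimental view joins production features with the new one, while the production mapping never sees the experimental feature.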