Talk given by Ben Lever at Strata+Hadoop World in London, 6 May 2015:
Feature engineering is a critical and time-consuming activity in the development and deployment of any modelling pipeline. The burden grows as data science teams seek to incorporate new data sources at a scale far larger than previously employed. Furthermore, the transition to production environments is fraught with complexity, as these pipelines are exposed to the dynamic and fragile world of ongoing data feeds, data corrections, and evolving data models.
In this talk we will introduce Ivory, a new open-source, Hadoop-based data store that seeks to address these challenges. Ivory is a scalable and extensible data store for storing facts and extracting features. It is optimised specifically for the feature engineering stages of modelling pipelines, simultaneously simplifying and adding rigour to them.
This session will walk through an example of how Ivory can be used in a typical data scientist’s workflow, and then how that extends to migrating pipelines into production. It will cover the basic concepts of Ivory, including repositories, the dictionary, the fact-based data model, and virtual features. It will also demonstrate the benefits of Ivory being an immutable data store and the unique opportunities this creates.
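To make the idea of a fact-based, immutable store more concrete, the following is a minimal Python sketch. It is not Ivory's actual API or storage format: the `Fact` and `FactStore` names, the entity/attribute/value/date shape of a fact, and the `snapshot` method are all assumptions made for illustration. It shows only the general pattern of appending facts without ever modifying them, then extracting the latest known value per entity and attribute as of a given date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One observation: a value for an entity's attribute on a date.

    Hypothetical shape, assumed for illustration only.
    """
    entity: str
    attribute: str
    value: object
    as_of: date


class FactStore:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self, facts=()):
        self._facts = tuple(facts)

    def append(self, new_facts):
        # Return a new store version; the existing version is left untouched.
        return FactStore(self._facts + tuple(new_facts))

    def snapshot(self, when):
        """Latest value per (entity, attribute) among facts dated on or before `when`."""
        latest = {}
        for fact in self._facts:
            if fact.as_of > when:
                continue
            key = (fact.entity, fact.attribute)
            # On equal dates, the most recently appended fact wins,
            # which lets a later ingest act as a correction.
            if key not in latest or fact.as_of >= latest[key].as_of:
                latest[key] = fact
        return {key: fact.value for key, fact in latest.items()}


store_v1 = FactStore([
    Fact("cust-1", "balance", 100, date(2015, 1, 1)),
    Fact("cust-1", "balance", 250, date(2015, 3, 1)),
])

# A data correction arrives later; it is appended rather than applied in place.
store_v2 = store_v1.append([Fact("cust-1", "balance", 120, date(2015, 1, 1))])

features_before = store_v1.snapshot(date(2015, 2, 1))
features_after = store_v2.snapshot(date(2015, 2, 1))
```

Because `store_v1` is never modified, a feature extraction run against it can be repeated exactly even after the correction has been ingested into `store_v2`. That kind of reproducibility is one plausible example of what an immutable store makes possible.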