We present the motivation behind building an open-source data context service in the big data ecosystem and discuss our initial work on the Ground project at U.C. Berkeley.
REMEMBERING THE PAST
“There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse)
Truth
Big data took us to a new world. The changes in volume, velocity and variety were challenging. The real challenge now is the meaning and value of data, which depend critically on context.
THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
• Application Context: Views, models, code
• Behavioral Context: Data lineage & usage
• Historical Context: In and over time
APPLICATION CONTEXT
Metadata
Models for interpreting the data for use
• Data structures
• Semantic structures
• Statistical structures
Theme: An unopinionated model of context
HISTORICAL CONTEXT
Versions
[Figure: pipeline from Web logs, through Code to extract user/movie rentals, to a Recommender for movie licensing]
• Point in time: A promising new movie is similar to older hot movies at time of release!
• Trends over time: How does a movie with these features fare over time?
BEHAVIORAL CONTEXT
Lineage & Usage
• Data Science Recommenders: “You should compare with book sales from last year.”
• Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
• Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: Context Model
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
[Architecture diagram repeated, with example services added: Pachyderm, Chronos]
[Figure: model graphs. RELATIONAL SCHEMA: Schema 1 contains Table 1 (Column 1 to Column c) and Table t (Column 1 to Column d), linked by a foreign key. JSON DOCUMENT: a Root whose nested objects have members (k1: string, k2: number, k11: string, k12) and arrays of elements (element 1 to element 3).]
VERSION GRAPHS
Directed Acyclic Graphs (partial orders)
[Figure: versions identified by hexadecimal hashes (e.g. 0e9233e8e99cccd6861d304968efa4c945a0b918), some related "in this order" and others "in no particular order"]
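To make the "partial order" point concrete, here is a minimal Python sketch of a version graph. It is an illustration only, not Ground's actual data model or API: the class `VersionGraph`, its methods, and the version ids `v1` to `v4` are all hypothetical names chosen for this example. Two versions are "in this order" when one is an ancestor of the other, and "in no particular order" otherwise.

```python
class VersionGraph:
    """Hypothetical sketch: versions are nodes, edges point from child to parents (a DAG)."""

    def __init__(self):
        self.parents = {}  # version id -> set of parent version ids

    def add_version(self, version_id, parent_ids=()):
        self.parents.setdefault(version_id, set()).update(parent_ids)
        for parent in parent_ids:
            self.parents.setdefault(parent, set())  # make sure parents are known nodes

    def ancestors(self, version_id):
        """All versions reachable by following parent edges."""
        seen = set()
        stack = list(self.parents.get(version_id, ()))
        while stack:
            version = stack.pop()
            if version not in seen:
                seen.add(version)
                stack.extend(self.parents.get(version, ()))
        return seen

    def compare(self, a, b):
        """Return how a relates to b under the partial order."""
        if a == b:
            return "same"
        if a in self.ancestors(b):
            return "before"
        if b in self.ancestors(a):
            return "after"
        return "unordered"  # neither is an ancestor of the other


# Hypothetical history: v2 and v3 branch from v1 and are merged into v4.
graph = VersionGraph()
graph.add_version("v1")
graph.add_version("v2", ["v1"])
graph.add_version("v3", ["v1"])
graph.add_version("v4", ["v2", "v3"])

print(graph.compare("v1", "v4"))  # before
print(graph.compare("v2", "v3"))  # unordered
```

The two branch versions are incomparable, which is why a linear version number is not enough and a DAG is needed.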
COMMON GROUND
The usage model
• Model Graphs
• Version Graphs
• Usage Graphs: Lineage
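As a rough illustration of what a lineage graph enables, the sketch below walks a usage graph downstream from a changed item, in the spirit of the proactive impact analysis example earlier in the deck. Everything here is assumed for illustration: the function `downstream`, the dictionary layout, and the node names (`twitter_analysis_script`, `boss_dashboard`, and so on) are hypothetical and not part of Ground.

```python
from collections import deque


def downstream(lineage, changed):
    """Return everything transitively derived from `changed`, in breadth-first order.

    `lineage` maps each item to the items derived directly from it.
    """
    affected = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Hypothetical lineage edges: source -> things derived from it.
lineage = {
    "twitter_feed": ["twitter_analysis_script"],
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
}

print(downstream(lineage, "twitter_analysis_script"))
# ['sentiment_table', 'boss_dashboard', 'weekly_report']
```

A service holding this graph could notify the owners of every affected item when the script changes.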
TECHNICAL ISSUES
Ideally, you track git history as it’s created
• GitHub’s Webhooks API!
• Reports back to Ground on every push.
Unfortunately, some wrinkles here
• Webhooks API doesn’t report the full version lineage
• So can’t rely on GitHub. Track the git repo ourselves.
• A topic for future collaboration perhaps.
(FWIW, Google Docs is even messier!)
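One way to "track the git repo ourselves" is to read parent pointers straight from a local clone. The sketch below is an assumption about how that could look, not the Ground project's implementation: the helpers `read_git_log` and `parse_git_log` are hypothetical names. It relies only on standard git behavior, where `git log --all --pretty=format:"%H %P"` prints each commit hash followed by its parent hashes.

```python
import subprocess


def read_git_log(repo_path):
    """Run git in a local clone and return one 'commit parent...' line per commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--pretty=format:%H %P"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(log_text):
    """Parse '%H %P' lines into {commit: [parent, ...]}, i.e. the full version lineage."""
    parents = {}
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        commit, *commit_parents = fields
        parents[commit] = commit_parents
    return parents


# Made-up sample output with short ids: c3 merges c1 and c2, which both descend from c0.
sample = """\
c3 c1 c2
c2 c0
c1 c0
c0
"""
lineage = parse_git_log(sample)
merges = [commit for commit, ps in lineage.items() if len(ps) > 1]
roots = [commit for commit, ps in lineage.items() if not ps]
print(merges)  # ['c3']
print(roots)   # ['c0']
```

In practice one would call `parse_git_log(read_git_log(path))` after each push notification, so the webhook acts as a trigger while the local repository supplies the lineage.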
APIARY
A Grounded Hive Metastore
• Hive Metastore: the de facto catalog for structured big data
• Relational catalog: a design pattern above our basic context model
• Apiary: Ground as the backing store for Hive Metastore
[Figure: Schema 1 contains Table 1 (Column 1 to Column c) and Table t (Column 1 to Column d), linked by a foreign key]
INITIAL FOCUS AREAS
[Architecture diagram repeated, with the following areas highlighted]
• Aboveground: Parsing & Featurization, Model Serving, Reproducibility
• Underground: Versioned Storage
BROAD ARENA FOR INNOVATION AT ALL LEVELS!
GROUNDING BIG DATA WITH CONTEXT SERVICES
It’s time to establish a bigger context for big data.
• Historical context: Because things change
• Behavioral context: Because behavior determines meaning
• Application context: Because truth is subjective