Grounding Big Data

Grounding Big Data. Joe Hellerstein and Vikram Sreekanti, UC Berkeley

REMEMBERING THE PAST: Data Warehouse, Single Source of Truth, Enterprise Information Architecture, Golden Master … Truth

REMEMBERING THE PAST: Truth. “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse)

Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging. The real challenge now is the meaning and value of data, which depend critically on context.

A broader context for big data ground

OUTLINE: Putting Big Data in Context. Motivation: What is Different? Ground: Data Context Services. Examples. Challenges.

Isn’t this just metadata? Metadata: the last thing anybody wants to work on.

CONTEXT IS SO MUCH MORE. Metadata: the last thing anybody wants to work on. Data context services: the final frontier.

WHAT IS DIFFERENT? Shift in technology: data representations. Shift in behavior: data-driven organizations.

Shift in behavior: data-driven organizations

Data in products: started with the Internet. Now, the Internet of Things.

Data in marketing (GARTNER GROUP). By 2017: marketing spends more on tech than IT does. By 2020: 90% of IT budget controlled outside of IT.

MANY USE CASES MANY CONSTITUENCIES MANY INCENTIVES MANY CONTEXTS

WHAT IS DIFFERENT? Shift in technology: data representations. Shift in behavior: data-driven organizations.

Shift in technology: data representations

Raw data in the data lake: simplifies capture, encourages exploration. What does it mean? It depends on the context.

A LITTLE SCENARIO HDFS

BITS: A web log from a retail site

BITS: All the web logs from last year

VIEWS, MODELS, CODE: A script to extract orders, to be used for Market Basket analysis.

VIEWS, MODELS, CODE: A Hive table of orders, to be used for Market Basket analysis.

BITS: All the web logs from last year

VIEWS, MODELS, CODE: Code to extract abandoned user sessions

VIEWS, MODELS, CODE: A retargeting model

VIEWS, MODELS, CODE: A Hive table of orders. A retargeting model.

MANY SCRIPTS MANY MODELS MANY APPLICATIONS MANY CONTEXTS

OUTLINE: Putting Big Data in Context. Motivation: What is Different? Ground: Data Context Services. Examples. Challenges.

THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT
• Application Context: views, models, code
• Behavioral Context: data lineage & usage
• Historical Context: in and over time

APPLICATION CONTEXT: Metadata. Models for interpreting the data for use:
• Data structures
• Semantic structures
• Statistical structures
Theme: An unopinionated model of context

HISTORICAL CONTEXT: Versions. Web logs; code to extract user/movie rentals; recommender for movie licensing.
• Point in time: A promising new movie is similar to older hot movies at time of release!
• Trends over time: How does a movie with these features fare over time?

BEHAVIORAL CONTEXT: Lineage & Usage. Why Dora?!

BEHAVIORAL CONTEXT: Lineage & Usage
• Data Science Recommenders: “You should compare with book sales from last year.”
• Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
• Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
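A minimal sketch of the proactive impact analysis idea, assuming lineage is available as a simple mapping from each artifact to the artifacts derived from it. The artifact names and the `downstream` helper are hypothetical illustrations, not part of Ground.

```python
from collections import deque

# Hypothetical lineage edges: artifact -> artifacts derived from it.
LINEAGE = {
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
    "weather_data": ["logistics_report"],
}


def downstream(lineage, changed):
    """Return every artifact reachable from `changed`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reported via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


impacted = downstream(LINEAGE, "twitter_analysis_script")
print(impacted)
```

With edges like these, a change to the analysis script surfaces the dashboard as something to re-check, which is the kind of tip quoted on the slide.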

THE BIG CONTEXT: A NEW WORLD NEEDS NEW SERVICES

Putting Big Data in Context Ground: Data Context Services Examples

WHAT ARE WE BUILDING? Grounding philosophy:
• Start useful, stay useful.
• Stay general.
• Design for scale.

ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND: CONTEXT MODEL
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
Example services shown: Pachyderm, Chronos

COMMON GROUND: An unopinionated context model. Models, Versions, Usage.

COMMON GROUND: The metamodel. Models: Model Graphs. Versions. Usage.

MODEL GRAPHS (figure). JSON DOCUMENT: a Root with objects, members (such as k1: string, k2: number, k11: string, k12) and array elements 1, 2, 3. RELATIONAL SCHEMA: Schema 1; Table 1 with Column 1, Column c; Table t with Column 1, Column d; a foreign key between columns.
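As a rough illustration of a model graph, the sketch below flattens a JSON document into typed nodes and labeled "member" and "element" edges, mirroring the labels in the figure. The function name, the path-style node IDs, and the sample document are assumptions made for this example; this is not Ground's actual representation.

```python
import json


def json_to_model_graph(doc, root="Root"):
    """Flatten a parsed JSON value into (nodes, edges).

    nodes maps a path-style node ID to a kind ("object", "array", or a
    Python type name); edges are (parent, child, label) triples.
    """
    nodes, edges = {}, []

    def visit(value, path):
        if isinstance(value, dict):
            nodes[path] = "object"
            for key, child in value.items():
                child_path = f"{path}.{key}"
                edges.append((path, child_path, "member"))
                visit(child, child_path)
        elif isinstance(value, list):
            nodes[path] = "array"
            for i, child in enumerate(value, start=1):
                child_path = f"{path}[{i}]"
                edges.append((path, child_path, "element"))
                visit(child, child_path)
        else:
            nodes[path] = type(value).__name__

    visit(doc, root)
    return nodes, edges


# Hypothetical document, loosely shaped like the one in the figure.
doc = json.loads('{"k1": "a", "k2": {"k11": "b", "k12": [1, 2, 3]}}')
nodes, edges = json_to_model_graph(doc)
for parent, child, label in edges:
    print(f"{parent} -[{label}]-> {child}")
```

A relational schema could be flattened the same way, with schema, table, and column nodes and an extra edge for each foreign key.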

COMMON GROUND: The versioning model. Models: Model Graphs. Versions: Version Graphs. Usage.

COMMON GROUND: The model. Models: Model Graphs. Versions: Version Graphs. Usage.

VERSION GRAPHS: Directed Acyclic Graphs (partial orders). Figure labels: "In this order", "In no particular order". Version IDs shown:
a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0
0e9233e8e99cccd6861d304968efa4c945a0b918
3e64220f08374629ad43ca652d4ce7cef0bdbbca
3e0bada008655fe32d7d136eac0a3f333d23ed80
fd75a4ba16f96d11f3f954854acc2d739054233
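The partial-order point can be made concrete with a small sketch. It assumes a version graph stored as a mapping from each version to its parent versions; the short IDs `v1` to `v4` and both helper functions are hypothetical. Two versions are ordered only when one is an ancestor of the other; otherwise they are "in no particular order".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical version graph: version ID -> set of parent version IDs.
PARENTS = {
    "v1": set(),
    "v2": {"v1"},
    "v3": {"v1"},
    "v4": {"v2", "v3"},  # a merge of two concurrent versions
}


def ancestors(parents, version):
    """All versions that `version` transitively derives from."""
    seen = set()
    stack = list(parents[version])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen


def comparable(parents, a, b):
    """True if the partial order relates a and b (one precedes the other)."""
    return a in ancestors(parents, b) or b in ancestors(parents, a)


# One linear order consistent with the DAG; parents always come first.
order = list(TopologicalSorter(PARENTS).static_order())
print(order)
print(comparable(PARENTS, "v1", "v4"))
print(comparable(PARENTS, "v2", "v3"))
```

Here `v2` and `v3` both follow `v1` but are unordered relative to each other, which is why a DAG rather than a linear history is the natural structure.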

COMMON GROUND: The usage model. Models: Model Graphs. Versions: Version Graphs. Usage: Usage Graphs (Lineage).

USAGE GRAPHS: Everything can participate in usage.

COMMON GROUND: The model. Models: Model Graphs. Versions: Version Graphs. Usage: Usage Graphs (Lineage).

Putting Big Data in Context Ground: Data Context Services Examples

GROUND vZERO
Goals:
• Exercise flexibility of Common Ground
• Proof of concept for starting useful
Examples:
• Grit: Ground+Git
• Apiary: A Grounded Hive Metastore

GRIT SCENARIO: CS 186 @ Berkeley. 500 students.
• Now the 6th-largest upper-division course at Berkeley
CS 186 homework submissions through GitHub.
Goal: Track students’ submission history. E.g.:
• Track tardiness
• Prevent submission time spoofing
• Long-term: Analyze homework turn-in patterns
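A hedged sketch of how the tardiness and spoofing checks might look, assuming each submission record carries two timestamps: the time the commit claims (which the author controls) and the time a tracking service actually received the push. The field names, the deadline, and the `classify` helper are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical homework deadline, for illustration only.
DEADLINE = datetime(2016, 3, 1, 23, 59, tzinfo=timezone.utc)


def classify(push, deadline=DEADLINE):
    """Flag a submission using the received time, not the claimed time.

    push: dict with 'claimed_at' (timestamp recorded in the commit) and
    'received_at' (timestamp observed when the push arrived).
    """
    flags = []
    late = push["received_at"] > deadline
    if late:
        flags.append("tardy")
    if late and push["claimed_at"] <= deadline:
        # The commit claims to be on time but arrived late: worth a look.
        flags.append("check-timestamp")
    return flags


on_time = {
    "claimed_at": datetime(2016, 3, 1, 20, 0, tzinfo=timezone.utc),
    "received_at": datetime(2016, 3, 1, 20, 1, tzinfo=timezone.utc),
}
suspicious = {
    "claimed_at": datetime(2016, 3, 1, 23, 0, tzinfo=timezone.utc),
    "received_at": datetime(2016, 3, 2, 9, 30, tzinfo=timezone.utc),
}
print(classify(on_time))
print(classify(suspicious))
```

A commit made before the deadline and pushed afterwards is flagged too, so the second flag is a prompt for review rather than proof of spoofing.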

TECHNICAL ISSUES
Ideally, you track git history as it’s created:
• GitHub’s Webhooks API!
• Reports back to Ground on every push.
Unfortunately, some wrinkles here:
• Webhooks API doesn’t report the full version lineage
• So can’t rely on GitHub. Track the git repo ourselves.
• A topic for future collaboration perhaps. (FWIW, Google Docs is even messier!)
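One way to "track the git repo ourselves" is to read the commit graph straight from a local clone. The sketch below is an assumption about how that could be done, not how Grit does it: it relies on `git log --all --format="%H %P"`, which prints each commit hash followed by its parent hashes, and parses that text into a parent map. The function names and the shortened sample hashes are hypothetical.

```python
import subprocess


def parse_parent_log(text):
    """Parse lines of '<commit> <parent> <parent> ...' into {commit: [parents]}."""
    parents = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            parents[fields[0]] = fields[1:]
    return parents


def read_repo_lineage(repo_path):
    """Run git in `repo_path` and return the full commit lineage."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--format=%H %P"],
        capture_output=True, text=True, check=True,
    )
    return parse_parent_log(result.stdout)


# Parsing a hand-written sample, so the sketch runs without a repository.
sample = "c3 c1 c2\nc2 c0\nc1 c0\nc0\n"
lineage = parse_parent_log(sample)
merges = [commit for commit, ps in lineage.items() if len(ps) > 1]
print(lineage)
print(merges)
```

The resulting parent map has the same shape as a version graph, so merge commits and branching histories are preserved rather than flattened.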

APIARY: A Grounded Hive Metastore
• Hive Metastore: the de facto catalog for structured big data
• Apiary: Ground as the backing store for Hive Metastore
• Relational catalog: a design pattern above our basic context model
(Figure: Schema 1; Table 1 with Column 1, Column c; Table t with Column 1, Column d; foreign key)

PUTTING IT TOGETHER: Apiary (versioned metadata storage) + HDFS (append-mostly data storage) + versioned code = Fully-versioned context store!

APIARY + GRIT + HDFS = Different versions of metadata (Apiary) + different versions of code (Grit) + different versions of data (HDFS)

OUTLINE: Putting Big Data in Context. Motivation: What is Different? Ground: Data Context Services. Examples. Challenges.

INITIAL FOCUS AREAS (highlighted on the COMMON GROUND diagram)

INITIAL FOCUS AREAS: Parsing & Featurization, Model Serving, Reproducibility

INITIAL FOCUS AREAS: Versioned Storage

BROAD ARENA FOR INNOVATION AT ALL LEVELS!
ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
COMMON GROUND
UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

GROUNDING BIG DATA WITH CONTEXT SERVICES
It’s time to establish a bigger context for big data.
• Historical context: because things change
• Behavioral context: because behavior determines meaning
• Application context: because truth is subjective

INPUT AND FURTHER CONTEXT

Learn more at: http://www.ground-metadata.org @joe_hellerstein @vsreekanti
