Grounding Big Data

We present the motivation behind building an open-source data context service in the big data ecosystem and discuss our initial work on the Ground project at U.C. Berkeley.

Vikram Sreekanti

March 30, 2016

Transcript

  1. REMEMBERING THE PAST: Data Warehouse; Single Source of Truth; Enterprise Information Architecture; Golden Master; … Truth
  2. REMEMBERING THE PAST: “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse) Truth
  3. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging.
  4. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging. The real challenge now is the meaning and value of data, which depend critically on context.
  5. Isn’t this just metadata? Metadata: The last thing anybody wants to work on.
  6. CONTEXT IS SO MUCH MORE. Metadata: The last thing anybody wants to work on. Data context services: The final frontier.
  7. Data in marketing (GARTNER GROUP). By 2017: marketing spends more on tech than IT does. By 2020: 90% of IT budget controlled outside of IT.
  8. Raw data in the data lake: Simplifies capture. Encourages exploration. What does it mean? It depends on the context.
  9. VIEWS, MODELS, CODE: A Hive table of orders. To be used for Market Basket analysis.
  10. OUTLINE: Putting Big Data in Context • Ground: Data Context Services • Examples • Challenges • Motivation: What is Different?
  11. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT. Application Context: Views, models, code. Behavioral Context: Data lineage & usage. Historical Context: In and over time.
  12. APPLICATION CONTEXT: Metadata. Models for interpreting the data for use: • Data structures • Semantic structures • Statistical structures. Theme: An unopinionated model of context.
  13. HISTORICAL CONTEXT: Versions. Web logs; Code to extract user/movie rentals; Recommender for movie licensing. Point in time: A promising new movie is similar to older hot movies at time of release! Trends over time: How does a movie with these features fare over time?
  14. BEHAVIORAL CONTEXT: Lineage & Usage. Data Science Recommenders: “You should compare with book sales from last year.” Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.” Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
  15. OUTLINE: Putting Big Data in Context • Ground: Data Context Services • Examples • Challenges • Motivation: What is Different?
  16. WHAT ARE WE BUILDING? Grounding philosophy: • Start useful, stay useful. • Stay general. • Design for scale.
  17. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth.
  18. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth. Named services: Pachyderm, Chronos.
  19. [Diagram: two model graphs. RELATIONAL SCHEMA: Schema 1; Table 1 (Column 1, Column c); Table t (Column 1, Column d); foreign key. JSON DOCUMENT: Root; Object 2; members k1: string, k2: number, k11: string, k12; two arrays of elements 1, 2, 3.] Labels: Models, Versions, Usage.
  20. COMMON GROUND: The versioning model. [Diagram labels: Models, Versions, Usage; Model Graphs, Version Graphs]
  21. COMMON GROUND: The model. [Diagram labels: Models, Versions, Usage; Model Graphs, Version Graphs]
  22. COMMON GROUND: The usage model. [Diagram labels: Models, Versions, Usage; Model Graphs, Version Graphs] Usage Graphs: Lineage.
  23. USAGE GRAPHS: Everything can participate in usage. [Diagram labels: Models, Versions, Usage]
  24. OUTLINE: Putting Big Data in Context • Ground: Data Context Services • Examples • Challenges • Motivation: What is Different?
  25. GROUND vZERO. Goals: Exercise flexibility of Common Ground; Proof of concept for starting useful. Examples: Grit: Ground+Git; Apiary: A Grounded Hive Metastore.
  26. GRIT SCENARIO: CS 186 @ Berkeley. 500 students • Now the 6th-largest upper-division course at Berkeley. CS 186 homework submissions through GitHub. Goal: Track students’ submission history. E.g.: • Track tardiness • Prevent submission time spoofing • Long-term: Analyze homework turn-in patterns
  27. TECHNICAL ISSUES: Ideally, you track git history as it’s created • GitHub’s Webhooks API! • Reports back to Ground on every push. Unfortunately, some wrinkles here • Webhooks API doesn’t report the full version lineage • So can’t rely on GitHub. Track the git repo ourselves. • A topic for future collaboration perhaps. (FWIW, Google Docs is even messier!)
  28. APIARY: A Grounded Hive Metastore. [Diagram: Schema 1; Table 1 (Column 1, Column c); Table t (Column 1, Column d); foreign key.] Apiary: Ground as the backing store for Hive Metastore. Relational catalog: a design pattern above our basic context model. Hive Metastore: the de facto catalog for structured big data.
  29. PUTTING IT TOGETHER: APIARY + HDFS + Fully-versioned context store! Versioned metadata storage; append-mostly data storage; versioned code.
  30. APIARY + GRIT + HDFS: Different versions of metadata (Apiary) + Different versions of code (Grit) + Different versions of data (HDFS).
  31. OUTLINE: Putting Big Data in Context • Ground: Data Context Services • Examples • Challenges • Motivation: What is Different?
  32. INITIAL FOCUS AREAS. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth.
  33. INITIAL FOCUS AREAS: Parsing & Featurization; Model Serving; Reproducibility. ABOVEGROUND API TO APPLICATIONS: Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth.
  34. INITIAL FOCUS AREAS: Versioned Storage. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; ID & Auth.
  35. BROAD ARENA FOR INNOVATION AT ALL LEVELS! ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth.
  36. GROUNDING BIG DATA WITH CONTEXT SERVICES. It’s time to establish a bigger context for big data. Application context: Because truth is subjective. Behavioral context: Because behavior determines meaning. Historical context: Because things change.
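
The three Common Ground layers named on slides 19 to 23 (model graphs, version graphs, and usage graphs for lineage) can be illustrated with a small in-memory sketch, reusing the impact-analysis example from slide 14. This is only an illustration under stated assumptions, not Ground's actual API: the `ContextStore` class, its method names, and the `sentiment_scores` item are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Toy context store. All names are hypothetical, not Ground's API."""
    # Model graph: item -> child items (e.g. schema -> table -> column).
    children: dict = field(default_factory=lambda: defaultdict(list))
    # Version graph: item -> list of (version_id, parent_version_id or None).
    versions: dict = field(default_factory=lambda: defaultdict(list))
    # Usage graph (lineage): (item, version) -> set of (item, version) derived from it.
    derived: dict = field(default_factory=lambda: defaultdict(set))

    def add_child(self, parent, child):
        """Record structure, such as a table belonging to a schema."""
        self.children[parent].append(child)

    def commit(self, item, version_id, parent=None):
        """Record a new version of an item, optionally linked to its parent version."""
        self.versions[item].append((version_id, parent))
        return (item, version_id)

    def record_usage(self, source, target):
        """Record that `target` was produced by reading `source`."""
        self.derived[source].add(target)

    def history(self, item, version_id):
        """Walk parent links back from a version to the first one."""
        parents = dict(self.versions[item])
        chain = []
        while version_id is not None:
            chain.append(version_id)
            version_id = parents[version_id]
        return chain

    def downstream(self, start):
        """Breadth-first walk of the usage graph from one versioned item."""
        seen, queue = [], deque([start])
        while queue:
            for nxt in sorted(self.derived[queue.popleft()]):
                if nxt not in seen:
                    seen.append(nxt)
                    queue.append(nxt)
        return seen


store = ContextStore()
store.add_child("schema_1", "table_1")    # model graph
store.add_child("table_1", "column_1")

script_v1 = store.commit("twitter_analysis.py", "v1")            # version graph
script_v2 = store.commit("twitter_analysis.py", "v2", parent="v1")
scores = store.commit("sentiment_scores", "v1")                  # hypothetical output
dashboard = store.commit("boss_dashboard", "v1")

store.record_usage(script_v2, scores)     # usage graph (lineage)
store.record_usage(scores, dashboard)

print(store.history("twitter_analysis.py", "v2"))
# ['v2', 'v1']
print(store.downstream(script_v2))
# [('sentiment_scores', 'v1'), ('boss_dashboard', 'v1')]
```

Keying usage edges by (item, version) rather than by item alone is one possible way to let everything participate in usage: a change to a specific version of the script can be traced to the specific outputs built from it, which is what a proactive impact-analysis alert would need.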