Grounding Big Data

Talk at Strata San Jose 2016 on Context Services for Big Data in general, and the Ground project at Berkeley in particular.

Joe Hellerstein

April 01, 2016

Transcript

  1. REMEMBERING THE PAST: Data Warehouse; Single Source of Truth; Enterprise Information Architecture; Golden Master … Truth
  2. REMEMBERING THE PAST: “There is no point in bringing data … into the data warehouse environment without integrating it.” (Bill Inmon, Building the Data Warehouse) Truth
  3. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging.
  4. Big data took us to a new world. There were changes in volume, velocity and variety, which were challenging. The real challenge now is the meaning and value of data, which depend critically on context.
  5. Isn’t this just metadata? Metadata: The last thing anybody wants to work on.
  6. Data in marketing. By 2017: marketing spends more on tech than IT. (GARTNER GROUP)
  7. Data in marketing. By 2017: marketing spends more on tech than IT. By 2020: 90% of IT budget controlled outside of IT. (GARTNER GROUP)
  8. What does it mean? Raw data in the data lake: Simplifies capture; Encourages exploration.
  9. What does it mean? It depends on the context. Raw data in the data lake: Simplifies capture; Encourages exploration.
  10. VIEWS, MODELS, CODE: A Hive table of orders. To be used for Market Basket analysis.
  12. OUTLINE: Motivation: What is Different?; Putting Big Data in Context; Ground: Data Context Services; Examples; Challenges
  13. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT. Application Context: Views, models, code. Behavioral Context: Data lineage & usage. Historical Context: In and over time.
  14. APPLICATION CONTEXT: Metadata. Models for interpreting the data for use: • Data structures • Semantic structures • Statistical structures. Theme: An unopinionated model of context.
  15. HISTORICAL CONTEXT: Versions. Web logs; Code to extract user/movie rentals; Recommender for movie licensing. Point in time: A promising new movie is similar to older hot movies at time of release!
  16. HISTORICAL CONTEXT: Versions. Web logs; Code to extract user/movie rentals; Recommender for movie licensing. Point in time: A promising new movie is similar to older hot movies at time of release! Trends over time: How does a movie with these features fare over time?
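
The "point in time" question on slides 15 and 16 can be read as an as-of lookup over a versioned history. A minimal Python sketch, assuming each item keeps a timestamp-sorted list of (timestamp, value) pairs; the `value_as_of` helper and the sample data are hypothetical and not taken from Ground:

```python
from bisect import bisect_right

def value_as_of(history, when):
    """Return the value in effect at `when`, or None if nothing was recorded yet.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, when)  # number of entries with timestamp <= when
    return history[i - 1][1] if i else None

# Hypothetical status history for one movie, keyed by year.
movie_status = [(2010, "new release"), (2012, "hot"), (2015, "back catalog")]

at_2013 = value_as_of(movie_status, 2013)
before_release = value_as_of(movie_status, 2009)
```

Keeping the full history rather than only the latest value is what also makes the "trends over time" question answerable.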
  17. BEHAVIORAL CONTEXT: Lineage & Usage. Data Science Recommenders: “You should compare with book sales from last year.”
  18. BEHAVIORAL CONTEXT: Lineage & Usage. Data Science Recommenders: “You should compare with book sales from last year.” Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
  19. BEHAVIORAL CONTEXT: Lineage & Usage. Data Science Recommenders: “You should compare with book sales from last year.” Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.” Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”
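
The impact-analysis tip on slide 19 amounts to a reachability question over a lineage graph: when one artifact changes, which artifacts are derived from it? A minimal Python sketch, assuming lineage is stored as a mapping from each artifact to the artifacts built from it; the artifact names and the `downstream` helper are hypothetical, not part of Ground:

```python
from collections import deque

# Hypothetical lineage: each key feeds the artifacts listed in its value.
LINEAGE = {
    "twitter_analysis_script": ["sentiment_table"],
    "sentiment_table": ["boss_dashboard", "weekly_report"],
    "weather_data": ["logistics_report"],
}

def downstream(lineage, changed):
    """Return every artifact reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

affected = downstream(LINEAGE, "twitter_analysis_script")
```

Here `affected` includes the dashboard because it is two hops downstream of the changed script, which is the situation the slide's example message describes.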
  20. OUTLINE: Motivation: What is Different?; Putting Big Data in Context; Ground: Data Context Services; Examples; Challenges
  21. WHAT ARE WE BUILDING? Grounding philosophy: • Start useful, stay useful. • Stay general. • Design for scale.
  22. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth.
  23. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth. Also shown: Pachyderm; Chronos.
  24. [Figure: two example model graphs. RELATIONAL SCHEMA: Schema 1; Table 1, Table t; Column 1, Column c, Column d; foreign key. JSON DOCUMENT: Root; Object 2; members k1: string, k2: number, k11: string, k12; elements 1, 2, 3. Labels: Models, Versions, Usage.]
  25. COMMON GROUND: The versioning model. Models; Versions; Usage. Model Graphs; Version Graphs.
  26. COMMON GROUND: The model. Models; Versions; Usage. Model Graphs; Version Graphs.
  27. COMMON GROUND: The usage model. Models; Versions; Usage. Model Graphs; Version Graphs; Usage Graphs: Lineage.
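
Slides 25 to 27 name three layers: model graphs, version graphs, and usage graphs (lineage). A minimal Python sketch of how the last two might fit together, assuming versions are identified by strings, each version lists its parent versions, and lineage is a set of directed edges between versions. The `ContextStore` class and all identifiers are hypothetical and do not reflect Ground's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ContextStore:
    """Toy store holding a version graph and a usage (lineage) graph."""
    parents: dict = field(default_factory=dict)  # version id -> parent version ids
    item_of: dict = field(default_factory=dict)  # version id -> item name
    lineage: list = field(default_factory=list)  # (source version, derived version)

    def add_version(self, item, version_id, parent_ids=()):
        for parent in parent_ids:
            if parent not in self.parents:
                raise KeyError(f"unknown parent version: {parent}")
        self.parents[version_id] = list(parent_ids)
        self.item_of[version_id] = item

    def add_lineage(self, src, dst):
        if src not in self.parents or dst not in self.parents:
            raise KeyError("both ends of a lineage edge must be known versions")
        self.lineage.append((src, dst))

    def history(self, version_id):
        """All ancestors of a version, nearest first (breadth-first over parents)."""
        out, seen, queue = [], set(), list(self.parents[version_id])
        while queue:
            version = queue.pop(0)
            if version in seen:
                continue
            seen.add(version)
            out.append(version)
            queue.extend(self.parents[version])
        return out

    def inputs_of(self, version_id):
        """Versions that the given version was derived from, per lineage edges."""
        return [src for src, dst in self.lineage if dst == version_id]

store = ContextStore()
store.add_version("web_logs", "logs@1")
store.add_version("web_logs", "logs@2", ["logs@1"])
store.add_version("extract_code", "code@1")
store.add_version("rentals", "rentals@1")
store.add_lineage("logs@2", "rentals@1")
store.add_lineage("code@1", "rentals@1")
```

Because lineage edges connect versions rather than items, the sketch records that `rentals@1` came from a specific version of the logs and a specific version of the code.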
  28. USAGE GRAPHS: Everything can participate in usage. Models; Versions; Usage.
  31. OUTLINE: Motivation: What is Different?; Putting Big Data in Context; Ground: Data Context Services; Examples; Challenges
  32. GROUND vZERO. Goals: Exercise flexibility of Common Ground; Proof of concept for starting useful. Examples: Grit: Ground+Git; Apiary: A Grounded Hive Metastore.
  33. GRIT SCENARIO: CS 186 @ Berkeley. 500 students • Now the 6th-largest upper-division course at Berkeley. CS 186 homework submissions through GitHub. Goal: Track students’ submission history. E.g.: • Track tardiness • Prevent submission time spoofing • Long-term: Analyze homework turn-in patterns
  34. TECHNICAL ISSUES: Ideally, you track git history as it’s created. • GitHub’s Webhooks API! • Reports back to Ground on every push. Unfortunately, some wrinkles here: • Webhooks API doesn’t report the full version lineage • So can’t rely on GitHub. Track the git repo ourselves. • A topic for future collaboration perhaps. (FWIW, Google Docs is even messier!)
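
Slide 34 says the full version lineage has to be read from the git repository itself. One way this could look, sketched in Python under the assumption that a local clone is available: `git rev-list --parents --all` prints each commit followed by its parent commits, which is enough to rebuild the version graph. The helper names are hypothetical, and this is not a description of how Grit is implemented:

```python
import subprocess

def parse_parent_lines(text):
    """Parse lines of the form '<commit> <parent> <parent> ...' into a dict."""
    graph = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            graph[fields[0]] = fields[1:]
    return graph

def read_repo_lineage(repo_path):
    """Run `git rev-list --parents --all` in a local clone and parse the output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--parents", "--all"],
        check=True, capture_output=True, text=True,
    )
    return parse_parent_lines(result.stdout)

def new_edges(old_graph, new_graph):
    """(parent, child) edges present in new_graph but missing from old_graph."""
    old = {(p, c) for c, ps in old_graph.items() for p in ps}
    return sorted(
        (p, c) for c, ps in new_graph.items() for p in ps if (p, c) not in old
    )

# Shortened, made-up commit ids; c3 is a merge of c1 and c2.
sample = "c3 c1 c2\nc2 c0\nc1 c0\nc0\n"
graph = parse_parent_lines(sample)
added = new_edges({"c0": [], "c1": ["c0"]}, graph)
```

A push notification could then serve only as a trigger to re-read the repository, with `new_edges` picking out the lineage that has not been recorded yet.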
  35. APIARY: A Grounded Hive Metastore. Hive Metastore: the de facto catalog for structured big data. Apiary: Ground as the backing store for Hive Metastore. Relational catalog: a design pattern above our basic context model. [Figure: Schema 1; Table 1, Table t; Column 1, Column c, Column d; foreign key]
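
Slide 35 describes the relational catalog as a design pattern layered on a more basic context model. A minimal Python sketch of that layering, assuming the basic model is a plain graph of nodes and labeled edges; the `Graph` class, the `add_table` helper, the edge labels, and the sample schema are all hypothetical and are not Ground's or Hive's API:

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """Toy node-and-edge store standing in for a generic context model."""
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (source id, label, target id)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, label, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, label, dst))

    def targets(self, src, label):
        return [d for s, l, d in self.edges if s == src and l == label]

def add_table(graph, schema, table, columns):
    """Register schema -> table -> column nodes; return the table's node id."""
    graph.add_node(schema, "schema")
    table_id = f"{schema}.{table}"
    graph.add_node(table_id, "table")
    if table_id not in graph.targets(schema, "has_table"):
        graph.add_edge(schema, "has_table", table_id)
    for column in columns:
        column_id = f"{table_id}.{column}"
        graph.add_node(column_id, "column")
        graph.add_edge(table_id, "has_column", column_id)
    return table_id

catalog = Graph()
orders = add_table(catalog, "shop", "orders", ["order_id", "customer_id"])
customers = add_table(catalog, "shop", "customers", ["customer_id"])
catalog.add_edge(
    "shop.orders.customer_id", "foreign_key", "shop.customers.customer_id"
)
```

The relational vocabulary (schema, table, column, foreign key) lives entirely in `add_table` and the edge labels, so the underlying graph stays unopinionated.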
  36. PUTTING IT TOGETHER: APIARY + HDFS + Fully-versioned context store! Labels: versioned metadata storage; append-mostly data storage; versioned code; ground.
  37. OUTLINE: Motivation: What is Different?; Putting Big Data in Context; Ground: Data Context Services; Examples; Challenges
  38. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth. INITIAL FOCUS AREAS
  39. ABOVEGROUND API TO APPLICATIONS: Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth. INITIAL FOCUS AREAS: Parsing & Featurization; Model Serving; Reproducibility.
  40. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; ID & Auth. INITIAL FOCUS AREAS: Versioned Storage.
  41. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving. COMMON GROUND: CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth. BROAD ARENA FOR INNOVATION AT ALL LEVELS!
  42. GROUNDING BIG DATA WITH CONTEXT SERVICES. It’s time to establish a bigger context for big data. Historical context: Because things change. Behavioral context: Because behavior determines meaning. Application context: Because truth is subjective.