
Grounding Big Data

We present the motivation behind building an open-source data context service in the big data ecosystem and discuss our initial work on the Ground project at U.C. Berkeley.

Vikram Sreekanti

March 30, 2016

Transcript

  1. Grounding Big Data Joe Hellerstein Vikram Sreekanti UC Berkeley

  2. REMEMBERING THE PAST Data Warehouse Single Source of Truth Enterprise Information Architecture Golden Master … Truth
  3. REMEMBERING THE PAST “There is no point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse Truth
  4. Big data took us to a new world

  5. There were changes in volume, velocity and variety, which were challenging. Big data took us to a new world
  6. There were changes in volume, velocity and variety, which were challenging. The real challenge now is the meaning and value of data, which depend critically on context. Big data took us to a new world
  7. A broader context for big data ground

  8. Motivation: What is Different? Ground: Data Context Services Examples Challenges Putting Big Data in Context OUTLINE
  9. Metadata: The last thing anybody wants to work on Isn’t this just metadata?
  10. Data context services: The final frontier CONTEXT IS SO MUCH MORE Metadata: The last thing anybody wants to work on
  11. WHAT IS DIFFERENT? Shift in technology Data representations Shift in behavior Data-driven organizations
  12. Shift in behavior Data-driven organizations

  13. Data in products Started with the Internet. Now, the Internet of Things
  14. By 2017: marketing spends more on tech than IT does. Data in marketing GARTNER GROUP By 2020: 90% of IT budget controlled outside of IT.
  15. MANY USE CASES MANY CONSTITUENCIES MANY INCENTIVES MANY CONTEXTS

  16. WHAT IS DIFFERENT? Shift in technology Data representations Shift in behavior Data-driven organizations
  17. Shift in technology Data representations

  18. What does it mean? It depends on the context. Raw data in the data lake Simplifies capture Encourages exploration
  19. A LITTLE SCENARIO HDFS

  20. BITS A web log from a retail site

  21. BITS All the web logs from last year

  22. VIEWS, MODELS, CODE A script to extract orders. To be used for Market Basket analysis.
  23. VIEWS, MODELS, CODE A Hive table of orders. To be used for Market Basket analysis.
  24. BITS All the web logs from last year

  25. VIEWS, MODELS, CODE Code to extract abandoned user sessions

  26. VIEWS, MODELS, CODE A retargeting model

  27. A Hive table of orders A retargeting model VIEWS, MODELS, CODE
  28. None
  29. MANY SCRIPTS MANY MODELS MANY APPLICATIONS MANY CONTEXTS

  30. Putting Big Data in Context Ground: Data Context Services Examples Challenges Motivation: What is Different? OUTLINE
  31. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT Application Context Views, models, code Behavioral Context Data lineage & usage Historical Context In and over time
  32. APPLICATION CONTEXT Metadata Models for interpreting the data for use • Data structures • Semantic structures • Statistical structures Theme: An unopinionated model of context
  33. HISTORICAL CONTEXT Versions Web logs Code to extract user/movie rentals Recommender for movie licensing Point in time A promising new movie is similar to older hot movies at time of release! Trends over time How does a movie with these features fare over time?
  34. BEHAVIORAL CONTEXT Why Dora?! Lineage & Usage

  35. BEHAVIORAL CONTEXT Lineage & Usage Data Science Recommenders “You should compare with book sales from last year.” Curation Tips “Logistics staff checks weather data the 1st Monday of every month.” Proactive Impact Analysis “The Twitter analysis script changed. You should check the boss’ dashboard!”
  36. THE BIG CONTEXT A NEW WORLD NEEDS NEW SERVICES
  37. Putting Big Data in Context Ground: Data Context Services Examples Challenges Motivation: What is Different? OUTLINE
  38. WHAT ARE WE BUILDING? Grounding philosophy • Start useful, stay useful. • Stay general. • Design for scale.
  39. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth.
  40. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth. Pachyderm, Chronos.
  41. COMMON GROUND An unopinionated context model Versions Models Usage

  42. COMMON GROUND The metamodel Models Versions Usage Model Graphs
  43. JSON DOCUMENT: Root; member k1; member k1: string; member k2; Object 2; member k1; member k2: number; member k11: string; member k12; element 1, element 2, element 3; element 1, element 2, element 3. RELATIONAL SCHEMA: Schema 1; Table 1 with Column 1 and Column c; Table t with Column 1 and Column d; foreign key. Models Versions Usage
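A minimal sketch of the idea behind slide 43, assuming that both a relational schema and a JSON document can be written down as one generic labeled graph. The `ModelGraph` class, the helper names, the edge labels, and the direction of the foreign-key edge are all hypothetical illustrations, not Ground's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelGraph:
    """A small labeled graph: every node has a kind, every edge a label."""
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (source, label, target)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def children(self, node_id):
        # Targets of all edges leaving node_id, in insertion order.
        return [t for s, _, t in self.edges if s == node_id]


def add_relational_schema(graph, schema, tables):
    """Schema -> tables -> columns, mirroring the RELATIONAL SCHEMA figure."""
    graph.add_node(schema, "schema")
    for table, columns in tables.items():
        graph.add_node(table, "table")
        graph.add_edge(schema, "table", table)
        for column in columns:
            column_id = f"{table}.{column}"
            graph.add_node(column_id, "column")
            graph.add_edge(table, "column", column_id)


def add_json_document(graph, node_id, value):
    """Objects get 'member' edges, arrays get 'element' edges."""
    if isinstance(value, dict):
        graph.add_node(node_id, "object")
        for key, child in value.items():
            child_id = f"{node_id}/{key}"
            add_json_document(graph, child_id, child)
            graph.add_edge(node_id, "member", child_id)
    elif isinstance(value, list):
        graph.add_node(node_id, "array")
        for index, child in enumerate(value):
            child_id = f"{node_id}/{index}"
            add_json_document(graph, child_id, child)
            graph.add_edge(node_id, "element", child_id)
    else:
        # Leaf values are recorded by their Python type name.
        graph.add_node(node_id, type(value).__name__)


graph = ModelGraph()
add_relational_schema(graph, "Schema 1", {
    "Table 1": ["Column 1", "Column c"],
    "Table t": ["Column 1", "Column d"],
})
# Assumed direction: Column c of Table 1 references Column 1 of Table t.
graph.add_edge("Table 1.Column c", "foreign key", "Table t.Column 1")
add_json_document(graph, "Root", {"k1": "text", "k2": [1, 2, 3]})
```

Both models end up in the same node and edge tables, which is one way to read "unopinionated": the graph layer does not need to know whether it is holding a catalog or a document.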
  44. COMMON GROUND The versioning model Models Versions Usage Model Graphs Version Graphs
  45. COMMON GROUND The model Models Versions Usage Model Graphs Version Graphs
  46. VERSION GRAPHS Directed Acyclic Graphs (partial orders) In this order In no particular order a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0 0e9233e8e99cccd6861d304968efa4c945a0b918 3e64220f08374629ad43ca652d4ce7cef0bdbbca 3e0bada008655fe32d7d136eac0a3f333d23ed80 fd75a4ba16f96d11f3f954854acc2d739054233 Models Versions Usage
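A minimal sketch of a version graph as a directed acyclic graph, assuming each version records the versions it was derived from. The `VersionGraph` class and the version names are hypothetical. It illustrates the two cases named on slide 46: versions that are ordered ("in this order") and versions that are not ("in no particular order").

```python
class VersionGraph:
    """Versions and their parents; parents must exist before children."""

    def __init__(self):
        self.parents = {}  # version id -> tuple of parent version ids

    def add_version(self, version_id, parents=()):
        if version_id in self.parents:
            raise ValueError(f"duplicate version: {version_id}")
        for parent in parents:
            if parent not in self.parents:
                raise KeyError(f"unknown parent: {parent}")
        # Because parents must already exist, no cycle can ever be formed.
        self.parents[version_id] = tuple(parents)

    def ancestors(self, version_id):
        seen = set()
        stack = list(self.parents[version_id])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen

    def precedes(self, earlier, later):
        """True if `earlier` is an ancestor of `later`."""
        return earlier in self.ancestors(later)

    def comparable(self, a, b):
        """False when the partial order says nothing about a versus b."""
        return a == b or self.precedes(a, b) or self.precedes(b, a)


versions = VersionGraph()
versions.add_version("v1")
versions.add_version("v2", parents=["v1"])        # one branch
versions.add_version("v3", parents=["v1"])        # a parallel branch
versions.add_version("v4", parents=["v2", "v3"])  # a merge

in_order = versions.precedes("v1", "v4")          # v1 comes before v4
no_particular_order = not versions.comparable("v2", "v3")
```

Requiring parents to exist at insertion time is a design choice made here for simplicity; it keeps the graph acyclic without a separate cycle check.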
  47. COMMON GROUND The usage model Models Versions Usage Model Graphs Version Graphs Usage Graphs: Lineage
  48. USAGE GRAPHS Everything can participate in usage Models Versions Usage
  49. COMMON GROUND The model Versions Models Usage Model Graphs Version Graphs Usage Graphs: Lineage
  50. Putting Big Data in Context Ground: Data Context Services Examples Challenges Motivation: What is Different? OUTLINE
  51. GROUND vZERO Goals Exercise flexibility of Common Ground Proof of concept for starting useful Examples Grit: Ground+Git Apiary: A Grounded Hive Metastore
  52. GRIT SCENARIO CS 186 @ Berkeley 500 students • Now the 6th-largest upper-division course at Berkeley CS 186 homework submissions through GitHub Goal: Track students’ submission history. E.g.: • Track tardiness • Prevent submission time spoofing • Long-term: Analyze homework turn-in patterns
  53. TECHNICAL ISSUES Ideally, you track git history as it’s created • GitHub’s Webhooks API! • Reports back to Ground on every push. Unfortunately, some wrinkles here • Webhooks API doesn’t report the full version lineage • So can’t rely on GitHub. Track the git repo ourselves. • A topic for future collaboration perhaps. (FWIW, Google Docs is even messier!)
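A minimal sketch of what "track the git repo ourselves" could look like, assuming a local clone is available and that full lineage is read with `git rev-list --parents --all`, which prints each commit followed by its parents. The function names and the sample data are hypothetical; this is not Grit's actual implementation.

```python
import subprocess


def parse_rev_list(output):
    """Turn 'commit parent1 parent2 ...' lines into a parent map."""
    lineage = {}
    for line in output.splitlines():
        parts = line.split()
        if parts:
            lineage[parts[0]] = parts[1:]
    return lineage


def read_lineage(repo_path):
    """Read the full commit lineage of a local clone (requires git)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--parents", "--all"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_rev_list(result.stdout)


# Shortened, made-up commit ids standing in for real rev-list output.
sample_output = "c3 c1 c2\nc2 c0\nc1 c0\nc0\n"
lineage = parse_rev_list(sample_output)
merge_commits = [c for c, parents in lineage.items() if len(parents) > 1]
root_commits = [c for c, parents in lineage.items() if not parents]
```

In a setup like the one on slide 53, a push notification could serve only as a trigger to run `read_lineage` on a fresh fetch, so the lineage comes from the repository itself rather than from the notification payload.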
  54. APIARY A Grounded Hive Metastore Schema 1 Table 1 Column 1 Column c Table t Column 1 Column d foreign key Apiary: Ground as the backing store for Hive Metastore. Relational catalog: a design pattern above our basic context model. Hive Metastore: the de facto catalog for structured big data.
  55. PUTTING IT TOGETHER APIARY + HDFS + Fully-versioned context store! versioned metadata storage append-mostly data storage versioned code ground
  56. APIARY + GRIT + HDFS Different versions of data (HDFS) Different versions of code (Grit) + + = Different versions of metadata (Apiary)
  57. DEMO

  58. OUTLINE Putting Big Data in Context Ground: Data Context Services Examples Challenges Motivation: What is Different?
  59. INITIAL FOCUS AREAS

  60. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth. INITIAL FOCUS AREAS
  61. ABOVEGROUND API TO APPLICATIONS: Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth. INITIAL FOCUS AREAS: Parsing & Featurization, Model Serving, Reproducibility
  62. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, ID & Auth. INITIAL FOCUS AREAS: Versioned Storage
  63. ABOVEGROUND API TO APPLICATIONS: Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving. COMMON GROUND CONTEXT MODEL. UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth. BROAD ARENA FOR INNOVATION AT ALL LEVELS!
  64. It’s time to establish a bigger context for big data. Historical context: Because things change. Behavioral context: Because behavior determines meaning. Application context: Because truth is subjective. GROUNDING BIG DATA WITH CONTEXT SERVICES
  65. INPUT AND FURTHER CONTEXT

  66. ground Learn more at: http://www.ground-metadata.org @joe_hellerstein @vsreekanti