Grounding Big Data

We present the motivation behind building an open-source data context service in the big data ecosystem and discuss our initial work on the Ground project at U.C. Berkeley.

Vikram Sreekanti

March 30, 2016

Transcript

  1. Grounding Big Data
    Joe Hellerstein
    Vikram Sreekanti
    UC Berkeley

  2. REMEMBERING THE PAST
    Data Warehouse
    Single Source of Truth
    Enterprise Information Architecture
    Golden Master

    Truth

  3. REMEMBERING THE PAST
    “There is no point in bringing data … into the data warehouse environment without integrating it.”
    — Bill Inmon, Building the Data Warehouse
    Truth

  4. Big data took us to a new world

  5. Big data took us to a new world
    There were changes in volume, velocity and variety, which were challenging.

  6. Big data took us to a new world
    There were changes in volume, velocity and variety, which were challenging.
    The real challenge now is the meaning and value of data, which depend critically on context.

  7. A broader context for big data
    ground

  8. OUTLINE
    Motivation: What is Different?
    Putting Big Data in Context
    Ground: Data Context Services
    Examples
    Challenges

  9. Isn’t this just metadata?
    Metadata: The last thing anybody wants to work on

  10. CONTEXT IS SO MUCH MORE
    Metadata: The last thing anybody wants to work on
    Data context services: The final frontier

  11. WHAT IS DIFFERENT?
    Shift in technology: Data representations
    Shift in behavior: Data-driven organizations

  12. Shift in behavior: Data-driven organizations

  13. Data in products
    Started with the Internet.
    Now, the Internet of Things

  14. Data in marketing
    By 2017: marketing spends more on tech than IT does.
    By 2020: 90% of IT budget controlled outside of IT.
    GARTNER GROUP

  15. MANY USE CASES
    MANY CONSTITUENCIES
    MANY INCENTIVES
    MANY CONTEXTS

  16. WHAT IS DIFFERENT?
    Shift in technology: Data representations
    Shift in behavior: Data-driven organizations

  17. Shift in technology: Data representations

  18. Raw data in the data lake
    Simplifies capture
    Encourages exploration
    What does it mean? It depends on the context.

  19. A LITTLE SCENARIO
    HDFS

  20. BITS
    A web log from a retail site

  21. BITS
    All the web logs from last year

  22. VIEWS, MODELS, CODE
    A script to extract orders. To be used for Market Basket analysis.

  23. VIEWS, MODELS, CODE
    A Hive table of orders. To be used for Market Basket analysis.

  24. BITS
    All the web logs from last year

  25. VIEWS, MODELS, CODE
    Code to extract abandoned user sessions

  26. VIEWS, MODELS, CODE
    A retargeting model

  27. VIEWS, MODELS, CODE
    A Hive table of orders
    A retargeting model

  28. [no text on this slide]

  29. MANY SCRIPTS
    MANY MODELS
    MANY APPLICATIONS
    MANY CONTEXTS

  30. OUTLINE
    Motivation: What is Different?
    Putting Big Data in Context
    Ground: Data Context Services
    Examples
    Challenges

  31. THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT
    Application Context: Views, models, code
    Behavioral Context: Data lineage & usage
    Historical Context: In and over time

  32. APPLICATION CONTEXT
    Metadata
    Models for interpreting the data for use
    • Data structures
    • Semantic structures
    • Statistical structures
    Theme: An unopinionated model of context

  33. HISTORICAL CONTEXT
    Versions
    Web logs
    Code to extract user/movie rentals
    Recommender for movie licensing
    Point in time: A promising new movie is similar to older hot movies at time of release!
    Trends over time: How does a movie with these features fare over time?

  34. BEHAVIORAL CONTEXT
    Why Dora?!
    Lineage & Usage

  35. BEHAVIORAL CONTEXT
    Lineage & Usage
    Data Science Recommenders: “You should compare with book sales from last year.”
    Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
    Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”

  36. THE BIG CONTEXT
    A NEW WORLD NEEDS NEW SERVICES

  37. OUTLINE
    Motivation: What is Different?
    Putting Big Data in Context
    Ground: Data Context Services
    Examples
    Challenges

  38. WHAT ARE WE BUILDING?
    Grounding philosophy
    • Start useful, stay useful.
    • Stay general.
    • Design for scale.

  39. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth

  40. Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth
    Pachyderm
    Chronos
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND

  41. COMMON GROUND
    An unopinionated context model
    Versions
    Models
    Usage

  42. COMMON GROUND
    The metamodel
    Models
    Versions
    Usage
    Model Graphs

  43. JSON DOCUMENT
    Root
    Object 2
    member k1: string
    member k2: number
    member k11: string
    member k12
    element 1, element 2, element 3
    RELATIONAL SCHEMA
    Schema 1
    Table 1: Column 1, Column c
    Table t: Column 1, Column d
    foreign key
    Models
    Versions
    Usage

  44. COMMON GROUND
    The versioning model
    Models
    Versions
    Usage
    Model Graphs
    Version Graphs

  45. COMMON GROUND
    The model
    Models
    Versions
    Usage
    Model Graphs
    Version Graphs

  46. VERSION GRAPHS
    Directed Acyclic Graphs (partial orders)
    In this order
    In no particular order
    a3eb4b765520b0d0ab90594dcf2373c1ce5dbb0b0
    0e9233e8e99cccd6861d304968efa4c945a0b918
    3e64220f08374629ad43ca652d4ce7cef0bdbbca
    3e0bada008655fe32d7d136eac0a3f333d23ed80
    fd75a4ba16f96d11f3f954854acc2d739054233
    Models
    Versions
    Usage
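
The version graph on this slide (a directed acyclic graph, read as a partial order) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the short version ids `v1` to `v4`, the `parents` mapping, and the helper names are hypothetical and are not taken from Ground's API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical version ids, each mapped to the versions it was derived from.
parents = {
    "v1": [],
    "v2": ["v1"],
    "v3": ["v1"],        # branches from v1, so it is unordered relative to v2
    "v4": ["v2", "v3"],  # merges the two branches
}


def ancestors(version, parents):
    """Return every version reachable by following parent edges."""
    seen = set()
    stack = list(parents[version])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def comparable(a, b, parents):
    """True when the partial order relates a and b (one precedes the other)."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)


# One linear order that is consistent with the partial order.
order = list(TopologicalSorter(parents).static_order())
```

Here `v1` and `v4` are ordered ("in this order"), while `v2` and `v3` sit on separate branches and are not ordered relative to each other ("in no particular order").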

  47. COMMON GROUND
    The usage model
    Models
    Versions
    Usage
    Model Graphs
    Version Graphs
    Usage Graphs: Lineage

  48. USAGE GRAPHS
    Everything can participate in usage
    Models
    Versions
    Usage
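
A usage graph that records lineage can be sketched as a set of "derived from" edges that can be walked in either direction. The class, method names, and item names below are hypothetical, chosen to echo the impact-analysis example from the Behavioral Context slide. They are not Ground's data model.

```python
from collections import defaultdict, deque


class UsageGraph:
    """Toy lineage store: an edge means 'derived was produced using source'."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def record(self, source, derived):
        """Record that `derived` was produced using `source`."""
        self._downstream[source].add(derived)
        self._upstream[derived].add(source)

    @staticmethod
    def _reach(start, edges):
        """Breadth-first walk returning everything reachable from start."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impacted_by(self, item):
        """Everything downstream of item (impact analysis)."""
        return self._reach(item, self._downstream)

    def lineage_of(self, item):
        """Everything upstream of item (lineage)."""
        return self._reach(item, self._upstream)


graph = UsageGraph()
graph.record("tweets", "sentiment_table")
graph.record("twitter_analysis_script", "sentiment_table")
graph.record("sentiment_table", "boss_dashboard")

impacted = graph.impacted_by("twitter_analysis_script")
```

Data, code, and derived tables are all plain nodes here, which is one way to read "everything can participate in usage".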

  49. COMMON GROUND
    The model
    Versions
    Models
    Usage
    Model Graphs
    Version Graphs
    Usage Graphs: Lineage

  50. OUTLINE
    Motivation: What is Different?
    Putting Big Data in Context
    Ground: Data Context Services
    Examples
    Challenges

  51. GROUND vZERO
    Goals
    • Exercise flexibility of Common Ground
    • Proof of concept for starting useful
    Examples
    • Grit: Ground+Git
    • Apiary: A Grounded Hive Metastore

  52. GRIT SCENARIO
    CS 186 @ Berkeley
    500 students
    • Now the 6th-largest upper-division course at Berkeley
    CS 186 homework submissions through GitHub
    Goal: Track students’ submission history. E.g.:
    • Track tardiness
    • Prevent submission time spoofing
    • Long-term: Analyze homework turn-in patterns

  53. TECHNICAL ISSUES
    Ideally, you track git history as it’s created
    • GitHub’s Webhooks API!
    • Reports back to Ground on every push.

    Unfortunately, some wrinkles here
    • Webhooks API doesn’t report the full version lineage
    • So can’t rely on GitHub. Track the git repo ourselves.
    • A topic for future collaboration perhaps.
    (FWIW, Google Docs is even messier!)
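
One way to "track the git repo ourselves" is to read parent pointers straight from the repository. The sketch below assumes its input text came from `git rev-list --parents`, which prints each commit id followed by the ids of its parents. The function name and the short sample ids are hypothetical (real ids are full hashes), and this is not how Grit itself is implemented.

```python
def parse_rev_list(text):
    """Map each commit id to the list of its parent ids.

    Each non-empty input line has the form: <commit> [<parent> ...]
    """
    lineage = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            lineage[fields[0]] = fields[1:]
    return lineage


# Hypothetical sample output: c3 merges c2 and c1b, which both branch from c1.
sample = """\
c3 c2 c1b
c2 c1
c1b c1
c1
"""

lineage = parse_rev_list(sample)
merges = [commit for commit, ps in lineage.items() if len(ps) > 1]
roots = [commit for commit, ps in lineage.items() if not ps]
```

The resulting mapping is a version graph: commits with several parents are merges, and commits with none are roots.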

  54. APIARY
    A Grounded Hive Metastore
    Hive Metastore: the de facto catalog for structured big data
    Relational catalog: a design pattern above our basic context model
    Apiary: Ground as the backing store for Hive Metastore
    Schema 1
    Table 1: Column 1, Column c
    Table t: Column 1, Column d
    foreign key
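
The idea of a relational catalog as a design pattern layered on a generic context model can be sketched as follows. Everything here is an assumption made for illustration: `ModelGraph`, `add_table`, the edge labels, and the table and column names are hypothetical and do not come from Ground, Apiary, or the Hive Metastore.

```python
from dataclasses import dataclass, field


@dataclass
class ModelGraph:
    """Generic graph: nodes carry a kind, edges carry a label."""
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: set = field(default_factory=set)    # (source, label, target)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, source, label, target):
        self.edges.add((source, label, target))


def add_table(graph, schema, table, columns):
    """Relational pattern: schema contains table, table contains columns."""
    table_id = f"{schema}.{table}"
    graph.add_node(schema, "schema")
    graph.add_node(table_id, "table")
    graph.add_edge(schema, "contains", table_id)
    for column in columns:
        column_id = f"{table_id}.{column}"
        graph.add_node(column_id, "column")
        graph.add_edge(table_id, "contains", column_id)
    return table_id


catalog = ModelGraph()
add_table(catalog, "schema1", "orders", ["order_id", "customer_id"])
add_table(catalog, "schema1", "customers", ["customer_id"])
# A foreign key is just another labeled edge between two column nodes.
catalog.add_edge(
    "schema1.orders.customer_id", "foreign_key", "schema1.customers.customer_id"
)
```

The graph layer knows nothing about tables or columns. The relational vocabulary lives entirely in `add_table` and the edge labels, which is the sense in which the catalog sits above the basic model.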

  55. PUTTING IT TOGETHER
    APIARY: versioned metadata storage
    + HDFS: append-mostly data storage
    + versioned code
    Fully-versioned context store!
    ground

  56. APIARY + GRIT + HDFS
    Different versions of data (HDFS)
    + Different versions of code (Grit)
    + Different versions of metadata (Apiary)

  57. DEMO

  58. OUTLINE
    Motivation: What is Different?
    Putting Big Data in Context
    Ground: Data Context Services
    Examples
    Challenges

  59. INITIAL FOCUS AREAS

  60. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth
    INITIAL FOCUS AREAS

  61. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth
    INITIAL FOCUS AREAS
    Parsing & Featurization
    Model Serving
    Reproducibility

  62. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    ID & Auth
    INITIAL FOCUS AREAS
    Versioned Storage

  63. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth
    ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    BROAD ARENA FOR INNOVATION AT ALL LEVELS!

  64. GROUNDING BIG DATA WITH CONTEXT SERVICES
    It’s time to establish a bigger context for big data.
    Historical context: Because things change
    Behavioral context: Because behavior determines meaning
    Application context: Because truth is subjective

  65. INPUT AND FURTHER CONTEXT

  66. ground
    Learn more at:
    http://www.ground-metadata.org
    @joe_hellerstein
    @vsreekanti
