Data Context 
In the Field and in the Lab

Keynote talk, 1st Workshop on Context in Analytics, IEEE International Conference on Data Engineering (ICDE) 2018.

Joe Hellerstein

April 16, 2018

Transcript

  1. Data Context
    In the Field and in the Lab
    JOE HELLERSTEIN, UC BERKELEY
     1ST WORKSHOP ON CONTEXT IN ANALYTICS, PARIS, APRIL 2018

  2. Perspectives on Data Context
     Wrangling: Six years with Trifacta and Google Cloud Dataprep
     Context Services: The Common Ground model, and Ground system
     flor / ML Lifecycle: Managing the rise of Empirical AI

  3. An Overview for Academics
     April 24, 2017

  4. Data Preparation: A key source of data context.
     Shifting Market: WHO, WHAT, WHERE
     Problem Relevance: 80%
     Data Prep Market
     Use Cases
     Platform Challenges: On-Prem Data, Agents, ADLS
     DATA SECURITY & ACCESS CONTROLS
     TRANSPARENT DATA LINEAGE
     DATA CATALOG INTEGRATION
     ANALYTICS, DATA SCIENCE, AUTOMATION
     Perspectives on Data Context
     Wrangling: Six years with Trifacta and Google Cloud Dataprep
     Context Services: The Common Ground model, and Ground system
     flor / ML Lifecycle: Managing the rise of Empirical AI

  5. Data Preparation: A key source of data context.
     Shifting Market: WHO, WHAT, WHERE
     Problem Relevance: 80%
     Data Prep Market
     Use Cases
     Platform Challenges: On-Prem Data, Agents, ADLS
     DATA SECURITY & ACCESS CONTROLS
     TRANSPARENT DATA LINEAGE
     DATA CATALOG INTEGRATION
     ANALYTICS, DATA SCIENCE, AUTOMATION

  6. 1990’s: IT Governance In the Data Warehouse
    “There is no point in bringing data … into the data
    warehouse environment without integrating it.”
    — Bill Inmon, Building the Data Warehouse, 2005


  7. 2018: Business Value From A Quantified World
    “Get into the mindset to collect and measure
    everything you can.”
    — DJ Patil, Building Data Science Teams, 2011


  8. “Get into the mindset to collect and measure
    everything you can.”
    — DJ Patil, Building Data Science Teams, 2011
    “There is no point in bringing data … into the data
    warehouse environment without integrating it.”
    — Bill Inmon, Building the Data Warehouse, 2005
    2018: Business Value From A Quantified World


  9. The World is Changing
     WHO: Power Shift (End-User Self-Service)
     WHAT: Data Shift (Big Data Analytics, Empirical AI)
     WHERE: Platform Shift (Open Source, Cloud)

  10. Whitespace: Data Wrangling
      DATA PLATFORMS
      ANALYSIS & CONSUMPTION
      80%
      “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.”
      — DJ Patil, Data Jujitsu, O’Reilly Media 2012

  11. Research Roots: Open Source Data Wrangler, 2011
      + Predictive Interaction
      + Immediate feedback on multiple choices
      + Data quality indicators
      “Wrangler: Interactive Visual Specification of Data Transformation Scripts.” S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. CHI 2011.
      “Proactive Wrangling: Mixed-Initiative End-User Programming of Data Transformation Scripts.” P.J. Guo, S. Kandel, J.M. Hellerstein, J. Heer. UIST 2011.
      “Enterprise Data Analysis and Visualization: An Interview Study.” S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. IEEE VAST 2012.
      “Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.” S. Kandel, et al. AVI 2012.
      “Predictive Interaction for Data Transformation.” J. Heer, J.M. Hellerstein, S. Kandel. CIDR 2015.

  12. The State of the Data Prep Market, 2018
      Market category well established
      Forrester did its first “Wave” ranking in 2017
      Gartner now estimates > $1 billion market for Data Prep by 2021
      “Trifacta delivers a strong balance for self-service by analysts and business users. Customer references gave high marks to Trifacta’s ease of use. Trifacta leverages machine learning algorithms to automate and simplify the interaction with data.”

  13. Data Wrangling Standard Across Industry Leaders
      Financial Services
      Insurance
      Healthcare and Pharmaceuticals
      Retail and Consumer Goods
      Government Agencies

  14. A DATA WRANGLING PLATFORM
      On-Prem Data, Agents, ADLS
      DATA SECURITY & ACCESS CONTROLS
      TRANSPARENT DATA LINEAGE
      DATA CATALOG INTEGRATION

  15. A DATA WRANGLING PLATFORM
      On-Prem Data, Agents, ADLS
      DATA SECURITY & ACCESS CONTROLS
      TRANSPARENT DATA LINEAGE
      DATA CATALOG INTEGRATION
      Millions of lines of recipes
      Multi-terabyte flows
      24x7 Elasticity & Scalability

  16. CDC AIDS Intervention in Indiana


  17. Benefits
      The CDC reduced data preparation from 3 months to 3 days with Trifacta.
      Refuted an assumption in analysis that would not have been possible without enriching datasets.
      Expects to scale this model to other, similar outbreaks, such as Zika or Ebola.
      In future, we need to combine “a variety of sources to identify jurisdictions that, like this county in Indiana, may be at risk of an IDU-related HIV outbreak. These data include drug arrest records, overdose deaths, opioid sales and prescriptions, availability of insurance, emergency medical services, and social and demographic data.”
      - CDC, “The Anatomy of an HIV Outbreak Response in a Rural Community”
      E. M. Campbell, H. Jia, A. Shankar, et al. “Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States.” Journal of Infectious Diseases, 216(9), 27 November 2017, 1053–1062. https://academic.oup.com/jid/article/216/9/1053/4347235
      https://blogs.cdc.gov/publichealthmatters/2015/06/the-anatomy-of-an-hiv-outbreak-response-in-a-rural-community/

  18. A DATA WRANGLING PLATFORM
      On-Prem Data, Agents, ADLS
      DATA SECURITY & ACCESS CONTROLS
      TRANSPARENT DATA LINEAGE
      DATA CATALOG INTEGRATION

  19. A DATA WRANGLING PLATFORM
      On-Prem Data, Agents, ADLS
      DATA SECURITY & ACCESS CONTROLS
      TRANSPARENT DATA LINEAGE
      DATA CATALOG INTEGRATION
      ANALYTICS, DATA SCIENCE, AUTOMATION

  20. Common ground?
      • Burgeoning SW market
      • n² connections?
      • Common formats must emerge
      • Need a shared place to Write it down, Link it up
      • Critical to market health!

  21. ground
      A DATA CONTEXT SERVICE
      Joseph M. Hellerstein, Sean Lobo, Nipun Ramakrishnan, Avinash Arjavalingam, Vikram Sreekanti

  22. Beyond Metadata: Architecture
      GROUND ARCHITECTURE:
      ABOVEGROUND API TO APPLICATIONS (Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility)
      COMMON GROUND METAMODEL
      UNDERGROUND API TO SERVICES (Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth)
      Common Ground, the ABCs of Context:
      Application Context (A: Model Graphs): Views, models, code
      Behavioral Context (B: Usage Graphs): Data lineage & usage
      Change over time (C: Version Graphs): Version history
      Awareness: Community Health; Metadata & Data Management; Data Analysis; Data Wrangling
      Perspectives on Data Context
      Wrangling: Six years with Trifacta and Google Cloud Dataprep
      Context Services: The Common Ground model, and Ground system
      flor / ML Lifecycle: Managing the rise of Empirical AI

  23. Vendor-Neutral, Unopinionated Data Context Services
      Beyond Metadata: Architecture
      GROUND ARCHITECTURE:
      ABOVEGROUND API TO APPLICATIONS (Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility)
      COMMON GROUND METAMODEL
      UNDERGROUND API TO SERVICES (Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth)
      Common Ground, the ABCs of Context:
      Application Context (A: Model Graphs): Views, models, code
      Behavioral Context (B: Usage Graphs): Data lineage & usage
      Change over time (C: Version Graphs): Version history
      Awareness: Community Health; Metadata & Data Management; Data Analysis; Data Wrangling

  24. A Recurring Conversation with the Big Data Community
      “Isn’t this just metadata?”
      “Metadata: the last thing anybody wants to work on.”
      Community Health; Metadata & Data Management; Data Analysis; Data Wrangling
      Time to Go Meta (on Use), Strata New York 2015
      Grounding Big Data, Strata San Jose 2016
      Data Relativism, Strata London Keynote 2016
      https://speakerdeck.com/jhellerstein

  25. What is Metadata?


  26. Data about data
    This used to be so simple!
    But … schema on use
    One of many changes
    What is Metadata?


  27. Opportunity: A Bigger Context
      Don’t just fill a metadata-sized hole in the big data stack.
      Lay the groundwork for rich data context.

  28. What is Data Context?
    All the information surrounding the use of data.


  29. Emerging Data Context Space


  30. Ground is Unopinionated
      Postel’s Law: Be conservative in what you do, be liberal in what you accept from others

  31. The ABCs of Data Context
    Generated by—and useful to—many applications and components.
    Application Context: Views, models, code
    Behavioral Context: Data lineage & usage
    Change over time: Version history


  32. Common Ground: A Metamodel
      A: Model Graphs
      B: Usage Graphs
      C: Version Graphs
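
To make the metamodel concrete, here is a minimal sketch of the ABC graphs in Python. The class and field names are illustrative stand-ins, not Ground’s actual API; the point is only the shape of the three graphs and how they link.

```python
# Illustrative sketch of the Common Ground "ABC" metamodel (names are ours,
# not Ground's API): model graph nodes (A), lineage/usage edges (B), and a
# version-history DAG (C).

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    """A: a vertex in the model graph -- a dataset, view, model, or code unit."""
    node_id: str
    tags: Dict[str, str] = field(default_factory=dict)

@dataclass
class NodeVersion:
    """C: an immutable snapshot of a Node; parent_ids form the version DAG."""
    version_id: str
    node_id: str
    parent_ids: List[str] = field(default_factory=list)

@dataclass
class LineageEdge:
    """B: records that one node version was used to derive another."""
    from_version: str
    to_version: str
    usage: str = ""

# Example: a dataset version feeding a trained-model version.
data_v1 = NodeVersion("taxi_data@v1", "taxi_data")
model_v1 = NodeVersion("fare_model@v1", "fare_model")
edge = LineageEdge(data_v1.version_id, model_v1.version_id, "training input")
```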


  33. GROUND ARCHITECTURE
      ABOVEGROUND API TO APPLICATIONS
      COMMON GROUND METAMODEL
      UNDERGROUND API TO SERVICES

  34. GROUND ARCHITECTURE
      ABOVEGROUND API TO APPLICATIONS: Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility
      COMMON GROUND METAMODEL
      UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth

  35. GROUND ARCHITECTURE
      ABOVEGROUND API TO APPLICATIONS: Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility
      COMMON GROUND METAMODEL
      UNDERGROUND API TO SERVICES: Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth

  36. Current Status
      Ground Server: Release v0.1.2 (Java Play + PostgreSQL)
      Grit: Ground over Git
      Bedrock: Elastic coordination-free cloud storage (“Anna” prototype, ICDE 2018)
      www.ground-context.org

  37. flor
      GROUNDING THE NEXT FLOWERING OF AI
      Rolando Garcia, Vikram Sreekanti, Dan Crankshaw, Neeraja Yadwadkar, Joseph Gonzalez, Joseph Hellerstein, Malhar Patel, Sona Jeswani, Eric Liu

  38. ML Lifecycle Management: A Context-Rich Application
      Empirical AI; ML Lifecycle; Flor: Lifecycle Mgmt; flor Demo
      Big Data Context: Beyond Metadata; The ABCs of Context; Common Ground Architecture
      Perspectives on Data Context
      Wrangling: Six years with Trifacta and Google Cloud Dataprep
      Context Services: The Common Ground model, and Ground system
      flor / ML Lifecycle: Managing the rise of Empirical AI

  39. ML Lifecycle Management: A Context-Rich Application
      Empirical AI; ML Lifecycle; Flor: Lifecycle Mgmt; flor Demo

  40. A Conversation-Starter
      Look, AI today is more like Experimental Science than Engineering! Or math!

  41. A Conversation-Starter
      OMG! Have you never read Herbert Simon? That is so 1995!

  42. AI is more Experimental Science than Engineering
      Not a new observation

  43. AI is more Experimental Science than Engineering
      Not a new observation
      But increasingly timely
      Overseen by tweakers

  44. The Fourth Paradigm of Science
      That was for “plain old” science!
      ML & AI generate combinatorially more experiments and data
      A Transformed Scientific Method
      “We have to do better at producing tools to support the whole research cycle—from data capture and data curation to data analysis and data visualization. Today, the tools for capturing data both at the mega-scale and at the milli-scale are just dreadful. After you have captured the data, you need to curate it before you can start doing any kind of data analysis, and we lack good tools for both data curation and data analysis. Then comes the publication of the results of your research, and the published literature is just the tip of the data iceberg. By this I mean that people collect a lot of data and then reduce this down to some number of column inches in Science or Nature—or 10 pages if it is a computer science person writing. So what I mean by data iceberg is that there is a lot of data that is collected but not curated or published in any systematic way. There are some exceptions, and I think that these cases are a good place for us to look for best practices. I will talk about how the whole process of peer review has got to change and the way in which I think it is changing and what CSTB can do to help all of us get access to our research.”
      - Jim Gray, 2007. Based on the transcript of a talk given by Jim Gray to the NRC-CSTB¹ in Mountain View, CA, on January 11, 2007². Edited by Tony Hey, Stewart Tansley, and Kristin Tolle, Microsoft Research.
      1 National Research Council, http://sites.nationalacademies.org/NRC/index.htm; Computer Science and Telecommunications Board, http://sites.nationalacademies.org/cstb/index.htm.
      2 This presentation is, poignantly, the last one posted to Jim’s Web page at Microsoft Research before he went missing at sea on January 28, 2007—http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt.

  45. We have to do better at producing tools to support the whole research cycle!

  46. ML applications are tightly tied to the ML lifecycle

  47. ML applications are tightly tied to the ML lifecycle

  48. ML applications are tightly tied to the ML lifecycle

  49. ML applications are tightly tied to the ML lifecycle

  50. PROBLEM
      The ML lifecycle is poorly tooled

  51. The ML lifecycle is poorly tooled
      • How we develop pipelines:
        • Undo/Redo in model design is lacking
        • Failure to detect poor methods
        • Version skew: easy for versions to diverge
      • How we use pipelines (their resources and products):
        • Difficult to predict how models will affect system behavior
        • Changes to data may not be tracked or recorded
        • No record of who uses which resources and why
        • Disorganized models are easy to lose and hard to find

  52. Version Skew: easy for versions to diverge

  53. Disorganized models are easy to lose and hard to find
      Models will likely be organized by an individual’s standards, but not by an organization’s standards.
      https://xkcd.com/1459/
      http://dilbert.com/strip/2011-04-23

  54. Failure to detect poor methods
      • Data dredging or p-hacking
      • Weak isolation of test data
      • Training on attributes that are unavailable at test time
      Nature, 25 May 2016
      https://xkcd.com/882/

  55. Goals for flor
      1. Enable safe and agile exploration of alternative model designs
      2. Passively track and sync the history and versions of a pipeline and its executions across multiple machines
      3. Answer questions about history and provenance, and retrieve artifacts from any version
      Approach: Build a system that leverages widely used tools in a principled manner.

  56. Flor lives Above Ground
      Unlike Ground, Flor is “opinionated”.
      Three basic subclasses of Node: Artifact, Literal, Action

  57. Flor lives Above Ground
      Unlike Ground, Flor is “opinionated”.
      Three basic subclasses of Node: Artifact, Literal, Action
      And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts

  58. Flor lives Above Ground
      Unlike Ground, Flor is “opinionated”.
      Three basic subclasses of Node: Artifact, Literal, Action
      And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts
      Versions of Artifacts/Literals/Edges: git hashes (e.g. #86a1c71bc, #cde3e1c51)

  59. Flor lives Above Ground
      Unlike Ground, Flor is “opinionated”.
      Three basic subclasses of Node: Artifact, Literal, Action
      And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts
      Versions of Artifacts/Literals/Edges: git hashes (e.g. #86a1c71bc, #cde3e1c51)
      Lineage Edges: Track ArtifactVersions generated in workflows
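
The vocabulary above maps directly onto a few small classes. This is a sketch of the model as described on the slide (Artifact/Literal/Action nodes, workflow edges, git-hash versions), with illustrative names rather than Flor’s real API.

```python
# Sketch of the Flor vocabulary described above -- illustrative classes,
# not Flor's actual API.

import hashlib
from typing import Callable, List

class Node:
    def __init__(self, name: str):
        self.name = name

class Artifact(Node):
    """A file the workflow consumes or produces (e.g. train.csv, model.pkl)."""
    def __init__(self, name: str, path: str):
        super().__init__(name)
        self.path = path

class Literal(Node):
    """An in-memory parameter (e.g. num_estimators)."""
    def __init__(self, name: str, value):
        super().__init__(name)
        self.value = value

class Action(Node):
    """A step: consumes Artifacts/Literals, produces Artifacts."""
    def __init__(self, name: str, func: Callable,
                 inputs: List[Node], outputs: List[Artifact]):
        super().__init__(name)
        self.func, self.inputs, self.outputs = func, inputs, outputs

def version_hash(content: bytes) -> str:
    """Git-style content hash identifying a version, like #86a1c71bc above."""
    return "#" + hashlib.sha1(content).hexdigest()[:9]
```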


  60. USE CASE:
      “I have an existing pipeline, how can Flor help me?”

  61. Taxi.ipynb: run_existing_pipeline
      Inputs: train.csv
      Outputs: model.pkl, score.txt, rmse.txt
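
Written out as code, the DAG on this slide is a single Action with one input and three outputs. This is a standalone sketch with illustrative types (not Flor’s API); the names mirror the slide.

```python
# The slide's DAG, written out as a standalone sketch (illustrative types,
# not Flor's API): one Action reads train.csv and emits three Artifacts.

from collections import namedtuple

Artifact = namedtuple("Artifact", ["name"])
Action = namedtuple("Action", ["name", "inputs", "outputs"])

run_existing_pipeline = Action(
    name="run_existing_pipeline",  # the entry point in Taxi.ipynb
    inputs=[Artifact("train.csv")],
    outputs=[Artifact("model.pkl"), Artifact("score.txt"), Artifact("rmse.txt")],
)
```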


  62. When the pipeline executes...
      • Data Versioning: All artifacts are versioned in Git, and associated with their respective experiments. New run = New commit.
      • Metadata Versioning: Git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits.
      • Provenance: The provenance relationships between objects (artifacts or otherwise) are recorded in Ground.
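
“New run = New commit” can be pictured with plain Git. The helper below is a sketch of that idea, assuming the artifact directory is already a Git repository with changes to commit; it illustrates the behavior, not Flor’s internal mechanism.

```python
# Sketch of "new run = new commit": snapshot all artifacts after a pipeline
# run. Assumes repo_dir is already a Git repository with changes to commit.

import subprocess

def snapshot_run(repo_dir: str, message: str) -> str:
    """Commit everything under repo_dir and return the new commit hash."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    head = subprocess.run(["git", "rev-parse", "HEAD"], cwd=repo_dir,
                          check=True, capture_output=True, text=True)
    return head.stdout.strip()

# e.g. commit_id = snapshot_run(".", "run: Taxi.ipynb run_existing_pipeline")
```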


  63. Because we version and record data context...
      • Materialize any artifact, in context. Know which artifact to materialize.
      • Replay all previous experiments, with new data.
      • [Opportunity] Sync local and remote versions of the pipeline; run the pipeline anywhere.

  64. USE CASE:
      “Ok, how can Flor help me refine my pipeline?”

  65. 3x
      Taxi.ipynb: run_existing_pipeline
      Inputs: train.csv, num_estimators
      Outputs: model.pkl, score.txt, rmse.txt

  66. When the pipeline executes...
      • Data Versioning: All artifacts are versioned in Git, and associated with their respective experiments.
      • Metadata Versioning: Git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits.
      • Provenance: The provenance relationships between objects (artifacts or otherwise) are recorded in Ground.
      • Parallel multi-trial experiments. Our example (3x): num_est=15, num_est=20, num_est=30.
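
The 3x sweep amounts to a loop over the Literal’s values, with each trial recorded as its own run. In this sketch, train_and_score is a hypothetical stand-in for the notebook’s training code.

```python
# Sketch of the 3x multi-trial experiment: one Literal (num_estimators)
# takes three values; each trial is scored and would be versioned as its
# own run. train_and_score is a stand-in for the Taxi notebook's code.

import random

def train_and_score(num_estimators: int) -> float:
    # Stand-in: pretend RMSE improves (noisily) with more estimators.
    return 5.0 / num_estimators + random.random() * 0.01

results = {n: train_and_score(n) for n in (15, 20, 30)}  # the Literal's values
print(results)
```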


  67. Because we declare and track a literal...
      • Materialize any artifact, in richer context. Know which artifact to materialize.
      • Replay all previous experiments, with new data.
      • [Opportunity] Sync local and remote versions of the pipeline; run the pipeline anywhere.
      • [Opportunity] Scripting: set the literal from the command line or externally.

  68. USE CASE:
      “I’ll build my next pipeline with Flor from the start.”

  69. 3x
      Taxi.ipynb stages (in order): dataframize, calculate_distance, preproc, split, train, test
      Inputs: train.csv, num_estimators
      Intermediates: train_df.pkl, train_dist_df.pkl, train_ready.pkl, xTrain.pkl, xTest.pkl, yTrain.pkl, yTest.pkl
      Outputs: model.pkl, score.txt, rmse.txt

  70. When the pipeline executes...
      • Versioning: All artifacts are versioned in Git, and associated with their respective experiments. New run = New commit.
      • Provenance: The relationships between objects, artifacts or otherwise, are recorded in Ground.
      • Parallel multi-trial experiments.
      • Trial-invariant artifacts don’t have to be recomputed.
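
Why trial-invariant artifacts need no recomputation: if a step’s name and inputs hash to a key seen before, the cached output can be reused. The in-memory cache below is a sketch of that idea; a lifecycle system would consult its version store instead.

```python
# Sketch of skipping trial-invariant work: steps whose (name, inputs) hash
# to a known key reuse the cached result. An in-memory stand-in for what a
# lifecycle system would do against its version store.

import hashlib
import pickle

_cache = {}

def cached_step(step_name, func, *inputs):
    key = hashlib.sha1(pickle.dumps((step_name, inputs))).hexdigest()
    if key not in _cache:
        _cache[key] = func(*inputs)  # recompute only on a cache miss
    return _cache[key]

# preproc(train.csv) is identical across all three trials, so it runs once;
# only the train step re-executes for each num_estimators value.
```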


  71. Because we built a pipeline with Flor...
      • Materialize any artifact, in richer context. Know which artifact to materialize.
      • Replay all previous experiments, with new data.
      • Share resources, with the corresponding changes.
      • Swap components.
      • Maintain the pipeline.
      • [Opportunity] Inter-version Parallelism
      • [Opportunity] Undo/Redo

  72. We automatically track all the metadata, context, and lineage with flor:
      ● Timestamps
      ● Which resources your experiment used
      ● How many trials your experiment ran
      ● What the configuration was per trial
      ● The evolution of your experiment over time (versions)
      ● The lineage that derived any artifact in the workflow
      ● The metadata you need to retrieve a physical copy of any artifact in the workflow, ever
      ● The current state of your experiment in the file system, in context
      ● Whether you’ve forked any experiment resources, and which ones
      ● When you executed an experiment, and whether you executed it to completion or only partially
      ● Whether you’ve peeked at intermediary results during interactive pipeline development, and what you did in Flor after you learned this information
      ● Whether you peek at the same result multiple times, or each time peek at a different trial and see a different result
      ● The location of the peeked artifacts, so they may be re-used in future computations without repeating work
      ● Whether two specifications belonging to the same experiment used the same or different resources, and whether they derived the same artifacts
      ● Whether any resource or artifact was renamed
      ● ….

  73. CONCLUSION

  74. Perspectives on Data Context
      Wrangling: Six years with Trifacta and Google Cloud Dataprep
      Context Services: The Common Ground model, and Ground system
      flor / ML Lifecycle: Managing the rise of Empirical AI

  75. Opportunity At All Levels
      Wrangling Context Services
      METAMODEL
      flor / ML Lifecycle

  76. Opportunity At All Levels
    Application-Specific Context Generation and Mining
    Data Context Modeling
    Systems Infrastructure for Data Context Management


  77. We have to do better at producing tools to support the whole research cycle!
      One of the most high-impact (and fun!) topics in CS today.

  78. Context at Berkeley
      http://ground-context.org
      flor: http://github.com/ucbrise/jarvis
      Joe Hellerstein, UC Berkeley / Trifacta
      [email protected]