Slide 1

Slide 1 text

Data Context In the Field and in the Lab. JOE HELLERSTEIN, UC BERKELEY. 1st Workshop on Context in Analytics, Paris, April 2018

Slide 2

Slide 2 text

2 Perspectives on Data Context: Wrangling (six years with Trifacta and Google Cloud Dataprep); Context Services (the Common Ground model and Ground system); flor / ML Lifecycle (managing the rise of Empirical AI).

Slide 3

Slide 3 text

3 An Overview for Academics April 24, 2017

Slide 4

Slide 4 text

4 Data Preparation: A Key Source of Data Context. Shifting Market: WHO, WHAT, WHERE. Problem Relevance: 80%. Data Prep Market. Use Cases. Platform Challenges: On-Prem Data Agents, ADLS, Data Security & Access Controls, Transparent Data Lineage, Data Catalog Integration. Analytics, Data Science, Automation. Perspectives on Data Context: Wrangling (six years with Trifacta and Google Cloud Dataprep); Context Services (the Common Ground model and Ground system); flor / ML Lifecycle (managing the rise of Empirical AI).

Slide 5

Slide 5 text

Data Preparation: A Key Source of Data Context. Shifting Market: WHO, WHAT, WHERE. Problem Relevance: 80%. Data Prep Market. Use Cases. Platform Challenges: On-Prem Data Agents, ADLS, Data Security & Access Controls, Transparent Data Lineage, Data Catalog Integration. Analytics, Data Science, Automation.

Slide 6

Slide 6 text

1990’s: IT Governance In the Data Warehouse “There is no point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse, 2005

Slide 7

Slide 7 text

2018: Business Value From A Quantified World “Get into the mindset to collect and measure everything you can.” — DJ Patil, Building Data Science Teams, 2011

Slide 8

Slide 8 text

“Get into the mindset to collect and measure everything you can.” — DJ Patil, Building Data Science Teams, 2011 “There is no point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse, 2005 2018: Business Value From A Quantified World

Slide 9

Slide 9 text

7 The World is Changing. WHO (Power Shift): End-User Self-Service. WHAT (Data Shift): Big Data Analytics, Empirical AI. WHERE (Platform Shift): Open Source, Cloud.

Slide 10

Slide 10 text

Whitespace: Data Wrangling, between DATA PLATFORMS and ANALYSIS & CONSUMPTION. 80%: "It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data." — DJ Patil, Data Jujitsu, O’Reilly Media 2012

Slide 11

Slide 11 text

Research Roots: Open Source Data Wrangler, 2011. + Predictive Interaction + Immediate feedback on multiple choices + Data quality indicators. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. CHI 2011. "Proactive Wrangling: Mixed-Initiative End-User Programming of Data Transformation Scripts." P.J. Guo, S. Kandel, J.M. Hellerstein, J. Heer. UIST 2011. "Enterprise Data Analysis and Visualization: An Interview Study." S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. IEEE VAST 2012. "Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment." S. Kandel, et al. AVI 2012. "Predictive Interaction for Data Transformation." J. Heer, J.M. Hellerstein, S. Kandel. CIDR 2015.

Slide 12

Slide 12 text

12 The State of the Data Prep Market, 2018. Market category well established: Forrester did its first "Wave" ranking in 2017; Gartner now estimates a > $1 billion market for Data Prep by 2021. "Trifacta delivers a strong balance for self-service by analysts and business users. Customer references gave high marks to Trifacta’s ease of use. Trifacta leverages machine learning algorithms to automate and simplify the interaction with data."

Slide 13

Slide 13 text

13 Data Wrangling Standard Across Industry Leaders: Financial Services, Insurance, Healthcare and Pharmaceuticals, Retail and Consumer Goods, Government Agencies

Slide 14

Slide 14 text

A DATA WRANGLING PLATFORM 14 On-Prem Data Agents ADLS DATA SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION

Slide 15

Slide 15 text

A DATA WRANGLING PLATFORM 15 On-Prem Data Agents ADLS DATA SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION Millions of lines of recipes Multi-terabyte flows 24x7 Elasticity & Scalability

Slide 16

Slide 16 text

Use Case

Slide 17

Slide 17 text

CDC AIDS Intervention in Indiana

Slide 18

Slide 18 text

The CDC reduced data preparation from 3 months to 3 days with Trifacta. Benefits: • Refuted an assumption in the analysis that could not have been tested without enriching the datasets • Expects to scale this model to other, similar outbreaks, such as Zika or Ebola. In future, we need to combine "a variety of sources to identify jurisdictions that, like this county in Indiana, may be at risk of an IDU-related HIV outbreak. These data include drug arrest records, overdose deaths, opioid sales and prescriptions, availability of insurance, emergency medical services, and social and demographic data." (CDC, "The Anatomy of an HIV Outbreak Response in a Rural Community," https://blogs.cdc.gov/publichealthmatters/2015/06/the-anatomy-of-an-hiv-outbreak-response-in-a-rural-community/.) E. M. Campbell, H. Jia, A. Shankar, et al. "Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States." Journal of Infectious Diseases, 216(9), 27 November 2017, 1053–1062. https://academic.oup.com/jid/article/216/9/1053/4347235

Slide 19

Slide 19 text

A DATA WRANGLING PLATFORM 19 On-Prem Data Agents ADLS DATA SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION

Slide 20

Slide 20 text

A DATA WRANGLING PLATFORM 20 On-Prem Data Agents ADLS DATA SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION ANALYTICS DATA SCIENCE AUTOMATION

Slide 21

Slide 21 text

Common ground? • Burgeoning SW market • n² connections? • Common formats must emerge • Need a shared place to Write it down, Link it up • Critical to market health!

Slide 22

Slide 22 text

22 ground A DATA CONTEXT SERVICE Joseph M. Hellerstein Sean Lobo Nipun Ramakrishnan Avinash Arjavalingam Vikram Sreekanti

Slide 23

Slide 23 text

23 Beyond Metadata. Ground architecture: Aboveground API to applications (Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility); Common Ground metamodel; Underground API to services (Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth). The ABCs of Context: Application Context (views, models, code); Behavioral Context (data lineage & usage); Change over time (version history); that is, A: Model Graphs, B: Usage Graphs, C: Version Graphs. Awareness: Community Health, Metadata & Data Management, Data Analysis, Data Wrangling. Perspectives on Data Context: Wrangling; Context Services; flor / ML Lifecycle.

Slide 24

Slide 24 text

Vendor-Neutral, Unopinionated Data Context Services: Beyond Metadata. Ground architecture: Aboveground API to applications (Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility); Common Ground metamodel; Underground API to services (Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth). The ABCs of Context: Application Context (views, models, code); Behavioral Context (data lineage & usage); Change over time (version history); A: Model Graphs, B: Usage Graphs, C: Version Graphs. Awareness: Community Health, Metadata & Data Management, Data Analysis, Data Wrangling.

Slide 25

Slide 25 text

25 A Recurring Conversation with the Big Data Community. "Isn’t this just metadata?" Metadata: the last thing anybody wants to work on. (Community Health, Metadata & Data Management, Data Analysis, Data Wrangling.) Earlier talks: Time to Go Meta (on Use), Strata New York 2015; Grounding Big Data, Strata San Jose 2016; Data Relativism, Strata London Keynote 2016. https://speakerdeck.com/jhellerstein

Slide 26

Slide 26 text

What is Metadata?

Slide 27

Slide 27 text

What is Metadata? Data about data. This used to be so simple! But … schema on use is one of many changes.

Slide 28

Slide 28 text

Opportunity: A Bigger Context. Don’t just fill a metadata-sized hole in the big data stack; lay the groundwork for rich data context.

Slide 29

Slide 29 text

What is Data Context? All the information surrounding the use of data.

Slide 30

Slide 30 text

Emerging Data Context Space

Slide 31

Slide 31 text

Ground is Unopinionated. "Be conservative in what you do, be liberal in what you accept from others." (Postel’s Law)

Slide 32

Slide 32 text

The ABCs of Data Context Generated by—and useful to—many applications and components. Application Context: Views, models, code Behavioral Context: Data lineage & usage Change over time: Version history

Slide 33

Slide 33 text

Common Ground: A Metamodel. A: Model Graphs; B: Usage Graphs; C: Version Graphs.
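A minimal sketch of how the three graph families could be represented, using Python dataclasses; the class and field names are illustrative paraphrases of the Common Ground idea, not Ground's actual API:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A: Model graph -- nodes describe application entities (tables, models, code).
@dataclass
class Node:
    name: str                      # e.g. "taxi_training_data"
    tags: Dict[str, str] = field(default_factory=dict)

# C: Version graph -- each node accumulates immutable versions over time,
# linked to their parent versions.
@dataclass
class NodeVersion:
    node: Node
    version_id: str                # e.g. a git hash or UUID
    parents: List["NodeVersion"] = field(default_factory=list)
    tags: Dict[str, str] = field(default_factory=dict)

# A: Edges relate node versions to one another (schema links, containment, ...).
@dataclass
class Edge:
    name: str
    source: NodeVersion
    target: NodeVersion

# B: Usage graph -- lineage edges record that one version was derived from
# another by some process (a query, a pipeline run, a wrangling recipe).
@dataclass
class LineageEdge:
    name: str                      # e.g. "produced_by_training_run"
    source: NodeVersion            # input version
    target: NodeVersion            # derived version
```

In this sketch the model graph (A) says what the entities are, lineage edges (B) say how versions were used to derive one another, and the parent pointers inside each NodeVersion (C) say how entities change over time.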

Slide 34

Slide 34 text

34 GROUND ARCHITECTURE: Aboveground API to applications; Common Ground metamodel; Underground API to services.

Slide 35

Slide 35 text

GROUND ARCHITECTURE. Aboveground API to applications: Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility. Common Ground metamodel. Underground API to services: Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth.

Slide 36

Slide 36 text

GROUND ARCHITECTURE. Aboveground API to applications: Model Serving, Model Debugging, Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Data Quality, Reproducibility. Common Ground metamodel. Underground API to services: Scavenging and Ingestion, Search & Query, Versioned Storage, ID & Auth.

Slide 37

Slide 37 text

Current Status. Ground Server: release v0.1.2 (Java Play + PostgreSQL). Grit: Ground over Git. Bedrock: elastic, coordination-free cloud storage; "Anna" prototype, ICDE 2018. www.ground-context.org

Slide 38

Slide 38 text

38 flor GROUNDING THE NEXT FLOWERING OF AI Rolando Garcia Vikram Sreekanti Dan Crankshaw Neeraja Yadwadkar Joseph Gonzalez Joseph Hellerstein Malhar Patel Sona Jeswani Eric Liu

Slide 39

Slide 39 text

39 ML Lifecycle Management: A Context-Rich Application. Empirical AI; the ML Lifecycle; Flor: Lifecycle Management; flor Demo. Recap: Big Data Context, Beyond Metadata, the ABCs of Context, flor, and the Common Ground architecture (Aboveground API to applications; Common Ground metamodel; Underground API to services). Perspectives on Data Context: Wrangling; Context Services; flor / ML Lifecycle.

Slide 40

Slide 40 text

ML Lifecycle Management: A Context-Rich Application. Empirical AI; the ML Lifecycle; Flor: Lifecycle Management; flor Demo.

Slide 41

Slide 41 text

41 A Conversation-Starter Look, AI today is more like Experimental Science than Engineering! Or math!

Slide 42

Slide 42 text

42 A Conversation-Starter OMG! Have you never read Herbert Simon? That is so 1995!

Slide 43

Slide 43 text

43 AI is more Experimental Science than Engineering Not a new observation

Slide 44

Slide 44 text

44 AI is more Experimental Science than Engineering Not a new observation But increasingly timely Overseen by tweakers

Slide 45

Slide 45 text

45 The Fourth Paradigm of Science. That was for "plain old" science! ML & AI generate combinatorially more experiments and data. [Excerpt shown from Jim Gray, "A Transformed Scientific Method," based on a talk to the NRC-CSTB, January 11, 2007; in The Fourth Paradigm, edited by Tony Hey, Stewart Tansley, and Kristin Tolle, Microsoft Research: "We have to do better at producing tools to support the whole research cycle—from data capture and data curation to data analysis and data visualization."]

Slide 46

Slide 46 text

46 We have to do better at producing tools to support the whole research cycle!

Slide 47

Slide 47 text

47 ML applications are tightly tied to the ML lifecycle

Slide 48

Slide 48 text

48 ML applications are tightly tied to the ML lifecycle

Slide 49

Slide 49 text

49 ML applications are tightly tied to the ML lifecycle

Slide 50

Slide 50 text

50 ML applications are tightly tied to the ML lifecycle

Slide 51

Slide 51 text

51 PROBLEM The ML lifecycle is poorly tooled

Slide 52

Slide 52 text

52 • How we develop pipelines • Undo/Redo in model design is lacking • Failure to detect poor methods • Version skew: easy for versions to diverge • How we use pipelines: their resources and products • Difficult to predict how models will affect system behavior • Changes to data may not be tracked or recorded • No record of who uses which resources and why • Disorganized models are easy to lose and hard to find The ML lifecycle is poorly tooled

Slide 53

Slide 53 text

53 Version Skew: easy for versions to diverge

Slide 54

Slide 54 text

54 Disorganized models are easy to lose and hard to find. Models will likely be organized by an individual’s standard, but not by an organization’s standards. https://xkcd.com/1459/ http://dilbert.com/strip/2011-04-23

Slide 55

Slide 55 text

55 Failure to detect poor methods • Data dredging or P-hacking • Weak isolation of test data • Training on attributes that are unknown during testing time Nature, 25 May 2016 https://xkcd.com/882/

Slide 56

Slide 56 text

56 Goals for flor 1. Enable safe and agile exploration of alternative model designs 2. Passively track and sync the history and versions of a pipeline and its executions across multiple machines 3. Answer questions about the history and provenance, and procure artifacts from the versions • Approach: • Build a system to leverage widely used tools in a principled manner.

Slide 57

Slide 57 text

57 Flor lives Above Ground • Unlike Ground, Flor is "opinionated". • Three basic subclasses of Node: Artifact, Literal, Action

Slide 58

Slide 58 text

58 Flor lives Above Ground • Unlike Ground, Flor is "opinionated". • Three basic subclasses of Node: Artifact, Literal, Action • And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts

Slide 59

Slide 59 text

59 Flor lives Above Ground • Unlike Ground, Flor is "opinionated". • Three basic subclasses of Node: Artifact, Literal, Action • And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts • Versions of Artifacts/Literals/Edges • git hashes (e.g., #86a1c71b, #cde3e1c5)

Slide 60

Slide 60 text

60 Flor lives Above Ground • Unlike Ground, Flor is "opinionated". • Three basic subclasses of Node: Artifact, Literal, Action • And Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts • Versions of Artifacts/Literals/Edges • git hashes (e.g., #86a1c71b, #cde3e1c5) • Lineage Edges: track ArtifactVersions generated in workflows
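The slide's vocabulary can be sketched as a tiny Python object model; the class names follow the slide, but the constructors and the run helper below are illustrative assumptions, not Flor's real interface:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class Artifact:
    path: str                          # a file tracked in git, e.g. "train.csv"

@dataclass
class Literal:
    name: str
    value: object                      # e.g. num_estimators = 20

@dataclass
class Action:
    name: str
    func: Callable[..., None]          # code that consumes inputs and writes outputs
    inputs: List[Union[Artifact, Literal]] = field(default_factory=list)
    outputs: List[Artifact] = field(default_factory=list)

def run(action: Action) -> None:
    """Execute an action; a fuller system would also commit output files to git
    and record lineage edges from input versions to output versions."""
    action.func(*action.inputs)

# Hypothetical wiring for the taxi example on the following slides:
# train = Artifact("train.csv")
# n_est = Literal("num_estimators", 20)
# model = Artifact("model.pkl")
# fit = Action("run_existing_pipeline", my_training_fn,
#              inputs=[train, n_est], outputs=[model])
```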

Slide 61

Slide 61 text

61 USE CASE: “I have an existing pipeline, how can Flor help me?”

Slide 62

Slide 62 text

62

Slide 63

Slide 63 text

63 Taxi.ipynb: the run_existing_pipeline action consumes train.csv and produces model.pkl, score.txt, and rmse.txt.

Slide 64

Slide 64 text

64 When the pipeline executes... • Data Versioning: all artifacts are versioned in Git and associated with their respective experiments • New run = new commit • Metadata versioning: git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits • Provenance: the provenance relationships between objects (artifacts or otherwise) are recorded in Ground
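A minimal sketch of the "new run = new commit" idea using plain git via subprocess; this illustrates the mechanism, not how Flor or Ground actually implement it, and the commit message is hypothetical:

```python
import subprocess

def commit_run(message: str) -> str:
    """Snapshot every artifact in the working tree as one commit per run, and
    return the commit hash so it can be recorded as an ArtifactVersion."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         check=True, capture_output=True, text=True).stdout.strip()
    return sha

# e.g. after executing run_existing_pipeline on train.csv:
# version = commit_run("experiment: taxi pipeline, run 7")
```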

Slide 65

Slide 65 text

65 Because we version and record data context... • Materialize any artifact, in context • Know which artifact to materialize • Replay all previous experiments, with new data • [Opportunity] Sync local and remote versions of the pipeline, run the pipeline anywhere

Slide 66

Slide 66 text

66 USE CASE: "OK, how can Flor help me refine my pipeline?"

Slide 67

Slide 67 text

67

Slide 68

Slide 68 text

68 (3x) Taxi.ipynb: the run_existing_pipeline action consumes train.csv and the literal num_estimators, and produces model.pkl, score.txt, and rmse.txt; the 3x indicates three parallel trials.

Slide 69

Slide 69 text

69 When the pipeline executes... • Data Versioning: all artifacts are versioned in Git and associated with their respective experiments • Metadata versioning: git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits • Provenance: the provenance relationships between objects (artifacts or otherwise) are recorded in Ground • Parallel multi-trial experiments • Our example (3x): num_est=15, num_est=20, num_est=30.
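One way the three-trial sweep over num_estimators could look if written by hand with scikit-learn and a process pool; the file names, the fare_amount target column, and the per-trial artifact naming are assumptions for illustration, not Flor's behavior:

```python
from concurrent.futures import ProcessPoolExecutor

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def run_trial(num_estimators: int) -> str:
    """Train one trial and write its artifacts under a per-trial name,
    so every setting of the literal maps to its own versioned outputs."""
    df = pd.read_csv("train.csv")
    X, y = df.drop(columns=["fare_amount"]), df["fare_amount"]  # hypothetical target
    model = RandomForestRegressor(n_estimators=num_estimators).fit(X, y)
    out = f"model_est{num_estimators}.pkl"
    joblib.dump(model, out)
    return out

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        artifacts = list(pool.map(run_trial, [15, 20, 30]))  # the three literal settings
    print(artifacts)
```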

Slide 70

Slide 70 text

70 Because we declare and track a literal... • Materialize any artifact, in richer context • Know which artifact to materialize • Replay all previous experiments, with new data • [Opportunity] Sync local and remote versions of the pipeline, run the pipeline anywhere • [Opportunity] Scripting: set the literal from the command line or externally

Slide 71

Slide 71 text

71 USE CASE: “I’ll build my next pipeline with Flor from the start.”

Slide 72

Slide 72 text

72

Slide 73

Slide 73 text

73

Slide 74

Slide 74 text

74 (3x) Taxi.ipynb decomposed into actions: dataframize, calculate_distance, preproc, split, train, test. Inputs: train.csv, num_estimators. Intermediate artifacts: train_df.pkl, train_dist_df.pkl, train_ready.pkl, xTrain.pkl, xTest.pkl, yTrain.pkl, yTest.pkl. Outputs: model.pkl, score.txt, rmse.txt.

Slide 75

Slide 75 text

75 When the pipeline executes... • Versioning: all artifacts are versioned in Git and associated with their respective experiments • New run = new commit • Provenance: the relationships between objects, artifacts or otherwise, are recorded in Ground • Parallel multi-trial experiments • Trial-invariant artifacts don't have to be recomputed
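A sketch of why trial-invariant artifacts need not be recomputed: key each action's output by a hash of its inputs and the literals it actually depends on, and reuse the cached copy when the key repeats across trials. The cache directory and helper names here are hypothetical, not Flor's implementation:

```python
import hashlib
import os
import shutil

CACHE_DIR = ".flor_cache"   # illustrative cache location

def cache_key(action_name: str, input_paths: list, literals: dict) -> str:
    """Hash the action name, the bytes of every input artifact, and the literal
    settings that this action actually depends on."""
    h = hashlib.sha256(action_name.encode())
    for p in sorted(input_paths):
        with open(p, "rb") as f:
            h.update(f.read())
    h.update(repr(sorted(literals.items())).encode())
    return h.hexdigest()

def run_cached(action_name, func, input_paths, literals, output_path):
    """Run func only if no earlier trial produced this output from identical inputs."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, cache_key(action_name, input_paths, literals))
    if os.path.exists(cached):
        shutil.copy(cached, output_path)   # reuse the trial-invariant artifact
    else:
        func(*input_paths)                 # compute it once...
        shutil.copy(output_path, cached)   # ...then memoize for later trials
    return output_path
```

Because preprocessing steps like dataframize and calculate_distance do not depend on num_estimators, their keys are identical across the three trials, so they run once and are reused.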

Slide 76

Slide 76 text

76 Because we built a pipeline with Flor... • Materialize any artifact, in richer context • Know which artifact to materialize • Replay all previous experiments, with new data • Share resources, with the corresponding changes • Swap components • Maintain the pipeline • [Opportunity] Inter-version Parallelism • [Opportunity] Undo/Redo

Slide 77

Slide 77 text

We automatically track all the metadata, context, and lineage, including: ● Timestamps ● Which resources your experiment used ● How many trials your experiment ran ● What the configuration was per trial ● The evolution of your experiment over time (versions) ● The lineage that derived any artifact in the workflow ● The metadata you need to retrieve a physical copy of any artifact in the workflow, ever ● The current state of your experiment in the file system, in context ● Whether you’ve forked any experiment resources, and which ones ● When you executed an experiment, and whether you executed it to completion or only partially ● Whether you’ve peeked at intermediary results during interactive pipeline development, and what you did in Flor after you learned this information ● Whether you peek at the same result multiple times, or each time peek at a different trial and see a different result ● The location of the peeked artifacts so they may be re-used in future computations without repeating work ● Whether two specifications belonging to the same experiment used the same or different resources, and whether they derived the same artifacts ● Whether any resource or artifact was renamed ● ….

Slide 78

Slide 78 text

78 CONCLUSION

Slide 79

Slide 79 text

79 Perspectives on Data Context: Wrangling (six years with Trifacta and Google Cloud Dataprep); Context Services (the Common Ground model and Ground system); flor / ML Lifecycle (managing the rise of Empirical AI).

Slide 80

Slide 80 text

Opportunity At All Levels: Wrangling; Context Services (metamodel); flor / ML Lifecycle.

Slide 81

Slide 81 text

Opportunity At All Levels: Application-Specific Context Generation and Mining; Data Context Modeling; Systems Infrastructure for Data Context Management.

Slide 82

Slide 82 text

82 We have to do better at producing tools to support the whole research cycle! One of the most high-impact (and fun!) topics in CS today.

Slide 83

Slide 83 text

83 Context at Berkeley. Ground: http://ground-context.org. flor: http://github.com/ucbrise/jarvis. Joe Hellerstein, UC Berkeley / Trifacta, [email protected]