
Data Context 
In the Field and in the Lab


Keynote talk, 1st Workshop on Context in Analytics, IEEE International Conference on Data Engineering (ICDE) 2018.

Joe Hellerstein

April 16, 2018


Transcript

  1. Data Context: In the Field and in the Lab. Joe Hellerstein, UC Berkeley. 1st Workshop on Context in Analytics, Paris, April 2018.
  2. Perspectives on Data Context. Wrangling: six years with Trifacta and Google Cloud Dataprep. Context Services: the Common Ground model and the Ground system. flor / ML Lifecycle: managing the rise of Empirical AI.
  3. Data Preparation: A key source of data context. Shifting Market (WHO, WHAT, WHERE); Problem Relevance (80%); Data Prep Market; Use Cases; Platform Challenges (On-Prem Data Agents, ADLS, Data Security & Access Controls, Transparent Data Lineage, Data Catalog Integration); Analytics, Data Science, Automation.
  4. Data Preparation: A key source of data context. Shifting Market (WHO, WHAT, WHERE); Problem Relevance (80%); Data Prep Market; Use Cases; Platform Challenges (On-Prem Data Agents, ADLS, Data Security & Access Controls, Transparent Data Lineage, Data Catalog Integration); Analytics, Data Science, Automation.
  5. 1990’s: IT Governance In the Data Warehouse “There is no

    point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse, 2005
  6. 2018: Business Value From A Quantified World “Get into the

    mindset to collect and measure everything you can.” — DJ Patil, Building Data Science Teams, 2011
  7. “Get into the mindset to collect and measure everything you

    can.” — DJ Patil, Building Data Science Teams, 2011 “There is no point in bringing data … into the data warehouse environment without integrating it.” — Bill Inmon, Building the Data Warehouse, 2005 2018: Business Value From A Quantified World
  8. The World is Changing. WHO (Power Shift): End-User Self-Service. WHAT (Data Shift): Big Data Analytics, Empirical AI. WHERE (Platform Shift): Open Source, Cloud.
  9. Whitespace: Data Wrangling, the 80% between data platforms and analysis & consumption. "It's impossible to overstress this: 80% of the work in any data project is in cleaning the data." — DJ Patil, Data Jujitsu, O'Reilly Media 2012
  10. Research Roots: Open Source Data Wrangler, 2011. + Predictive Interaction + Immediate feedback on multiple choices + Data quality indicators. "Wrangler: Interactive Visual Specification of Data Transformation Scripts." S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. CHI 2011. "Proactive Wrangling: Mixed-Initiative End-User Programming of Data Transformation Scripts." P.J. Guo, S. Kandel, J.M. Hellerstein, J. Heer. UIST 2011. "Enterprise Data Analysis and Visualization: An Interview Study." S. Kandel, A. Paepcke, J.M. Hellerstein, J. Heer. IEEE VAST 2012. "Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment." S. Kandel, et al. AVI 2012. "Predictive Interaction for Data Transformation." J. Heer, J.M. Hellerstein, S. Kandel. CIDR 2015.
  11. The State of the Data Prep Market, 2018. Market category well established. Forrester did its first "Wave" ranking in 2017. Gartner now estimates a > $1 billion market for data prep by 2021. "Trifacta delivers a strong balance for self-service by analysts and business users. Customer references gave high marks to Trifacta's ease of use. Trifacta leverages machine learning algorithms to automate and simplify the interaction with data."
  12. Data Wrangling Standard Across Industry Leaders: Financial Services, Insurance, Healthcare and Pharmaceuticals, Retail and Consumer Goods, Government Agencies.
  13. A DATA WRANGLING PLATFORM 14 On-Prem Data Agents ADLS DATA

    SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION
  14. A DATA WRANGLING PLATFORM 15 On-Prem Data Agents ADLS DATA

    SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION Millions of lines of recipes Multi-terabyte flows 24x7 Elasticity & Scalability
  15. The CDC reduced data preparation from 3 months to 3 days with Trifacta. Benefits: refuted an assumption in analysis that would not have been possible without enriching datasets; expects to scale this model to other, similar outbreaks, such as Zika or Ebola. In future, we need to combine "a variety of sources to identify jurisdictions that, like this county in Indiana, may be at risk of an IDU-related HIV outbreak. These data include drug arrest records, overdose deaths, opioid sales and prescriptions, availability of insurance, emergency medical services, and social and demographic data." - CDC, "The Anatomy of an HIV Outbreak Response in a Rural Community." E.M. Campbell, H. Jia, A. Shankar, et al. "Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States." Journal of Infectious Diseases, 216(9), 27 November 2017, 1053–1062. https://academic.oup.com/jid/article/216/9/1053/4347235 https://blogs.cdc.gov/publichealthmatters/2015/06/the-anatomy-of-an-hiv-outbreak-response-in-a-rural-community/
  16. A DATA WRANGLING PLATFORM 19 On-Prem Data Agents ADLS DATA

    SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION
  17. A DATA WRANGLING PLATFORM 20 On-Prem Data Agents ADLS DATA

    SECURITY & ACCESS CONTROLS TRANSPARENT DATA LINEAGE DATA CATALOG INTEGRATION ANALYTICS DATA SCIENCE AUTOMATION
  18. Common ground? • Burgeoning SW market • n² connections? • Common formats must emerge • Need a shared place to write it down, link it up • Critical to market health!
  19. 22 ground A DATA CONTEXT SERVICE Joseph M. Hellerstein Sean

    Lobo Nipun Ramakrishnan Avinash Arjavalingam Vikram Sreekanti
  20. Context Services (section overview): Beyond Metadata; the ABCs of Context (A, Application Context: views, models, code; B, Behavioral Context: data lineage & usage; C, Change over time: version history); Awareness across community health, metadata & data management, data analysis, and data wrangling; and the Common Ground architecture (aboveground API to applications, underground API to services, metamodel).
  21. Vendor-Neutral, Unopinionated Data Context Services: Beyond Metadata; the ABCs of Context (Application Context: views, models, code; Behavioral Context: data lineage & usage; Change over time: version history); Awareness; the Common Ground architecture.
  22. A Recurring Conversation with the Big Data Community. "Metadata: the last thing anybody wants to work on." "Isn't this just metadata?" Awareness across community health, metadata & data management, data analysis, data wrangling. Prior talks: Time to Go Meta (on Use), Strata New York 2015; Grounding Big Data, Strata San Jose 2016; Data Relativism, Strata London keynote 2016. https://speakerdeck.com/jhellerstein
  23. What is Metadata? Data about data. This used to be so simple! But … schema on use, one of many changes.
  24. Opportunity: A Bigger Context. Don't just fill a metadata-sized hole in the big data stack. Lay the groundwork for rich data context.
  25. Ground is Unopinionated. Postel's Law: "Be conservative in what you do, be liberal in what you accept from others."
  26. The ABCs of Data Context Generated by—and useful to—many applications

    and components. Application Context: Views, models, code Behavioral Context: Data lineage & usage Change over time: Version history
  27. Ground Architecture. Applications (model serving, model debugging, parsing & featurization, catalog & discovery, wrangling, analytics & vis, data quality, reproducibility) plug in through the aboveground API; the Common Ground metamodel sits at the center; services (scavenging and ingestion, search & query, versioned storage, ID & auth) plug in through the underground API.
  28. Ground Architecture. Applications above the aboveground API, the Common Ground metamodel at the center, and services below the underground API. (A sketch of the metamodel follows.)
  29. Current Status. Ground Server: release v0.1.2 (Java Play + PostgreSQL). Grit: Ground over Git. Bedrock: elastic, coordination-free cloud storage; "Anna" prototype, ICDE 2018. www.ground-context.org
  30. 38 flor GROUNDING THE NEXT FLOWERING OF AI Rolando Garcia

    Vikram Sreekanti Dan Crankshaw Neeraja Yadwadkar Joseph Gonzalez Joseph Hellerstein Malhar Patel Sona Jeswani Eric Liu
  31. ML Lifecycle Management: A Context-Rich Application (section overview): Empirical AI; the ML lifecycle; Flor: lifecycle management; flor demo. Builds on Big Data Context Beyond Metadata, the ABCs of Context, and the Common Ground architecture.
  32. 44 AI is more Experimental Science than Engineering Not a

    new observation But increasingly timely Overseen by tweakers
  33. The Fourth Paradigm of Science. That was for "plain old" science! ML & AI generate combinatorially more experiments and data. [Excerpt shown on slide: "A Transformed Scientific Method," Jim Gray, 2007: "We have to do better at producing tools to support the whole research cycle—from data capture and data curation to data analysis and data visualization. Today, the tools for capturing data both at the mega-scale and at the milli-scale are just dreadful. After you have captured the data, you need to curate it before you can start doing any kind of data analysis, and we lack good tools for both data curation and data analysis. Then comes the publication of the results of your research, and the published literature is just the tip of the data iceberg. By this I mean that people collect a lot of data and then reduce this down to some number of column inches in Science or Nature—or 10 pages if it is a computer science person writing. So what I mean by data iceberg is that there is a lot of data that is collected but not curated or published in any systematic way. There are some exceptions, and I think that these cases are a good place for us to look for best practices. I will talk about how the whole process of peer review has got to change and the way in which I think it is changing and what CSTB can do to help all of us get access to our research." Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007; edited by Tony Hey, Stewart Tansley, and Kristin Tolle, Microsoft Research.]
  34. 46 We have to do better at producing tools to

    support the whole research cycle!
  35. The ML lifecycle is poorly tooled. How we develop pipelines: undo/redo in model design is lacking; failure to detect poor methods; version skew (easy for versions to diverge). How we use pipelines, their resources and products: difficult to predict how models will affect system behavior; changes to data may not be tracked or recorded; no record of who uses which resources and why; disorganized models are easy to lose and hard to find.
  36. Disorganized models are easy to lose and hard to find. Models will likely be organized by an individual's standard, but not by an organization's standards. https://xkcd.com/1459/ http://dilbert.com/strip/2011-04-23
  37. Failure to detect poor methods • Data dredging or p-hacking • Weak isolation of test data • Training on attributes that are unknown at test time. Nature, 25 May 2016. https://xkcd.com/882/
  38. Goals for flor. 1. Enable safe and agile exploration of alternative model designs. 2. Passively track and sync the history and versions of a pipeline and its executions across multiple machines. 3. Answer questions about history and provenance, and retrieve artifacts from any version. Approach: build a system that leverages widely used tools in a principled manner.
  39. Flor lives Above Ground. Unlike Ground, Flor is "opinionated": three basic subclasses of Node: Artifact, Literal, Action.
  40. Flor lives Above Ground. Unlike Ground, Flor is "opinionated": three basic subclasses of Node (Artifact, Literal, Action), plus Edges to capture workflow: Artifacts to Actions, Literals to Actions, Actions to Artifacts.
  41. Flor lives Above Ground. Unlike Ground, Flor is "opinionated": Node subclasses (Artifact, Literal, Action); Edges to capture workflow (Artifacts to Actions, Literals to Actions, Actions to Artifacts); and versions of Artifacts/Literals/Edges, identified by git hashes (e.g., #86a1c71b, #cde3e1c5).
  42. Flor lives Above Ground. Unlike Ground, Flor is "opinionated": Node subclasses (Artifact, Literal, Action); Edges to capture workflow; versions of Artifacts/Literals/Edges identified by git hashes; and Lineage Edges that track the ArtifactVersions generated in workflows. (A sketch of this model follows.)
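To visualize the node and edge types just listed, here is a small Python sketch of the workflow graph. It is a paraphrase for illustration: the class names mirror the slide, but the fields and the hash helper are assumptions, not Flor's actual API.

```python
# Illustrative paraphrase of the model above; fields and helpers are assumptions,
# not Flor's actual API.
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    name: str

@dataclass
class Artifact(Node):
    """A file read or written by the pipeline (e.g., train.csv, model.pkl)."""
    path: str = ""

@dataclass
class Literal(Node):
    """A configuration value (e.g., num_estimators)."""
    value: object = None

@dataclass
class Action(Node):
    """A step: Artifact/Literal -> Action input edges, Action -> Artifact output edges."""
    func: Optional[Callable] = None
    inputs: List[Node] = field(default_factory=list)
    outputs: List[Artifact] = field(default_factory=list)

def version_id(content: bytes) -> str:
    """Stand-in for the git hash that identifies one version of an artifact or literal."""
    return hashlib.sha1(content).hexdigest()[:9]

@dataclass
class LineageRecord:
    """Lineage edge: which ArtifactVersion an action run produced, and from what inputs."""
    action: str
    input_versions: List[str]
    output_version: str
```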
  43. 62

  44. When the pipeline executes... • Data versioning: all artifacts are versioned in Git and associated with their respective experiments; new run = new commit. • Metadata versioning: git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits. • Provenance: the provenance relationships between objects (artifacts or otherwise) are recorded in Ground. (See the sketch below.)
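A minimal sketch of the "new run = new commit" pattern described above. The git commands are standard; record_artifact_version() is a hypothetical stand-in for writing an ArtifactVersion into a context store such as Ground.

```python
# Sketch only: record_artifact_version() is a hypothetical stand-in for the
# metadata-versioning step; the git commands are standard CLI calls.
import subprocess

def commit_run(message: str) -> str:
    """Snapshot the working tree so this run's artifacts get a git commit id."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message, "--allow-empty"], check=True)
    rev = subprocess.run(["git", "rev-parse", "HEAD"],
                         check=True, capture_output=True, text=True)
    return rev.stdout.strip()

def record_artifact_version(store: dict, artifact_name: str, commit_id: str) -> None:
    """Hypothetical: register the artifact's new version (keyed by commit) in the store."""
    store.setdefault(artifact_name, []).append(commit_id)

# Usage: after an experiment run writes model.pkl, score.txt, etc.
context_store = {}
commit = commit_run("experiment: taxi fare model, trial 1")
for name in ["model.pkl", "score.txt"]:
    record_artifact_version(context_store, name, commit)
```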
  45. 65 Because we version and record data context... • Materialize

    any artifact, in context • Know which artifact to materialize • Replay all previous experiments, with new data • [Opportunity] Sync local and remote versions of the pipeline, run the pipeline anywhere
  46. 67

  47. When the pipeline executes... • Data versioning: all artifacts are versioned in Git and associated with their respective experiments. • Metadata versioning: git history is reflected in Ground; ArtifactVersions are autogenerated to track git commits. • Provenance: the provenance relationships between objects (artifacts or otherwise) are recorded in Ground. • Parallel multi-trial experiments; our example (3x): num_est=15, num_est=20, num_est=30. (Expanded in the sketch below.)
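To illustrate how a single declared literal can fan out into parallel trials, as in the num_est example above, here is a hypothetical Python sketch. The expansion and executor code only show the idea; it is not Flor's API, and run_trial is a placeholder for the real training step.

```python
# Illustrative sketch of multi-trial expansion over a declared literal.
# The API is hypothetical, not Flor's: one pipeline definition, one trial per
# literal value, trials run in parallel.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def run_trial(num_est: int) -> float:
    """Stand-in for the real training step (e.g., a random forest with
    n_estimators=num_est) that returns its evaluation score."""
    return 1.0 / num_est  # placeholder metric

literals = {"num_est": [15, 20, 30]}   # declared once, swept automatically

def expand_trials(lits: dict) -> list:
    names, values = zip(*lits.items())
    return [dict(zip(names, combo)) for combo in product(*values)]

if __name__ == "__main__":
    configs = expand_trials(literals)
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(run_trial, [c["num_est"] for c in configs]))
    for cfg, score in zip(configs, scores):
        print(cfg, score)
```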
  48. Because we declare and track a literal... • Materialize any artifact, in richer context • Know which artifact to materialize • Replay all previous experiments, with new data • [Opportunity] Sync local and remote versions of the pipeline, run the pipeline anywhere • [Opportunity] Scripting: set the literal from the command line or externally
  49. 72

  50. 73

  51. Example: Taxi.ipynb pipeline (3x trials). Stages: dataframize → calculate_distance → preproc → split → train → test. Inputs: train.csv, num_estimators. Intermediate and output artifacts: train_df.pkl, train_dist_df.pkl, train_ready.pkl, xTrain.pkl, xTest.pkl, yTrain.pkl, yTest.pkl, model.pkl, score.txt, rmse.txt. (A sketch of the stages follows.)
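A plain-Python sketch of the stages named above (dataframize, calculate_distance, preproc, split, train, test). The column names, the distance formula, and the choice of model are assumptions for illustration; the actual Taxi.ipynb may differ.

```python
# Assumed columns (pickup/dropoff coordinates, passenger_count, fare_amount) and
# model choice are illustrative, not taken from the original notebook.
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def dataframize(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def calculate_distance(df: pd.DataFrame) -> pd.DataFrame:
    df["dist"] = np.hypot(df["dropoff_longitude"] - df["pickup_longitude"],
                          df["dropoff_latitude"] - df["pickup_latitude"])
    return df

def preproc(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

def split(df: pd.DataFrame):
    X, y = df[["dist", "passenger_count"]], df["fare_amount"]
    return train_test_split(X, y, test_size=0.2, random_state=0)

def train(xTrain, yTrain, num_estimators: int):
    model = RandomForestRegressor(n_estimators=num_estimators).fit(xTrain, yTrain)
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
    return model

def test(model, xTest, yTest) -> float:
    rmse = mean_squared_error(yTest, model.predict(xTest)) ** 0.5
    with open("rmse.txt", "w") as f:
        f.write(str(rmse))
    return rmse
```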
  52. When the pipeline executes... • Versioning: all artifacts are versioned in Git and associated with their respective experiments; new run = new commit. • Provenance: the relationships between objects, artifacts or otherwise, are recorded in Ground. • Parallel multi-trial experiments: trial-invariant artifacts don't have to be recomputed (sketched below).
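A minimal sketch of skipping recomputation for trial-invariant artifacts: if a step's inputs hash to something already seen, reuse the stored output. This illustrates the idea only and is not Flor's implementation.

```python
# Content-hash memoization sketch; names and structure are assumptions.
import hashlib
import pickle
from typing import Any, Callable, Dict, Tuple

_cache: Dict[Tuple[str, str], Any] = {}

def content_hash(*inputs: Any) -> str:
    return hashlib.sha1(pickle.dumps(inputs)).hexdigest()

def run_step(name: str, func: Callable, *inputs: Any) -> Any:
    """Run a pipeline step, or return its cached output if the inputs are unchanged."""
    key = (name, content_hash(*inputs))
    if key not in _cache:
        _cache[key] = func(*inputs)
    return _cache[key]

# Example: preprocessing does not depend on num_estimators, so across the
# num_est=15/20/30 trials it runs once; training runs once per distinct value.
```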
  53. 76 Because we built a pipeline with Flor... • Materialize

    any artifact, in richer context • Know which artifact to materialize • Replay all previous experiments, with new data • Share resources, with the corresponding changes • Swap components • Maintain the pipeline • [Opportunity] Inter-version Parallelism • [Opportunity] Undo/Redo
  54. We automatically track all the metadata, context, and lineage with

    • Timestamps • Which resources your experiment used • How many trials your experiment ran • What the configuration was per trial • The evolution of your experiment over time (versions) • The lineage that derived any artifact in the workflow • The metadata you need to retrieve a physical copy of any artifact in the workflow, ever • The current state of your experiment in the file system, in context • Whether you’ve forked any experiment resources, and which ones • When you executed an experiment, whether you executed it to completion, or only partially • Whether you’ve peeked at intermediary results during interactive pipeline development, and what you did in Flor after you learned this information • Whether you peek at the same result multiple times, or each time peek at a different trial and see a different result • The location of the peeked artifacts so they may be re-used in future computations without repeating work • Whether two specifications belonging to the same experiment used the same or different resources, and whether they derived the same artifacts. • Whether any resource or artifact was renamed • ….
  55. Perspectives on Data Context. Wrangling: six years with Trifacta and Google Cloud Dataprep. Context Services: the Common Ground model and the Ground system. flor / ML Lifecycle: managing the rise of Empirical AI.
  56. Opportunity At All Levels: application-specific context generation and mining; data context modeling; systems infrastructure for data context management.
  57. 82 We have to do better at producing tools to

    support the whole research cycle! One of the most high-impact (and fun!) topics in CS today.