Time to go meta (on use)

Strata NYC 2015 talk urging the Big Data community to focus on open metadata services.

Joe Hellerstein

September 30, 2015

Transcript

  1. Time to go meta (on use). Joe Hellerstein, Berkeley | Trifacta
  2. Community Health Metadata & Data Management Data Analysis Data Wrangling

  3. Community Health Metadata & Data Management Data Analysis Data Wrangling

    FAIL
  4. What is Metadata? •  Data about data •  This used

    to be so simple!
  5. What is Metadata? •  Data about data •  This used

    to be so simple! •  But .. schema on use! •  One of many changes
  6. Analysis Interoperability Interpretation Reproducibility Governance & The Collective Many kinds

    of metadata.
  7. Analysis

  8. Case: Data Analysis Wrangle Visualize Analyze Data Results METAMNESIA

  9. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN.
  10. —MANUEL BLUM Finite Automata: Limited Turing Machines: much more powerful

    What’s the difference? Turing Machines can write things down. http://www.cs.cmu.edu/~mblum/research/pdf/grad.html HT Lindsey Kuper @lindsey AN ASIDE
  11. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN.
  12. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN. I’M IN THE FLOW. WRITE THINGS DOWN. TENSION
  13. I’M IN THE FLOW. The flow state. [Chart: challenge level (low to high) against skill level (low to high), with regions apathy, worry, anxiety, arousal, flow, control, relaxation, and boredom.] Source: Mihaly Csikszentmihalyi, Flow: The Psychology of Optimal Experience
  14. You will never know your data better than when you

    are wrangling and analyzing it. The flow state
  15. Stop. Write it down. Dam the flow

  16. Damn the flow Go. Curse the lost metadata. Stop. Write

    it down. Dam the flow
  17. TAKE ACTION Data Analytics Infra team: “Write down what you

    can, we’ll fill in the rest.”
  18. Taking Action: Football •  Video data Annotations. •  Metadata from

    manual annotation
  19. Taking Action: Football •  Video data Annotations. •  Passive metadata:

    sensor streams •  NFL + MS = Cool.
  20. Taking Action: Football •  Video data Annotations. •  Passive metadata:

    sensor streams •  NFL + MS = Cool. •  Metadata + Simulation •  NFL + MS + EA = POV.
  21. Capture what people do with data. Augment as appropriate. Interpolate

    as needed. Taking Action: Data Analysis
  22. Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

  23. CASE: Data Debugging

  24. CASE: Data Debugging

  25. [No slide text]
  26. Relationships Master Data on Customers Call detail from HDFS Data

    Wrangling Script Python Numpy Monica’s Churn Analysis Hypothesis Wrangle
  27. Python v2.7 Numpy v1.9.3 Wrangle v3.0 Versioned Relationships Master Data

    on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis
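A minimal sketch of how the versioned relationships on this slide could be written down as data. The artifact names and version labels come from the slide; the `Artifact` and `LineageGraph` classes and the exact dependency edges are assumptions made for illustration, since the transcript does not preserve the arrows of the diagram.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str  # version label, date, or git hash, as written on the slide


@dataclass
class LineageGraph:
    # Maps each consumer to the list of artifacts it depends on.
    edges: dict = field(default_factory=dict)

    def depends_on(self, consumer: Artifact, *inputs: Artifact) -> None:
        self.edges.setdefault(consumer, []).extend(inputs)

    def upstream(self, artifact: Artifact) -> set:
        """Return every artifact reachable by following dependency edges."""
        seen = set()
        stack = [artifact]
        while stack:
            for dep in self.edges.get(stack.pop(), []):
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen


mdm = Artifact("Master Data on Customers", "MDM 10/11/15")
calls = Artifact("Call detail from HDFS", "v1.26")
wrangle = Artifact("Wrangle", "v3.0")
script = Artifact("Data Wrangling Script", "git 0x6987a68a9876b7")
python = Artifact("Python", "v2.7")
numpy_lib = Artifact("Numpy", "v1.9.3")
analysis = Artifact("Monica's Churn Analysis", "git 0x987667e876f033")

graph = LineageGraph()
# Assumed edges: the script wrangles the two data sources,
# and the analysis consumes the script's output using Python and Numpy.
graph.depends_on(script, wrangle, mdm, calls)
graph.depends_on(analysis, script, python, numpy_lib)

for dep in sorted(graph.upstream(analysis), key=lambda a: a.name):
    print(f"{dep.name}: {dep.version}")
```

Because every node carries its version, asking for everything upstream of the analysis yields the full set of versions needed to rerun it.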
  28. Common ground? •  Burgeoning SW market •  n² connections? • 

    Common formats must emerge •  Need a shared place to Write it down, Link it up •  Critical to market health!
  29. An Open, Organic Imperative •  Metadata as Persistent Protocol • 

    Let the Internet be our guide •  Shared but lightweight •  All other standards-on-use
  30. Postel’s Law Be conservative in what you do, be liberal

    in what you accept from others
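One way Postel's Law might apply to metadata, sketched under assumptions: a service accepts version labels in whatever form tools happen to write them, and always emits one canonical form. The function names `accept_version` and `emit_version` and the set of tolerated spellings are hypothetical, not part of any real metadata API.

```python
import re

# Tolerates an optional "version" or "v" prefix, any case, and stray whitespace.
_VERSION_RE = re.compile(
    r"^\s*(?:version\s*)?v?\s*(\d+(?:\.\d+)*)\s*$", re.IGNORECASE
)


def accept_version(raw) -> tuple:
    """Be liberal in what you accept: parse many spellings into a tuple of ints."""
    match = _VERSION_RE.match(str(raw))
    if match is None:
        raise ValueError(f"unrecognized version: {raw!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def emit_version(parts) -> str:
    """Be conservative in what you do: always emit the same canonical form."""
    return "v" + ".".join(str(p) for p in parts)


for raw in ["v2.7", "  Version 1.9.3 ", "V 3.0", 2.7]:
    print(repr(raw), "->", emit_version(accept_version(raw)))
```

Input that cannot be interpreted at all still raises an error; being liberal here means tolerating variation in form, not guessing at meaning.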
  31. Interpretation Analysis • tap the flow • fill in the rest metadata-on-use

    Interoperability • metadata as protocol • Postel’s law standards-on-use
  32. CASE: IoT TinyDB: Berkeley 2003 •  The world is a

    database! •  But the world is continuous—sensors sample. •  Acquisitional: Data on Read! •  “The Data” is not the whole picture by any means. S.R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. “The Design of an Acquisitional Query Processor for Sensor Networks”. SIGMOD 2003.
  33. CASE: IoT •  Models: another kind of metadata •  Physical

    •  Statistical (i.e. code/parameters) •  “We fill in the rest.” This is how. •  Usage: another kind of metadata •  Models determine what data to gather •  Data conditions the model parameters A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. “Model-Driven Data Acquisition in Sensor Networks”. VLDB 2004.
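The loop on this slide (models determine what data to gather, data conditions the model parameters) can be illustrated with a deliberately tiny example. This is a one-dimensional sketch using a scalar Gaussian belief and a standard Kalman-style update; it is an assumption chosen for brevity, not the algorithm from the cited VLDB 2004 paper. All names (`GaussianBelief`, `answer_query`, `read_sensor`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GaussianBelief:
    mean: float
    variance: float

    def drift(self, process_variance: float) -> None:
        """Time passes without an observation, so uncertainty grows."""
        self.variance += process_variance

    def observe(self, reading: float, sensor_variance: float) -> None:
        """Scalar Kalman update: the reading conditions the model parameters."""
        gain = self.variance / (self.variance + sensor_variance)
        self.mean += gain * (reading - self.mean)
        self.variance *= 1.0 - gain


def answer_query(belief, max_variance, read_sensor, sensor_variance):
    """Answer from the model if it is confident enough; otherwise acquire first.

    Returns (estimate, acquired), where acquired says whether a sensor was read.
    """
    acquired = False
    if belief.variance > max_variance:
        belief.observe(read_sensor(), sensor_variance)
        acquired = True
    return belief.mean, acquired


belief = GaussianBelief(mean=20.0, variance=4.0)
first = answer_query(belief, 1.0, lambda: 22.0, 1.0)   # too uncertain: reads sensor
second = answer_query(belief, 1.0, lambda: 99.0, 1.0)  # confident: sensor not read
print(first, second)
```

The second query never touches the sensor, which is the sense in which the model "fills in the rest".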
  34. Reproducibility Analysis • tap the flow • fill in the rest metadata-on-use

    Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use
  35. Can metadata cure cancer?

  36. (No.)

  37. But it’s going to be useful.

  38. Case: Cancer Genomics General population data (“1000 genomes”) Compare Cluster

    Patient Data
  39. Reproducibility They’re cooking the books. They’re actually adjusting the numbers.

    Enron used to do their books the same way. — TED CRUZ, 2015
  40. Reproducibility General population data (“1000 genomes”) Compare Cluster Patient Data

  41. Reproducibility General population data (“1000 genomes”) Compare Cluster Patient Data

    Put leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients
  42. Data Lineage Back to tissue and bar codes on slides!

    Logical vs. Physical •  Atoms + Data + Code •  Black box: executable. (Container/VM) •  Gray box: implementation. (Python) •  White box: logic. (SQL, Wrangle)
  43. It gets messier General population data (“1000 genomes”) Compare Put

    leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Cluster Parameters
  44. It gets messier General population data (“1000 genomes”) Compare Put

    leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Parameter Sweep Cluster Parameters
  45. I have not failed. I've just found 10,000 ways that

    won't work. — THOMAS EDISON
  46. [No slide text]
  47. Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

    • metadata as protocol • Postel’s law standards-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective Interpretation • models interpret data • data conditions models (data + metadata)-on-use
  48. Back at the Enterprise We’re talking Governance. •  And self-service

    for business users! What is to be done? •  Prevention or audit? •  Determinism or probability?
  49. [No slide text]
  50. Back at the Enterprise We’re talking Governance. •  And self-service

    for business users What is to be done? •  Audit or Prevention? •  Probability or Determinism?
  51. CASE: Jupyter Notebook •  An electronic lab notebook •  Evolution

    of IPython Notebook •  Writing it down since 2011
  52. Running a Class from Notebooks Assignments are notebooks •  Students

    create versions •  The solution is a version Grading •  Execute each notebook on some data •  Annotating the notebook with grades •  Updating a grades spreadsheet
  53. Homework Governance Skools ’n rools! •  Students can’t see each

    others’ HW •  Students can’t see solution •  Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login •  Graders can’t see student names •  Students can’t update grade spreadsheet
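The solution-visibility rule above is a small policy that can be written as a predicate. A sketch, assuming a hypothetical `Viewer` record and a `release_date` parameter; the slide gives only "April 12" with no year, so the year used in the example calls is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Viewer:
    has_berkeley_login: bool
    has_submitted: bool


def can_see_solution(viewer: Viewer, today: date, release_date: date) -> bool:
    """A student sees the solution only if all three slide conditions hold:
    they turned in their own work, it is after the release date,
    and they have a Berkeley login."""
    return (
        viewer.has_submitted
        and today > release_date
        and viewer.has_berkeley_login
    )


release = date(2015, 4, 12)  # assumed year; the slide says only "April 12"
student = Viewer(has_berkeley_login=True, has_submitted=True)
print(can_see_solution(student, date(2015, 4, 13), release))
print(can_see_solution(student, date(2015, 4, 12), release))
```

Even this one rule depends on identity, submission state, and time, which hints at how quickly governance-by-prevention grows complicated.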
  54. Collective Intelligence Rules should be a small part of school.

    If we do things well… •  People get smarter •  Educational software gets smarter •  Organizations get smarter Fueled by observing, learning, iterating. Write things down, fill in later.
  55. So ... Enterprise Governance?

  56. Collective, Intelligent Governance By the people. Grassroots governance. •  Sandbox

    → Annotations → Awareness → Reuse → Debate → Consensus For the people. •  Data stewards. (Data gardeners!) Collective Intelligence emerges. http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance
  57. Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

    • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use Relationships & Time
  58. Python v2.7 Numpy v1.9.3 Wrangle v3.0 Recall Versions Master Data

    on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis
  59. Version Vectors Master Data on Customers MDM 10/11/15 Wrangle Script

    git hash 0x6987a68a9876b7 Python v2.7 Hypothesis Call detail from HDFS v1.26 Numpy v1.9.3 Monica’s Churn Analysis git hash 0x987667e876f033 Wrangle v3.0
  60. Relationships Take Time Hypothesis

  61. Time Travel •  Reproducibility = time travel! •  And time

    surfing •  Longitudinal analysis. •  And alternate histories: what-if. Major challenges, opportunities!
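"Reproducibility = time travel" suggests a basic operation: given a point in time, find which version of an artifact was in effect. A minimal sketch using the standard-library `bisect` module; the `VersionHistory` class is hypothetical, and the timestamps and the earlier version label in the example are invented for illustration.

```python
import bisect
from datetime import datetime


class VersionHistory:
    """Append-only history of one artifact's versions, ordered by timestamp."""

    def __init__(self):
        self._times = []
        self._versions = []

    def record(self, when: datetime, version: str) -> None:
        if self._times and when < self._times[-1]:
            raise ValueError("versions must be recorded in time order")
        self._times.append(when)
        self._versions.append(version)

    def as_of(self, when: datetime):
        """Return the version in effect at `when`, or None if none existed yet."""
        index = bisect.bisect_right(self._times, when)
        return self._versions[index - 1] if index else None


history = VersionHistory()
# Hypothetical timeline; only the label "v1.9.3" appears in the slides.
history.record(datetime(2015, 3, 1), "v1.9.2")
history.record(datetime(2015, 9, 21), "v1.9.3")

print(history.as_of(datetime(2015, 6, 1)))
print(history.as_of(datetime(2015, 9, 30)))
print(history.as_of(datetime(2014, 1, 1)))
```

Longitudinal analysis ("time surfing") would then sweep `as_of` across a range of timestamps, and what-if histories would branch the recorded timeline rather than append to it.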
  62. Relationships & Time • versioned relationships • time travel & surfing • changing

    history infinite reuse Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use
  63. Metadata Services Big Data at the Crossroads Make it easy

    to write things down •  Be liberal in what you accept •  Foster connections across SW •  Open repos, common APIs Evolving technology •  We’re just at the beginning! •  Recall football and cancer
  64. A Call to Community It’s time for open metadata services

    •  Science, Industry, Entertainment, Education… The risk of inaction is significant •  Friction in the big data market •  Social trust in science The potential is phenomenal •  Enabling collective intelligence. •  Mastering time