Time to go meta (on use)

Strata NYC 2015 talk urging the Big Data community to focus on open metadata services.

Joe Hellerstein

September 30, 2015

Transcript

  1. What is Metadata? •  Data about data •  This used

    to be so simple! •  But ... schema on use! •  One of many changes
  2. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN.
  3. —MANUEL BLUM Finite Automata: Limited Turing Machines: much more powerful

    What’s the difference? Turing Machines can write things down. http://www.cs.cmu.edu/~mblum/research/pdf/grad.html HT Lindsey Kuper @lindsey AN ASIDE
  4. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN.
  5. —JIM GRAY One of the things that my research advisor

    Mike Harrison taught me to do is to   WRITE THINGS DOWN. I’M IN THE FLOW. WRITE THINGS DOWN. TENSION
  6. I’M IN THE FLOW. [Figure: the flow state, plotted as Challenge Level (low to high) against Skill Level (low to high)]

    Regions: apathy, worry, anxiety, arousal, flow, control, relaxation, boredom. Source: Mihaly Csikszentmihalyi, Flow: The Psychology of Optimal Experience
  7. You will never know your data better than when you

    are wrangling and analyzing it. The flow state
  8. Taking Action: Football •  Video data Annotations. •  Passive metadata:

    sensor streams •  NFL + MS = Cool. •  Metadata + Simulation •  NFL + MS + EA = POV.
  9. Relationships Master Data on Customers Call detail from HDFS Data

    Wrangling Script Python Numpy Monica’s Churn Analysis Hypothesis Wrangle
  10. Python v2.7 Numpy v1.9.3 Wrangle v3.0 Versioned Relationships Master Data

    on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis
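One way to picture the versioned relationships on this slide is as a small graph: each node carries a version, and each edge says which artifact used which. The sketch below is only an illustration. The `Artifact` class, the `uses` mapping and the `upstream` helper are hypothetical names, the versions and git hashes are copied from the slide, and the edge directions are a guess, since the diagram's arrows do not survive in the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str  # version label, date, or git hash, as shown on the slide


# Versions copied from the slide.
python = Artifact("Python", "v2.7")
numpy = Artifact("Numpy", "v1.9.3")
wrangle = Artifact("Wrangle", "v3.0")
master = Artifact("Master Data on Customers", "MDM 10/11/15")
calls = Artifact("Call detail from HDFS", "v1.26")
script = Artifact("Data Wrangling Script", "git hash 0x6987a68a9876b7")
analysis = Artifact("Monica's Churn Analysis", "git hash 0x987667e876f033")

# Assumed edge directions (user -> things used); not taken from the slide.
uses = {
    analysis: [script, python, numpy],
    script: [master, calls, wrangle],
}


def upstream(artifact, graph):
    """Return every artifact reachable from `artifact` via `uses` edges."""
    seen = []
    stack = list(graph.get(artifact, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(graph.get(node, []))
    return seen


deps = upstream(analysis, uses)
for dep in sorted(deps, key=lambda a: a.name):
    print(f"{dep.name}: {dep.version}")
```

Writing down the edges along with the versions is what lets a later reader ask "what exactly did this analysis depend on?" without rerunning anything.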
  11. Common ground? •  Burgeoning SW market •  n² connections? • 

    Common formats must emerge •  Need a shared place to Write it down, Link it up •  Critical to market health!
  12. An Open, Organic Imperative •  Metadata as Persistent Protocol • 

    Let the Internet be our guide •  Shared but lightweight •  All other standards-on-use
  13. Interpretation Analysis • tap the flow • fill in the rest metadata-on-use

    Interoperability • metadata as protocol • Postel’s law standards-on-use
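Postel's law ("be conservative in what you send, be liberal in what you accept") can be applied to metadata records roughly as follows. This is a minimal sketch with made-up names: the `KNOWN` field set, `read_record` and `write_record` are hypothetical, not part of any existing metadata service.

```python
# Hypothetical set of fields this reader understands.
KNOWN = {"name", "version", "owner"}


def read_record(raw):
    """Liberal reader: normalise key spelling and keep unknown fields."""
    record, extras = {}, {}
    for key, value in raw.items():
        norm = key.strip().lower()
        (record if norm in KNOWN else extras)[norm] = value
    record["extras"] = extras  # preserved, not rejected
    return record


def write_record(record):
    """Conservative writer: emit only the known fields."""
    return {k: record[k] for k in sorted(KNOWN) if k in record}


incoming = {" Name ": "calls", "VERSION": "v1.26", "x-team": "churn"}
parsed = read_record(incoming)
outgoing = write_record(parsed)
```

Keeping the unrecognised `x-team` field instead of failing on it is the "liberal" half; a tool that has never heard of the field can still pass the record along.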
  14. CASE: IoT TinyDB: Berkeley 2003 •  The world is a

    database! •  But the world is continuous—sensors sample. •  Acquisitional: Data on Read! •  “The Data” is not the whole picture by any means. S.R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. “The Design of an Acquisitional Query Processor for Sensor Networks”. SIGMOD 2003.
  15. CASE: IoT •  Models: another kind of metadata •  Physical

    •  Statistical (i.e. code/parameters) •  “We fill in the rest.” This is how. •  Usage: another kind of metadata •  Models determine what data to gather •  Data conditions the model parameters A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. “Model-Driven Data Acquisition in Sensor Networks”. VLDB 2004.
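The loop on this slide, where the model decides what data to gather and the data conditions the model, can be illustrated with a toy scalar Gaussian belief. This is not the algorithm from the cited VLDB 2004 paper; it is a textbook one-dimensional Kalman-style update, and the class name, parameters and numbers are all assumptions chosen for the example.

```python
class GaussianBelief:
    """Scalar Gaussian belief about one sensor value (a toy stand-in model)."""

    def __init__(self, mean, var, drift_var, noise_var):
        self.mean = mean
        self.var = var
        self.drift_var = drift_var  # uncertainty added per time step
        self.noise_var = noise_var  # variance of a single sensor reading

    def step(self):
        # Time passes without a reading: the model grows less certain.
        self.var += self.drift_var

    def update(self, reading):
        # Scalar Kalman update: the data conditions the model parameters.
        gain = self.var / (self.var + self.noise_var)
        self.mean += gain * (reading - self.mean)
        self.var *= 1.0 - gain


def answer(belief, read_sensor, max_var):
    """Answer from the model; sample the sensor only if too uncertain."""
    belief.step()
    acquired = belief.var > max_var
    if acquired:
        belief.update(read_sensor())
    return belief.mean, acquired


belief = GaussianBelief(mean=20.0, var=0.1, drift_var=0.2, noise_var=0.5)
readings = iter([21.0, 21.5])  # made-up sensor values
log = [answer(belief, lambda: next(readings), max_var=0.6) for _ in range(5)]
acquisitions = [acquired for _, acquired in log]
```

With these numbers, five queries trigger only two real sensor readings; the model "fills in the rest".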
  16. Reproducibility Analysis • tap the flow • fill in the rest metadata-on-use

    Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use
  17. Reproducibility They’re cooking the books. They’re actually adjusting the numbers.

    Enron used to do their books the same way. — TED CRUZ, 2015
  18. Reproducibility General population data (“1000 genomes”) Compare Cluster Patient Data

    Put leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients
  19. Data Lineage Back to tissue and bar codes on slides!

    Logical vs. Physical •  Atoms + Data + Code •  Black box: executable. (Container/VM) •  Gray box: implementation. (Python) •  White box: logic. (SQL, Wrangle)
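A lineage record that reaches "back to tissue and bar codes" might look like the sketch below. Every name in it is hypothetical (the `Step` fields, the artifact names, the barcode string); the only things taken from the slide are the black, gray and white box categories. The walk assumes the lineage graph has no cycles.

```python
from dataclasses import dataclass


@dataclass
class Step:
    output: str   # artifact this step produced
    inputs: list  # artifacts (or physical items) it consumed
    code: str     # what ran: a container, a script, or a query
    box: str      # "black", "gray", or "white", per the slide


# Hypothetical three-step pipeline, from physical slide to report.
steps = [
    Step("slide_image", ["slide_barcode_0001"], "sequencer container", "black"),
    Step("patient_table", ["slide_image"], "clean.py", "gray"),
    Step("cluster_report", ["patient_table"], "cluster.sql", "white"),
]


def trace(artifact, steps):
    """Walk lineage back from `artifact` to inputs that no step produced."""
    by_output = {s.output: s for s in steps}
    origins, path = [], []
    stack = [artifact]
    while stack:
        name = stack.pop()
        step = by_output.get(name)
        if step is None:
            origins.append(name)  # nothing produced it: a physical origin
        else:
            path.append(step)
            stack.extend(step.inputs)
    return origins, path


origins, path = trace("cluster_report", steps)
```

The `box` label records how much a reader can learn from each step: a black box can only be rerun, a gray box can be read, a white box can be reasoned about.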
  20. It gets messier General population data (“1000 genomes”) Compare Put

    leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Cluster Parameters
  21. It gets messier General population data (“1000 genomes”) Compare Put

    leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Parameter Sweep Cluster Parameters
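A parameter sweep multiplies the number of runs, so each run needs to be written down, including the ones that fail. The following is a minimal sketch under stated assumptions: `sweep` is a hypothetical helper, and `toy_cluster` is a made-up stand-in for the clustering step, not the real analysis.

```python
import itertools


def sweep(run, grid):
    """Run `run(**params)` for every combination in `grid`, logging each try."""
    names = sorted(grid)
    log = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        entry = {"params": params}
        try:
            entry["result"] = run(**params)
            entry["ok"] = True
        except Exception as exc:  # failures are part of the record too
            entry["error"] = repr(exc)
            entry["ok"] = False
        log.append(entry)
    return log


def toy_cluster(k, threshold):
    # Hypothetical stand-in for the clustering step.
    if k == 0:
        raise ValueError("k must be positive")
    return round(threshold / k, 3)


log = sweep(toy_cluster, {"k": [0, 2], "threshold": [0.5, 1.0]})
```

Each log entry pairs the exact parameters with either a result or an error, which is the raw material for the lab-notebook style of reproducibility the talk argues for.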
  22. Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

    • metadata as protocol • Postel’s law standards-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective Interpretation • models interpret data • data conditions models (data + metadata)-on-use
  23. Back at the Enterprise We’re talking Governance. •  And self-service

    for business users! What is to be done? •  Prevention or audit? •  Determinism or probability?
  24. Back at the Enterprise We’re talking Governance. •  And self-service

    for business users What is to be done? •  Audit or Prevention? •  Probability or Determinism?
  25. CASE: Jupyter Notebook •  An electronic lab notebook •  Evolution

    of iPython Notebook •  Writing it down since 2011

  26. Running a Class from Notebooks Assignments are notebooks •  Students

    create versions •  The solution is a version Grading •  Execute each notebook on some data •  Annotating the notebook with grades •  Updating a grades spreadsheet
  27. Homework Governance Skools ’n rools! •  Students can’t see each

    other’s HW •  Students can’t see solution •  Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login •  Graders can’t see student names •  Students can’t update grade spreadsheet
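The rules on this slide can be written as small predicates, which is roughly what a prevention-style governance layer would have to encode. This is a sketch only: the `User` fields and function names are hypothetical, the rule that non-students can always see the solution is an assumption, and the release date is passed in as a parameter because the slide gives no year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class User:
    name: str
    role: str  # "student" or "grader"
    berkeley_login: bool = False
    submitted: bool = False


def can_view_solution(user, today, release_day):
    """Students need a submission, a login, and a date after release."""
    if user.role != "student":
        return True  # assumption: staff can always see the solution
    return user.submitted and today > release_day and user.berkeley_login


def can_view_homework(viewer, owner):
    """Students see only their own homework; graders see every submission."""
    return viewer.role == "grader" or viewer.name == owner.name


def can_update_grades(user):
    return user.role != "student"


def grader_view(submission):
    """Copy of a submission record with the student's name withheld."""
    return {k: v for k, v in submission.items() if k != "student"}


release = date(2015, 4, 12)  # year is an assumption for the example
alice = User("alice", "student", berkeley_login=True, submitted=True)
bob = User("bob", "student", berkeley_login=True, submitted=False)
```

Even this toy version shows why the next slides argue that rules should be a small part of the picture: every exception ("unless they've turned in theirs and...") becomes another hand-written branch.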
  28. Collective Intelligence Rules should be a small part of school.

    If we do things well… •  People get smarter •  Educational software gets smarter •  Organizations get smarter Fueled by observing, learning, iterating. Write things down, fill in later.
  29. Collective, Intelligent Governance By the people. Grassroots governance. •  Sandbox

    → Annotations → Awareness → Reuse → Debate → Consensus For the people. •  Data stewards. (Data gardeners!) Collective Intelligence emerges. http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance
  30. Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

    • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use Relationships & Time
  31. Python v2.7 Numpy v1.9.3 Wrangle v3.0 Recall Versions Master Data

    on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis
  32. Version Vectors Master Data on Customers MDM 10/11/15 Wrangle Script

    git hash 0x6987a68a9876b7 Python v2.7 Hypothesis Call detail from HDFS v1.26 Numpy v1.9.3 Monica’s Churn Analysis git hash 0x987667e876f033 Wrangle v3.0
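Reading "version vector" here as simply one recorded version per component (an assumption about the slide's intent), two runs of an analysis can be compared by diffing their vectors. In the sketch below `diff_versions` is a hypothetical helper, `run_a` copies the versions from the slide, and the `v1.27` in `run_b` is invented purely to give the diff something to find.

```python
def diff_versions(old, new):
    """Map each component whose version differs to an (old, new) pair."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


run_a = {
    "Master Data on Customers": "MDM 10/11/15",
    "Wrangle Script": "git hash 0x6987a68a9876b7",
    "Python": "v2.7",
    "Call detail from HDFS": "v1.26",
    "Numpy": "v1.9.3",
    "Monica's Churn Analysis": "git hash 0x987667e876f033",
    "Wrangle": "v3.0",
}

# Hypothetical later run: only the call-detail version is different.
run_b = dict(run_a)
run_b["Call detail from HDFS"] = "v1.27"

changes = diff_versions(run_a, run_b)
```

If two runs disagree, the diff narrows the search to the components that actually moved.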
  33. Time Travel •  Reproducibility = time travel! •  And time

    surfing •  Longitudinal analysis. •  And alternate histories: what-if. Major challenges, opportunities!
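Time travel and alternate histories both fall out of keeping every version instead of overwriting. The class below is a minimal sketch, assuming integer timestamps and a single item; `History`, `as_of` and `fork` are hypothetical names, not an existing API.

```python
import bisect


class History:
    """Append-only record of (timestamp, value) versions for one item."""

    def __init__(self):
        self._times = []
        self._values = []

    def write(self, when, value):
        if self._times and when <= self._times[-1]:
            raise ValueError("writes must move forward in time")
        self._times.append(when)
        self._values.append(value)

    def as_of(self, when):
        """Value that was current at `when`, or None before the first write."""
        i = bisect.bisect_right(self._times, when)
        return self._values[i - 1] if i else None

    def fork(self, when):
        """Alternate history: a copy that forgets everything after `when`."""
        i = bisect.bisect_right(self._times, when)
        branch = History()
        branch._times = self._times[:i]
        branch._values = self._values[:i]
        return branch


h = History()
h.write(1, "first")
h.write(5, "second")

what_if = h.fork(3)  # branch off before the second write
what_if.write(4, "experiment")
```

Here `as_of` covers reproducibility (what did the data look like then?), scanning `as_of` across timestamps gives longitudinal analysis, and `fork` gives the what-if branch without disturbing the original record.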
  34. Relationships & Time • versioned relationships • time travel & surfing • changing

    history infinite reuse Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use
  35. Metadata Services Big Data at the Crossroads Make it easy

    to write things down •  Be liberal in what you accept •  Foster connections across SW •  Open repos, common APIs Evolving technology •  We’re just at the beginning! •  Recall football and cancer
  36. A Call to Community It’s time for open metadata services

    •  Science, Industry, Entertainment, Education… The risk of inaction is significant •  Friction in the big data market •  Social trust in science The potential is phenomenal •  Enabling collective intelligence. •  Mastering time