Slide 1

Slide 1 text

Time to go meta (on use). Joe Hellerstein Berkeley | Trifacta

Slide 2

Slide 2 text

Community Health Metadata & Data Management Data Analysis Data Wrangling

Slide 3

Slide 3 text

Community Health Metadata & Data Management Data Analysis Data Wrangling FAIL

Slide 4

Slide 4 text

What is Metadata? •  Data about data •  This used to be so simple!

Slide 5

Slide 5 text

What is Metadata? •  Data about data •  This used to be so simple! •  But .. schema on use! •  One of many changes

Slide 6

Slide 6 text

Analysis Interoperability Interpretation Reproducibility Governance & The Collective Many kinds of metadata.

Slide 7

Slide 7 text

Analysis

Slide 8

Slide 8 text

Case: Data Analysis Wrangle Visualize Analyze Data Results METAMNESIA

Slide 9

Slide 9 text

—JIM GRAY One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN.

Slide 10

Slide 10 text

AN ASIDE. Finite automata: limited. Turing machines: much more powerful. What’s the difference? Turing machines can write things down. —MANUEL BLUM, http://www.cs.cmu.edu/~mblum/research/pdf/grad.html (HT Lindsey Kuper, @lindsey)

Slide 11

Slide 11 text

—JIM GRAY One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN.

Slide 12

Slide 12 text

—JIM GRAY One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN. I’M IN THE FLOW. WRITE THINGS DOWN. TENSION

Slide 13

Slide 13 text

I’M IN THE FLOW. [Figure: The flow state. Axes: Challenge Level (low to high) against Skill Level (low to high). Regions: Worry, Apathy, Boredom, Relaxation, Control, Flow, Anxiety, Arousal.] Source: Mihaly Csikszentmihalyi, Flow: The Psychology of Optimal Experience

Slide 14

Slide 14 text

You will never know your data better than when you are wrangling and analyzing it. The flow state

Slide 15

Slide 15 text

Dam the flow: Stop. Write it down.

Slide 16

Slide 16 text

Dam the flow: Stop. Write it down. Damn the flow: Go. Curse the lost metadata.

Slide 17

Slide 17 text

TAKE ACTION Data Analytics Infra team: “Write down what you can, we’ll fill in the rest.”

Slide 18

Slide 18 text

Taking Action: Football •  Video data Annotations. •  Metadata from manual annotation

Slide 19

Slide 19 text

Taking Action: Football •  Video data Annotations. •  Passive metadata: sensor streams •  NFL + MS = Cool.

Slide 20

Slide 20 text

Taking Action: Football •  Video data Annotations. •  Passive metadata: sensor streams •  NFL + MS = Cool. •  Metadata + Simulation •  NFL + MS + EA = POV.

Slide 21

Slide 21 text

Capture what people do with data. Augment as appropriate. Interpolate as needed. Taking Action: Data Analysis
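
One way to capture what people do with data without stopping them is to instrument the steps they already run. The following is a minimal sketch, not something shown in the talk: `USAGE_LOG`, `record_use`, and `drop_missing` are hypothetical names, and a real service would persist the records rather than keep them in a list.

```python
import functools
import time

# Hypothetical in-memory log; a real system would persist these records.
USAGE_LOG = []

def record_use(func):
    """Log each call (step name, arguments, duration, outcome) as usage metadata."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        entry = {"step": func.__name__, "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = func(*args, **kwargs)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = "error: " + type(exc).__name__
            raise
        finally:
            # Runs on success and on failure, so nothing goes unrecorded.
            entry["seconds"] = time.time() - start
            USAGE_LOG.append(entry)
    return wrapper

@record_use
def drop_missing(rows, column):
    """Example wrangling step: keep only rows where `column` is present."""
    return [r for r in rows if r.get(column) is not None]

cleaned = drop_missing([{"id": 1, "plan": "a"}, {"id": 2, "plan": None}], "plan")
```

The analyst calls `drop_missing` as usual; the log entry is a side effect, which is the point of capturing metadata on use.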

Slide 22

Slide 22 text

Analysis • tap the flow • fill in the rest metadata-on-use Interoperability

Slide 23

Slide 23 text

CASE: Data Debugging

Slide 24

Slide 24 text

CASE: Data Debugging

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Relationships Master Data on Customers Call detail from HDFS Data Wrangling Script Python Numpy Monica’s Churn Analysis Hypothesis Wrangle

Slide 27

Slide 27 text

Python v2.7 Numpy v1.9.3 Wrangle v3.0 Versioned Relationships Master Data on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis

Slide 28

Slide 28 text

Common ground? •  Burgeoning SW market •  n² connections? •  Common formats must emerge •  Need a shared place to Write it down, Link it up •  Critical to market health!

Slide 29

Slide 29 text

An Open, Organic Imperative •  Metadata as Persistent Protocol •  Let the Internet be our guide •  Shared but lightweight •  All other standards-on-use

Slide 30

Slide 30 text

Postel’s Law Be conservative in what you do, be liberal in what you accept from others
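
Applied to metadata, Postel's Law might look like the sketch below: accept several spellings of a date, but always write one. The function names `accept_date` and `emit_date` are hypothetical, the list of accepted formats is an arbitrary illustration, and slash-separated dates are assumed to be month-first (US style).

```python
from datetime import date, datetime

# Assumption: slash-separated dates are month/day/year.
ACCEPTED_FORMATS = ("%Y-%m-%d", "%m/%d/%y", "%m/%d/%Y")

def accept_date(value):
    """Be liberal in what you accept: several common spellings of a date."""
    if isinstance(value, datetime):  # check datetime first; it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    text = str(value).strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: " + repr(value))

def emit_date(value):
    """Be conservative in what you do: always emit ISO 8601 (YYYY-MM-DD)."""
    return accept_date(value).isoformat()

normalized = [emit_date(v) for v in ("2015-10-11", "10/11/15", " 10/11/2015 ")]
```

Under these assumptions all three inputs normalize to the same string, so downstream tools only ever see one format.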

Slide 31

Slide 31 text

Interpretation Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use

Slide 32

Slide 32 text

CASE: IoT TinyDB: Berkeley 2003 •  The world is a database! •  But the world is continuous—sensors sample. •  Acquisitional: Data on Read! •  “The Data” is not the whole picture by any means. S.R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. “The Design of an Acquisitional Query Processor for Sensor Networks”. SIGMOD 2003.

Slide 33

Slide 33 text

CASE: IoT •  Models: another kind of metadata •  Physical •  Statistical (i.e. code/parameters) •  “We fill in the rest.” This is how. •  Usage: another kind of metadata •  Models determine what data to gather •  Data conditions the model parameters A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. “Model-Driven Data Acquisition in Sensor Networks”. VLDB 2004.
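
The loop between models and data can be sketched in a few lines. This is a deliberately simplified scalar illustration, not the algorithm from the cited VLDB 2004 paper: the class name `ModelDrivenSensor`, the random-walk drift assumption, and the variance parameters are all hypothetical. The model answers queries on its own while it is confident, reads the sensor only when it is not, and each reading updates the model.

```python
class ModelDrivenSensor:
    """Answer queries from a model; sample the sensor only when unsure."""

    def __init__(self, read_sensor, drift_var=0.5, noise_var=0.1):
        self.read_sensor = read_sensor  # callable returning a noisy reading
        self.drift_var = drift_var      # assumed variance added per time step
        self.noise_var = noise_var      # assumed sensor noise variance
        self.mean = 0.0
        self.var = float("inf")         # no knowledge yet
        self.reads = 0

    def tick(self):
        """Time passes: the model grows less certain about the world."""
        self.var += self.drift_var

    def query(self, max_var):
        """Return an estimate, reading the sensor first if the model is too uncertain."""
        if self.var > max_var:
            # The model determines what data to gather...
            z = self.read_sensor()
            self.reads += 1
            # ...and the data conditions the model (scalar Kalman-style update).
            if self.var == float("inf"):
                self.mean, self.var = z, self.noise_var
            else:
                gain = self.var / (self.var + self.noise_var)
                self.mean += gain * (z - self.mean)
                self.var *= 1.0 - gain
        return self.mean

sensor = ModelDrivenSensor(lambda: 20.0)  # stand-in for a real sensor
first = sensor.query(max_var=1.0)   # nothing known yet: reads the sensor
sensor.tick()
second = sensor.query(max_var=1.0)  # model still confident: no read
sensor.tick()
third = sensor.query(max_var=1.0)   # uncertainty has grown: reads again
```

Three queries cost two sensor reads here; the skipped read is the model "filling in the rest".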

Slide 34

Slide 34 text

Reproducibility Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use

Slide 35

Slide 35 text

Can metadata cure cancer?

Slide 36

Slide 36 text

(No.)

Slide 37

Slide 37 text

But it’s going to be useful.

Slide 38

Slide 38 text

Case: Cancer Genomics General population data (“1000 genomes”) Compare Cluster Patient Data

Slide 39

Slide 39 text

Reproducibility They’re cooking the books. They’re actually adjusting the numbers. Enron used to do their books the same way. — TED CRUZ, 2015

Slide 40

Slide 40 text

Reproducibility General population data (“1000 genomes”) Compare Cluster Patient Data

Slide 41

Slide 41 text

Reproducibility General population data (“1000 genomes”) Compare Cluster Patient Data Put leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients

Slide 42

Slide 42 text

Data Lineage Back to tissue and bar codes on slides! Logical vs. Physical •  Atoms + Data + Code •  Black box: executable. (Container/VM) •  Gray box: implementation. (Python) •  White box: logic. (SQL, Wrangle)

Slide 43

Slide 43 text

It gets messier General population data (“1000 genomes”) Compare Put leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Cluster Parameters

Slide 44

Slide 44 text

It gets messier General population data (“1000 genomes”) Compare Put leukemia cells on slide Robot puts chemistry on slides Robot puts slide on gene sequencer X 1000 patients Parameter Sweep Cluster Parameters
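
A parameter sweep is only reproducible if every run is written down, including the ones that fail. A minimal sketch follows, assuming a hypothetical `sweep` helper and a toy stand-in (`toy_cluster`) for the real clustering step; the parameter names and the failure rule are invented for illustration.

```python
import itertools

def sweep(run, grid):
    """Run every parameter combination and log each outcome, failures included."""
    notebook = []
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        record = {"params": params}
        try:
            record["result"] = run(**params)
            record["status"] = "ok"
        except Exception as exc:
            # A failed run is still lineage: keep it in the notebook.
            record["status"] = "failed"
            record["error"] = repr(exc)
        notebook.append(record)
    return notebook

def toy_cluster(k, min_size):
    """Hypothetical stand-in for a clustering step over 12 samples."""
    if k * min_size > 12:
        raise ValueError("not enough samples for k clusters")
    return {"clusters": k}

log = sweep(toy_cluster, {"k": [2, 4, 8], "min_size": [1, 3]})
failures = [r for r in log if r["status"] == "failed"]
```

The returned list plays the role of a lab notebook: one entry per attempt, with the parameters that produced it.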

Slide 45

Slide 45 text

I have not failed. I've just found 10,000 ways that won't work. — THOMAS EDISON

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective Interpretation • models interpret data • data conditions models (data + metadata)-on-use

Slide 48

Slide 48 text

Back at the Enterprise We’re talking Governance. •  And self-service for business users! What is to be done? •  Prevention or audit? •  Determinism or probability?

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Back at the Enterprise We’re talking Governance. •  And self-service for business users What is to be done? •  Audit or Prevention? •  Probability or Determinism?

Slide 51

Slide 51 text

CASE: Jupyter Notebook •  An electronic lab notebook •  Evolution of IPython Notebook •  Writing it down since 2011

Slide 52

Slide 52 text

Running a Class from Notebooks Assignments are notebooks •  Students create versions •  The solution is a version Grading •  Execute each notebook on some data •  Annotating the notebook with grades •  Updating a grades spreadsheet

Slide 53

Slide 53 text

Homework Governance Skools ’n rools! •  Students can’t see each other’s HW •  Students can’t see solution •  Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login •  Graders can’t see student names •  Students can’t update grade spreadsheet
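
The rules above can be written down as small, checkable predicates. This is a sketch under stated assumptions: the record fields (`role`, `submitted`, `berkeley_login`, `student_name`) and function names are hypothetical, staff access to the solution is assumed, and the year in the example is a placeholder because the slide gives only "April 12".

```python
from datetime import date

def can_view_homework(viewer, owner_id):
    """Students may see only their own homework; graders may see any."""
    if viewer["role"] == "student":
        return viewer["id"] == owner_id
    return viewer["role"] == "grader"

def can_view_solution(viewer, today, release_day):
    """Students need a submission, a Berkeley login, and a date after release."""
    if viewer["role"] != "student":
        return True  # assumption: staff can always see the solution
    return bool(viewer["submitted"] and viewer["berkeley_login"]
                and today > release_day)

def can_update_grades(viewer):
    """Students can't update the grade spreadsheet."""
    return viewer["role"] != "student"

def grader_view(submission):
    """Graders can't see student names: strip them before display."""
    return {k: v for k, v in submission.items() if k != "student_name"}

release = date(2000, 4, 12)  # placeholder year, for illustration only
student = {"id": 1, "role": "student", "submitted": True, "berkeley_login": True}
late_starter = dict(student, id=2, submitted=False)
```

Each predicate can either block an action up front or be evaluated afterwards over a log of actions, so the same rules serve both prevention and audit.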

Slide 54

Slide 54 text

Collective Intelligence Rules should be a small part of school. If we do things well… •  People get smarter •  Educational software gets smarter •  Organizations get smarter Fueled by observing, learning, iterating. Write things down, fill in later.

Slide 55

Slide 55 text

So ... Enterprise Governance?

Slide 56

Slide 56 text

Collective, Intelligent Governance By the people. Grassroots governance. •  Sandbox → Annotations → Awareness → Reuse → Debate → Consensus For the people. •  Data stewards. (Data gardeners!) Collective Intelligence emerges. http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance

Slide 57

Slide 57 text

Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use Relationships & Time

Slide 58

Slide 58 text

Python v2.7 Numpy v1.9.3 Wrangle v3.0 Recall Versions Master Data on Customers MDM 10/11/15 Call detail from HDFS v1.26 Data Wrangling Script git hash 0x6987a68a9876b7 Monica’s Churn Analysis git hash 0x987667e876f033 Hypothesis

Slide 59

Slide 59 text

Version Vectors Master Data on Customers MDM 10/11/15 Wrangle Script git hash 0x6987a68a9876b7 Python v2.7 Hypothesis Call detail from HDFS v1.26 Numpy v1.9.3 Monica’s Churn Analysis git hash 0x987667e876f033 Wrangle v3.0
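
Reading "version vector" here simply as the tuple of versions of every artifact in the picture, two runs can be compared component by component. In this minimal sketch the key names and the `diff_versions` helper are hypothetical; the version values for the first run are the ones on the slide, and the second run is invented for illustration.

```python
def diff_versions(old, new):
    """Return {artifact: (old_version, new_version)} for every artifact that changed."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed

run_1 = {
    "python": "2.7",
    "numpy": "1.9.3",
    "wrangle": "3.0",
    "master_data_customers": "MDM 10/11/15",
    "call_detail_hdfs": "1.26",
    "wrangling_script": "0x6987a68a9876b7",
    "churn_analysis": "0x987667e876f033",
}
# Hypothetical later run: only NumPy was upgraded (version chosen for illustration).
run_2 = dict(run_1, numpy="1.10.0")

changed = diff_versions(run_1, run_2)
```

If the hypothesis holds in one run and not the other, the diff narrows the search to the components that actually moved.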

Slide 60

Slide 60 text

Relationships Take Time Hypothesis

Slide 61

Slide 61 text

Time Travel •  Reproducibility = time travel! •  And time surfing •  Longitudinal analysis. •  And alternate histories: what-if. Major challenges, opportunities!
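
Time travel and time surfing both fall out of keeping an append-only history and never overwriting. A minimal sketch, assuming a hypothetical `VersionedValue` class with integer timestamps (the example values echo the "Call detail from HDFS v1.26" label; the other versions are invented):

```python
import bisect

class VersionedValue:
    """Append-only history of a value, queryable as of any past time."""

    def __init__(self):
        self._times = []
        self._values = []

    def write(self, t, value):
        if self._times and t <= self._times[-1]:
            raise ValueError("writes must move forward in time")
        self._times.append(t)
        self._values.append(value)

    def as_of(self, t):
        """Time travel: the value that was current at time t."""
        i = bisect.bisect_right(self._times, t)
        if i == 0:
            raise KeyError("no value recorded at or before " + repr(t))
        return self._values[i - 1]

    def history(self, start, end):
        """Time surfing: every (time, value) written in [start, end]."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

call_detail = VersionedValue()
call_detail.write(10, "v1.25")  # hypothetical earlier version
call_detail.write(20, "v1.26")
call_detail.write(30, "v1.27")  # hypothetical later version

then = call_detail.as_of(25)
span = call_detail.history(10, 20)
```

Alternate histories (what-if) would need branching on top of this, which the sketch does not attempt.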

Slide 62

Slide 62 text

Relationships & Time • versioned relationships • time travel & surfing • changing history infinite reuse Analysis • tap the flow • fill in the rest metadata-on-use Interoperability • metadata as protocol • Postel’s law standards-on-use Interpretation • models interpret data • data conditions models (data + metadata)-on-use Reproducibility • instrumentation • lineage: success & failure lab notebook-on-use Governance & The Collective • by & for the people • collective intelligence governance-on-use

Slide 63

Slide 63 text

Metadata Services Big Data at the Crossroads Make it easy to write things down •  Be liberal in what you accept •  Foster connections across SW •  Open repos, common APIs Evolving technology •  We’re just at the beginning! •  Recall football and cancer

Slide 64

Slide 64 text

A Call to Community It’s time for open metadata services •  Science, Industry, Entertainment, Education… The risk of inaction is significant •  Friction in the big data market •  Social trust in science The potential is phenomenal •  Enabling collective intelligence. •  Mastering time