Slide 1

Slide 1 text

People, Computers, and the Hot Mess of Real Data
Joe Hellerstein

Slide 2

Slide 2 text

WHO AM I?

Slide 3

Slide 3 text

THE MISSING THIRD INGREDIENT: PEOPLE
Research imperative: Dramatically simplify labor-intensive tasks … in the analytic lifecycle.
2010
Computing is free. Storage is free. Data is abundant. The remaining bottlenecks lie with people.

Slide 4

Slide 4 text

A SIDE PROJECT
dp = datapeople
http://deepresearch.org

Slide 5

Slide 5 text

dp (c. 2012)
Jeff Heer (Stanford)
Tapan Parikh (Berkeley)
Maneesh Agrawala (Berkeley)
Joe Hellerstein (Berkeley)
Sean Kandel, Diana MacLean, Ravi Parikh, Kuang Chen, Nicholas Kong, Wesley Willett

Slide 6

Slide 6 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION

Slide 7

Slide 7 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION
KDD, SIGMOD, SOSP, NIPS, etc.

Slide 8

Slide 8 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION

Slide 9

Slide 9 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION
Shreddr [Chen et al., DEV 12]
Wrangler [Kandel et al., CHI 11]
MADlib [Hellerstein et al., VLDB 12]
d3 [Bostock et al., InfoVis 11]
CommentSpace [Willett et al., CHI 11]

Slide 10

Slide 10 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION
Shreddr, Wrangler, MADlib, d3, CommentSpace

Slide 11

Slide 11 text

THREE CHAPTERS
➔ Data Acquisition (Shreddr → Captricity)
➔ Data Wrangling (Potter’s Wheel → Wrangler → Trifacta)
➔ Data Context (Ground)

Slide 12

Slide 12 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION

Slide 13

Slide 13 text

Data in the First Mile

Slide 14

Slide 14 text

Extracting value from data without waiting for infrastructure

Slide 15

Slide 15 text

Shreddr

Slide 16

Slide 16 text

Shreddr: Columnar Data Entry & Confirmation

Slide 17

Slide 17 text

Shreddr: Columnar Data Entry & Confirmation
Select the values that are not: Michael

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

ANALYTICS ENABLEMENT
Extracting Data from 1M+ Death Claims
CHALLENGE: No easy access to “cause of death” data. 100s of templates to identify, sort and capture.
UNLOCKED: Improve fraud detection by leveraging patterns found in historical customer data.

Slide 20

Slide 20 text

20

Slide 21

Slide 21 text

SOME LESSONS
➔ (Problems from the field) × (Ideas from the lab)
➔ Apply systems ideas to remove UX bottlenecks
➔ Column compression
➔ Batch processing & instruction locality
➔ Filter pipelines
➔ Crowdsourcing: first hints of Human/Machine collaboration
➔ Humans as algorithmic agents
➔ Challenge: optimize the human work
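The column-oriented and batching ideas in the lessons above can be sketched roughly as follows. This is an illustration only, not Shreddr's actual implementation: the function names (`shred_by_column`, `confirmation_batches`), the record layout, and the batch size are all assumptions made for the example. It regroups form records by field, then groups identical transcriptions so that one "select the values that are not X" question can confirm many cells at once.

```python
from collections import defaultdict


def shred_by_column(forms):
    """Regroup row-oriented form records into per-field (column) lists.

    `forms` maps a form id to a {field: transcribed value} dict (hypothetical layout).
    """
    columns = defaultdict(list)
    for form_id, record in forms.items():
        for field, value in record.items():
            columns[field].append((form_id, value))
    return dict(columns)


def confirmation_batches(column, batch_size=4):
    """Group one column's cells by transcribed value.

    Each batch pairs a value with the forms claiming it, so a worker can
    answer a single "which of these are NOT <value>?" question per batch.
    """
    by_value = defaultdict(list)
    for form_id, value in column:
        by_value[value].append(form_id)
    batches = []
    for value, form_ids in sorted(by_value.items()):
        for i in range(0, len(form_ids), batch_size):
            batches.append((value, form_ids[i:i + batch_size]))
    return batches


# Made-up sample data for illustration.
forms = {
    "f1": {"first_name": "Michael", "city": "Nairobi"},
    "f2": {"first_name": "Michael", "city": "Kampala"},
    "f3": {"first_name": "Sarah", "city": "Nairobi"},
}
columns = shred_by_column(forms)
name_batches = confirmation_batches(columns["first_name"])
```

Working one column at a time keeps the worker's instructions constant across a batch, which is the "instruction locality" point in the list.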

Slide 22

Slide 22 text

THE ANALYTIC LIFECYCLE
ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION COLLABORATION ACQUISITION

Slide 23

Slide 23 text

DATA WRANGLING: A USER-CENTRIC TASK
Designing For Humans
Not Designing for SciFi

Slide 24

Slide 24 text

Talk to Humans

Slide 25

Slide 25 text

WHERE DOES THE TIME GO IN ANALYTICS?
PROCESSING ANALYTICS
80% of the work in any data project is preparing the data. Patil, Data Jujitsu, 2012.
Kandel et al., “Enterprise Data Analysis and Visualization: An Interview Study”, IEEE VAST, 2012.

Slide 26

Slide 26 text

KANDEL SURVEY
Interview study of 35 analysts [Kandel et al., VAST12]
25 companies: Healthcare; Retail, Marketing; Social networking; Media; Finance, Insurance
Various titles: Data analyst; Data scientist; Software engineer; Consultant; Chief technical officer

Slide 27

Slide 27 text

Friction: “I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I’m lucky if I get to do any ‘analysis’ at all.”
Lost potential: “Most of the time once you transform the data ... the insights can be scarily obvious.”

Slide 28

Slide 28 text

Interactivity and Visualization: “It’s easy to just think you know what you are doing and not look at data at every intermediary step. An analysis has 30 different steps. It’s tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong.”

Slide 29

Slide 29 text

29

Slide 30

Slide 30 text

THE DATA TRANSFORMATION PROBLEM
A PROGRAMMING PROBLEM
DATA SOURCE (Complexity): Business System Data, Machine Generated Data, Log Data, …
DATA TRANSFORMATION
DATA PRODUCT (Simplicity): Data Visualization, Fraud Detection, Recommendations, …

Slide 31

Slide 31 text

TRANSFORMATION PROGRAMMING
Languages: Python, Bash, Ruby, Perl…
DSLs: DataStep, AJAX, Pandas, dplyr, Wrangle, Ibis…
Diagram labels: Domain Specific Language (DSL); Data Output; write code, compile, run

Slide 32

Slide 32 text

POTTER’S WHEEL (2001): ENTER THE VISUAL
[Raman & Hellerstein, VLDB01]
➔ Visual DSL
➔ Immediate feedback
➔ Ongoing discrepancy detection
➔ Data lineage, redo/undo

Slide 33

Slide 33 text

Lifting from DSL to Visual Language
Diagram labels: Domain Specific Language (DSL); Data Output; write code, compile, run; Visualization and Interaction; View Result; visualize; interact; Lift; Ground; compile
Problem: Remaining burden of specification for users.

Slide 34

Slide 34 text

My software doesn’t understand what I’m trying to do.

Slide 35

Slide 35 text

I don’t (yet) know what I’m trying to do.

Slide 36

Slide 36 text

HINTS OF INTELLIGENT INTERFACES Type-ahead uses context and data to predict search terms and preview results.

Slide 37

Slide 37 text

SEARCH QUERY AUTO-COMPLETE
Diagram labels: Search Engine; Query Textbox; Query Response; Suggestions; pick; type; predict; GUIDE; DECIDE
The input and output domains are the same: text.
What about more complex input/output relations?

Slide 38

Slide 38 text

WRANGLER (2011): ADD INTELLIGENCE
[Kandel et al., CHI 11] [Guo et al., UIST 11]
➔ Automatic inference of transforms
➔ Predictive preview of results
➔ Interactive history
➔ User Studies
http://vis.stanford.edu/wrangler

Slide 39

Slide 39 text

TRADITIONAL DATA TRANSFORMATION
Diagram labels: Visualization and Interaction; Data Transformation Code
1. User authors a draft transformation script
2. User tests the script on a small amount of data
3. User inspects output data to assess effects

Slide 40

Slide 40 text

Trifacta. Confidential & Proprietary.
PREDICTIVE INTERACTION
Diagram labels: Visualization and Interaction; Data Transformation Code; GUIDE; DECIDE
1. User highlights visual features of the data
2. Algorithms predict a ranked list of scalable transforms
3. Data previews allow user to choose, adjust and confirm
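The highlight, predict, and preview loop described on this slide can be sketched as a toy example. It makes no claim to be Wrangler's or Trifacta's algorithm: the candidate generators, the fixed scores, and every function name here are assumptions chosen to keep the illustration short. A highlighted span of text proposes transforms (guide), and previews on sample rows let the user choose among them (decide).

```python
import re
from functools import partial


def field_at(value, delim, index):
    """Return the index-th field of value split on delim, or '' if missing."""
    parts = value.split(delim)
    return parts[index] if index < len(parts) else ""


def first_number(value):
    """Return the first run of digits in value, or '' if there is none."""
    match = re.search(r"\d+", value)
    return match.group(0) if match else ""


def candidate_transforms(value, start, end):
    """Propose transforms that could explain the selection value[start:end]."""
    selected = value[start:end]
    candidates = []
    for delim in [",", ";", "|", "\t"]:
        parts = value.split(delim)
        if len(parts) > 1 and selected in parts:
            index = parts.index(selected)
            candidates.append({
                "name": f"split on {delim!r}, keep field {index}",
                "apply": partial(field_at, delim=delim, index=index),
                "score": 2.0,  # hand-picked heuristic weight, an assumption
            })
    if selected.isdigit():
        candidates.append({
            "name": "extract first number",
            "apply": first_number,
            "score": 1.0,  # hand-picked heuristic weight, an assumption
        })
    return candidates


def rank(candidates):
    """Order candidates from most to least plausible."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


def preview(candidate, rows):
    """Apply one candidate to sample rows so the user can inspect the effect."""
    return [candidate["apply"](row) for row in rows]


# Made-up log rows; the user highlights "42" (characters 17 to 19) in the first row.
rows = ["2016-03-01,login,42", "2016-03-02,logout,7"]
ranked = rank(candidate_transforms(rows[0], 17, 19))
previews = {c["name"]: preview(c, rows) for c in ranked}
```

In this sketch the two candidates produce visibly different previews on the same rows (the split keeps the last field, while the number extraction grabs the year), which is the kind of contrast that lets a person make the final decision.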

Slide 41

Slide 41 text

PREDICTIVE INTERACTION
[Heer, Hellerstein, Kandel, CIDR15]
Diagram labels: Domain Specific Language (DSL); Visualization and Interaction; Data Output; write code, compile, run; View Result; visualize; compile; Response Preview; pick; interact; predict; codegen; present; Lift; Ground; GUIDE; DECIDE

Slide 42

Slide 42 text

Empowering businesses to innovate with data.

Slide 43

Slide 43 text

Wrangling Web Chat Log Data
Business Challenge: Understanding web chat interactions to personalize the customer experience
• In retail banking, web-based self-service has surpassed both in person and call center usage
• At RBS, 250,000 customer chats per month launched for multiple banking needs
• Analyzing web chat data can provide valuable information about customer needs and pain points
Data Challenge: Only 0.01% of web chat logs analyzed due to complexity
• Large volumes of unstructured, difficult to prep, web chat data being created
• Only 200 chats manually extracted per month and analyzed for quality assurance
• Valuable frontline time taken up by manual processing
• Limited insight into what their customers are speaking to them about
Trifacta: Providing a self-service solution to wrangle 100% of logs
• 100% of web chat logs now prepped and analyzed
• Went from processing 200 logs to 250,000 logs…and now automated, not manual!
• Have new insight into customer needs

Slide 44

Slide 44 text

Empowering RBS’s frontline staff
“The dashboard is transforming the way I run my business. It is improving the customer-centric approach in our chats and it is showing in the output that we now see”
Akshay Vats - Head of Web Chat Operation (India)
© 2016 Royal Bank of Scotland Group. All rights Reserved. The classification of this document is PUBLIC.

Slide 45

Slide 45 text

SOME LESSONS
➔ Predictive Interaction: Guide and Decide
➔ A UX model for AI-assisted, human-driven tasks
➔ DSLs at the center
➔ A formal “narrow waist”
➔ Targetable to multiple runtimes
➔ Provides a modest, factored search space for learning & prediction
➔ Interactive Profiling
➔ Continuous data vis feedback during transformation
➔ Data profile qua data interface

Slide 46

Slide 46 text

ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION

Slide 47

Slide 47 text

ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY ACQUISITION
CONTEXT

Slide 48

Slide 48 text

ACQUISITION TRANSFORMATION ANALYSIS VISUALIZATION DECIDE/DEPLOY
CONTEXT

Slide 49

Slide 49 text

A broader context for big data ground

Slide 50

Slide 50 text

ground A broader context for big data

Slide 51

Slide 51 text

WHAT CHANGED WITH BIG DATA?
Shift in technology: Data representations
Shift in behavior: Data-driven organizations

Slide 52

Slide 52 text

Shift in behavior: Data-driven organizations

Slide 53

Slide 53 text

Data escapes IT
By 2017: marketing spends more on tech than IT.
GARTNER GROUP

Slide 54

Slide 54 text

Data escapes IT
By 2017: marketing spends more on tech than IT.
By 2020: 90% of IT budget controlled outside of IT.
GARTNER GROUP

Slide 55

Slide 55 text

MANY USE CASES MANY CONSTITUENCIES MANY INCENTIVES MANY CONTEXTS

Slide 56

Slide 56 text

Shift in technology: Data representations

Slide 57

Slide 57 text

What does it mean? It depends on the context.
Raw data in the data lake
Simplifies capture
Encourages exploration

Slide 58

Slide 58 text

MANY SCRIPTS MANY MODELS MANY APPLICATIONS MANY CONTEXTS

Slide 59

Slide 59 text

It’s time to establish a bigger context for big data.
Historical context: Because things change
Behavioral context: Because behavior determines meaning
Application context: Because truth is subjective
THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT

Slide 60

Slide 60 text

APPLICATION CONTEXT
Metadata
Models for interpreting the data for use
• Data structures
• Semantic structures
• Statistical structures
Theme: services must provide an unopinionated model of context

Slide 61

Slide 61 text

HISTORICAL CONTEXT
Versions
Web logs
Code to extract user/movie rentals
Recommender for movie licensing
Point in time: A promising new movie is similar to older hot movies at time of release!
Trends over time: How does a movie with these features fare over time?

Slide 62

Slide 62 text

BEHAVIORAL CONTEXT
Lineage & Usage
Why Dora?!

Slide 63

Slide 63 text

BEHAVIORAL CONTEXT
Lineage & Usage
Data Science Recommenders: “You should compare with book sales from last year.”
Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”

Slide 64

Slide 64 text

THE BIG CONTEXT
A NEW WORLD NEEDS NEW SERVICES

Slide 65

Slide 65 text

ABOVEGROUND API TO APPLICATIONS
UNDERGROUND API TO SERVICES
CONTEXT MODEL
COMMON GROUND
Services shown: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving; Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth

Slide 66

Slide 66 text

COMMON GROUND
Version-Model-Lineage (VML) Graphs
Model Graphs
Version Graphs
Usage Graphs: Lineage
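To make the version and lineage graphs more concrete, here is a minimal sketch of the kind of structure they suggest, used for the impact-analysis scenario from the behavioral context slide. It is not Ground's API or data model: the `ContextGraph` class, its methods, and the item names are hypothetical, and model graphs are left out to keep the example small.

```python
from collections import defaultdict, deque


class ContextGraph:
    """Toy store for version history and lineage edges (hypothetical design)."""

    def __init__(self):
        self.versions = defaultdict(list)  # item name -> ordered version ids
        self.lineage = defaultdict(set)    # source node -> nodes derived from it

    def add_version(self, item, version_id):
        """Record a new version of an item and return its node key."""
        self.versions[item].append(version_id)
        return (item, version_id)

    def add_lineage(self, source, derived):
        """Record that `derived` was produced from `source`."""
        self.lineage[source].add(derived)

    def downstream(self, start):
        """Return every node reachable from `start` along lineage edges."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for child in self.lineage.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen


# Made-up items echoing the impact-analysis quote in the deck.
graph = ContextGraph()
script_v1 = graph.add_version("twitter_analysis_script", "v1")
script_v2 = graph.add_version("twitter_analysis_script", "v2")
table = graph.add_version("sentiment_table", "v5")
dashboard = graph.add_version("boss_dashboard", "v1")
graph.add_lineage(script_v2, table)
graph.add_lineage(table, dashboard)

# When the script changes, everything downstream may need a check.
affected = graph.downstream(script_v2)
```

A breadth-first walk over lineage edges is enough here because the question is only "what could be affected"; a fuller system would presumably also record how and when each derivation happened.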

Slide 67

Slide 67 text

RESEARCH OPPORTUNITIES ACROSS THE STACK
ABOVEGROUND API TO APPLICATIONS
UNDERGROUND API TO SERVICES
CONTEXT MODEL
COMMON GROUND
Services shown: Parsing & Featurization; Catalog & Discovery; Wrangling; Analytics & Vis; Reference Data; Data Quality; Reproducibility; Model Serving; Scavenging and Ingestion; Search & Query; Scheduling & Workflow; Versioned Storage; ID & Auth

Slide 68

Slide 68 text

IN SUM: PEOPLE + DATA + COMPUTATION
➔ Dealing with Data: involves much more than algorithms
➔ Human Component: a huge opportunity for tech innovation
➔ Context is Key: for grounding analysis

Slide 69

Slide 69 text

@joe_hellerstein [email protected]