People, Computers, and the
Hot Mess of Real Data
Joe Hellerstein
Slide 2
Slide 2 text
WHO AM I?
Slide 3
Slide 3 text
THE MISSING THIRD INGREDIENT: PEOPLE
3
Research imperative:
Dramatically simplify labor-intensive tasks … in the analytic lifecycle.
2010
Computing is free.
Storage is free.
Data is abundant.
The remaining bottlenecks lie with people.
Slide 4
Slide 4 text
A SIDE PROJECT
4
dp = datapeople
http://deepresearch.org
Slide 5
Slide 5 text
dp (c. 2012)
5
Jeff Heer (Stanford)
Tapan Parikh (Berkeley)
Maneesh Agrawala (Berkeley)
Joe Hellerstein (Berkeley)
Sean Kandel
Diana MacLean
Ravi Parikh
Kuang Chen
Nicholas Kong
Wesley Willett
Slide 6
Slide 6 text
THE ANALYTIC LIFECYCLE
6
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
DECIDE/DEPLOY
ACQUISITION
Slide 7
Slide 7 text
THE ANALYTIC LIFECYCLE
7
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
DECIDE/DEPLOY
ACQUISITION
KDD, SIGMOD, SOSP, NIPS, etc.
Slide 8
Slide 8 text
THE ANALYTIC LIFECYCLE
8
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
DECIDE/DEPLOY
ACQUISITION
Slide 9
Slide 9 text
THE ANALYTIC LIFECYCLE
9
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
DECIDE/DEPLOY
ACQUISITION
Shreddr
Wrangler
MADlib
d3
[Chen et al., DEV12]
[Kandel, et al. CHI 11]
[Hellerstein, et al. VLDB 12]
[Bostock et al. Infovis 11]
CommentSpace [Willett et al. CHI 11]
THREE CHAPTERS
➔ Data Acquisition (Shreddr → Captricity)
➔ Data Wrangling (Potter’s Wheel → Wrangler → Trifacta)
➔ Data Context (Ground)
11
Slide 12
Slide 12 text
THE ANALYTIC LIFECYCLE
12
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
DECIDE/DEPLOY
ACQUISITION
Slide 13
Slide 13 text
Data in the First Mile
Slide 14
Slide 14 text
14
Extracting value from data without
waiting for infrastructure
Slide 15
Slide 15 text
15
Shreddr
Slide 16
Slide 16 text
16
Shreddr: Columnar Data Entry & Confirmation
Slide 17
Slide 17 text
Select the values that are not: Michael
17
Shreddr: Columnar Data Entry & Confirmation
Slide 19
Slide 19 text
ANALYTICS ENABLEMENT
Extracting Data from 1M+ Death Claims
19
CHALLENGE…
No easy access to “cause of death” data
100s of templates to identify, sort, and capture
UNLOCKED
Improve fraud detection by leveraging patterns
found in historical customer data
Slide 20
Slide 20 text
20
Slide 21
Slide 21 text
SOME LESSONS
➔ (Problems from the field) × (Ideas from the lab)
➔ Apply systems ideas to remove UX bottlenecks
    ➔ Column compression
    ➔ Batch processing & instruction locality
    ➔ Filter pipelines
➔ Crowdsourcing: first hints of Human/Machine collaboration
    ➔ Humans as algorithmic agents
    ➔ Challenge: optimize the human work
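A minimal sketch of the column-oriented idea in these lessons, under stated assumptions: forms have already been shredded into cells, each cell carries a machine guess (for example from OCR), and a worker only marks the cells that do not match the guess. The data, the function names (`batch_by_column`, `confirm`), and the tuple layout are hypothetical and are not Shreddr's actual pipeline.

```python
from collections import defaultdict

# Hypothetical shredded cells: (form_id, column, machine_guess). Values are made up.
cells = [
    (1, "first_name", "Michael"),
    (2, "first_name", "Michael"),
    (3, "first_name", "Michele"),
    (1, "district", "Kampala"),
    (2, "district", "Kampala"),
    (3, "district", "Kampala"),
]


def batch_by_column(cells):
    """Column compression: one task per (column, guessed value), not per cell."""
    batches = defaultdict(list)
    for form_id, column, guess in cells:
        batches[(column, guess)].append(form_id)
    return dict(batches)


def confirm(batches, rejected):
    """Filter step: keep confirmed cells, route rejected ones to manual entry.

    `rejected` holds the (column, guess, form_id) triples that a worker marked
    as "not this value" while reviewing a batch.
    """
    confirmed, needs_entry = [], []
    for (column, guess), form_ids in batches.items():
        for form_id in form_ids:
            if (column, guess, form_id) in rejected:
                needs_entry.append((form_id, column))
            else:
                confirmed.append((form_id, column, guess))
    return confirmed, needs_entry


batches = batch_by_column(cells)
confirmed, needs_entry = confirm(batches, rejected={("first_name", "Michael", 2)})
print(len(cells), "cells ->", len(batches), "confirmation tasks")
print("needs manual entry:", needs_entry)
```

Grouping by column keeps the worker on one kind of question at a time, which is the "instruction locality" point; the rejected cells form the next stage of the filter pipeline.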
Slide 22
Slide 22 text
THE ANALYTIC LIFECYCLE
ACQUISITION
TRANSFORMATION
ANALYSIS
VISUALIZATION
COLLABORATION
ACQUISITION
Slide 23
Slide 23 text
DATA WRANGLING: A USER-CENTRIC TASK
23
Designing for Humans, Not Designing for SciFi
Slide 24
Slide 24 text
Talk to Humans
Slide 25
Slide 25 text
WHERE DOES THE TIME GO IN ANALYTICS?
PROCESSING
ANALYTICS
80%
of the work in any
data project is
preparing the data.
Patil, Data Jujitsu, 2012.
Kandel et al. “Enterprise
Data Analysis and
Visualization: An Interview
Study”, IEEE VAST, 2012.
Slide 26
Slide 26 text
Interview study of 35 analysts:
25 companies
Healthcare
Retail, Marketing
Social networking
Media
Finance, Insurance
Various titles
Data analyst
Data scientist
Software engineer
Consultant
Chief technical officer
[Kandel et al., VAST12]
KANDEL SURVEY
26
Slide 27
Slide 27 text
“I spend more than half of my time integrating, cleansing and
transforming data without doing any actual analysis. Most of
the time I’m lucky if I get to do any ‘analysis’ at all.”
Friction
“Most of the time once you transform the data ... the insights
can be scarily obvious.”
Lost potential
Slide 28
Slide 28 text
“It’s easy to just think you know what you are doing and not look
at data at every intermediary step.
An analysis has 30 different steps. It’s tempting to just do this
then that and then this. You have no idea in which ways you are
wrong and what data is wrong.”
Interactivity and Visualization
Slide 29
Slide 29 text
29
Slide 30
Slide 30 text
A PROGRAMMING PROBLEM
THE DATA TRANSFORMATION PROBLEM
30
DATA TRANSFORMATION
Business System Data
Machine Generated Data
Log Data
Data Visualization
Fraud Detection
Recommendations
DATA SOURCE
Complexity
DATA PRODUCT
Simplicity
… …
Slide 31
Slide 31 text
TRANSFORMATION PROGRAMMING
Languages: Python, Bash, Ruby, Perl…
DSLs: DataStep, AJAX, Pandas, dplyr, Wrangle, Ibis…
31
Domain Specific Language (DSL)
Data Output
write code, compile, run
Slide 32
Slide 32 text
POTTER’S WHEEL (2001): ENTER THE VISUAL
➔ Visual DSL
➔ Immediate feedback
➔ Ongoing discrepancy detection
➔ Data lineage, redo/undo
32
[Raman & Hellerstein, VLDB01]
Slide 33
Slide 33 text
Lifting from DSL to Visual Language
33
Domain Specific Language (DSL)
Data Output
write code, compile, run
Visualization and Interaction
View Result
visualize
interact
Lift Ground
compile
Problem: Remaining burden of specification for users.
Slide 34
Slide 34 text
My software doesn’t understand
what I’m trying to do.
Slide 35
Slide 35 text
I don’t (yet) know
what I’m trying to do.
Slide 36
Slide 36 text
HINTS OF INTELLIGENT INTERFACES
Type-ahead uses context
and data to predict search
terms and preview results.
Slide 37
Slide 37 text
SEARCH QUERY AUTO-COMPLETE
37
Search Engine Query
Textbox Query
Response Suggestions
pick
type
GUIDE DECIDE
predict
The input and output domains are the same: text.
What about more complex input/output relations?
Slide 38
Slide 38 text
WRANGLER (2011): ADD INTELLIGENCE
38
[Kandel, et al. CHI 11]
[Guo, et al. UIST11]
➔ Automatic inference of transforms
➔ Predictive preview of results
➔ Interactive history
➔ User Studies
http://vis.stanford.edu/wrangler
Slide 39
Slide 39 text
TRADITIONAL DATA TRANSFORMATION
39
Visualization and Interaction
Data Transformation Code
1. User authors a draft transformation script
2. User tests the script on a small amount of data
3. User inspects output data to assess effects
Slide 40
Slide 40 text
PREDICTIVE INTERACTION
40
Visualization and Interaction
Data Transformation Code
1. User highlights visual features of the data
2. Algorithms predict a ranked list of scalable transforms
3. Data previews allow user to choose, adjust and confirm
GUIDE DECIDE
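The three-step loop above can be sketched roughly as follows. This is an illustration under assumptions, not Wrangler's or Trifacta's inference algorithm: the log rows are made up, the three candidate transforms are hand-picked, and ranking is by simple coverage (how many rows yield a value). The names `generalize`, `candidates`, and `rank` are hypothetical.

```python
import re

# Made-up sample rows standing in for the data the user is looking at.
rows = [
    "2015-03-01 ERROR disk full",
    "2015-03-02 WARN cpu high",
    "2015-03-02 ERROR net down",
]


def generalize(text):
    """Turn a highlighted string into a regex over runs of character classes."""
    parts = []
    for ch in text:
        if ch.isdigit():
            cls = r"\d"
        elif ch.isalpha():
            cls = "[A-Za-z]"
        else:
            cls = re.escape(ch)
        if parts and parts[-1][0] == cls:
            parts[-1][1] += 1
        else:
            parts.append([cls, 1])
    return "".join(c + ("+" if n > 1 else "") for c, n in parts)


def candidates(example, start, end):
    """Step 2 (predict): propose transforms from one highlighted span."""
    selected = example[start:end]
    literal = re.compile(re.escape(selected))
    pattern = re.compile(generalize(selected))
    token_index = len(example[:start].split())

    def by_literal(row):
        m = literal.search(row)
        return m.group(0) if m else None

    def by_pattern(row):
        m = pattern.search(row)
        return m.group(0) if m else None

    def by_token(row):
        tokens = row.split()
        return tokens[token_index] if token_index < len(tokens) else None

    return [
        (f"extract literal {selected!r}", by_literal),
        (f"extract pattern /{pattern.pattern}/", by_pattern),
        (f"extract whitespace-separated token #{token_index}", by_token),
    ]


def rank(cands, rows):
    """Score each candidate by coverage and attach a preview for the user."""
    scored = []
    for desc, fn in cands:
        preview = [fn(r) for r in rows]
        coverage = sum(v is not None for v in preview)
        scored.append((coverage, desc, preview))
    scored.sort(key=lambda s: s[0], reverse=True)  # stable for ties
    return scored


# Step 1 (guide): the user highlights "ERROR" in the first row.
example = rows[0]
start = example.index("ERROR")
ranked = rank(candidates(example, start, start + len("ERROR")), rows)

# Step 3 (decide): the user reads the previews and picks one.
for coverage, desc, preview in ranked:
    print(f"{coverage}/{len(rows)}  {desc}: {preview}")
```

The user never writes the regex; they choose among previews, and the chosen candidate is what would be compiled to a transformation script.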
Slide 41
Slide 41 text
PREDICTIVE INTERACTION
41
Domain Specific Language (DSL)
Visualization and Interaction
Data Output
write code, compile, run
View Result
visualize compile
Response Preview
pick
interact predict
GUIDE DECIDE
codegen present
Lift Ground
[Heer, Hellerstein, Kandel, CIDR15]
Slide 42
Slide 42 text
Empowering businesses
to innovate with data.
Slide 43
Slide 43 text
Wrangling Web Chat Log Data
43
Business Challenge:
Understanding web chat
interactions to personalize the
customer experience
Data Challenge:
Only 0.01% of web chat logs
analyzed due to complexity
• Large volumes of unstructured,
difficult to prep, web chat data
being created
• Only 200 chats manually extracted
per month and analyzed for quality
assurance
• Valuable frontline time taken up by
manual processing
• Limited insight into what their
customers are speaking to them
about
• In retail banking, web-based self-
service has surpassed both in
person and call center usage
• At RBS, 250,000 customer chats
per month launched for multiple
banking needs
• Analyzing web chat data can
provide valuable information about
customer needs and pain points
Trifacta:
Providing a self-service
solution to wrangle 100% of
logs
• 100% of web chat logs now
prepped and analyzed
• Went from processing 200 logs to
250,000 logs…and now automated,
not manual!
• Have new insight into customer
needs
SOME LESSONS
➔ Predictive Interaction: Guide and Decide
    ➔ A UX model for AI-assisted, human-driven tasks
➔ DSLs at the center
    ➔ A formal “narrow waist”
    ➔ Targetable to multiple runtimes
    ➔ Provides a modest, factored search space for learning & prediction
➔ Interactive Profiling
    ➔ Continuous data vis feedback during transformation
    ➔ Data profile qua data interface
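To make the "narrow waist" point concrete, here is a small sketch, assuming a toy DSL with only two operations. The step types (`Filter`, `Uppercase`) and the two back ends (`to_python`, `to_sql`) are hypothetical and unrelated to the Wrangle language itself; the SQL generator does no quoting or escaping and is for illustration only.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, str]


@dataclass(frozen=True)
class Filter:
    column: str
    equals: str


@dataclass(frozen=True)
class Uppercase:
    column: str


Step = Union[Filter, Uppercase]


def to_python(steps: List[Step]) -> Callable[[List[Row]], List[Row]]:
    """Runtime 1: interpret the script over in-memory rows."""
    def run(rows: List[Row]) -> List[Row]:
        out = [dict(r) for r in rows]
        for step in steps:
            if isinstance(step, Filter):
                out = [r for r in out if r[step.column] == step.equals]
            elif isinstance(step, Uppercase):
                out = [{**r, step.column: r[step.column].upper()} for r in out]
        return out
    return run


def to_sql(steps: List[Step], table: str, columns: List[str]) -> str:
    """Runtime 2: generate SQL; each step wraps the previous query."""
    query = f"SELECT {', '.join(columns)} FROM {table}"
    for i, step in enumerate(steps):
        if isinstance(step, Filter):
            query = (f"SELECT * FROM ({query}) AS s{i} "
                     f"WHERE {step.column} = '{step.equals}'")
        elif isinstance(step, Uppercase):
            cols = ", ".join(
                f"UPPER({c}) AS {c}" if c == step.column else c for c in columns
            )
            query = f"SELECT {cols} FROM ({query}) AS s{i}"
    return query


# One script, two targets. The rows are made-up sample data.
script = [Filter("level", "error"), Uppercase("level")]
rows = [
    {"level": "error", "msg": "disk full"},
    {"level": "warn", "msg": "cpu high"},
]

local_result = to_python(script)(rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
db.executemany("INSERT INTO logs VALUES (?, ?)",
               [(r["level"], r["msg"]) for r in rows])
sql_result = db.execute(to_sql(script, "logs", ["level", "msg"])).fetchall()

print(local_result)
print(sql_result)
```

Because the script is plain data with a small vocabulary, a predictor only has to search over step types and their parameters, which is one reading of the "modest, factored search space" bullet.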
WHAT CHANGED WITH BIG DATA?
Shift in technology
Data representations
Shift in behavior
Data-driven organizations
Slide 52
Slide 52 text
Shift in behavior
Data-driven organizations
Slide 53
Slide 53 text
By 2017:
marketing spends more on tech than IT.
Data escapes IT
GARTNER GROUP
Slide 54
Slide 54 text
By 2017:
marketing spends more on tech than IT.
Data escapes IT
GARTNER GROUP
By 2020:
90% of IT budget controlled outside of IT.
Slide 55
Slide 55 text
MANY USE CASES
MANY CONSTITUENCIES
MANY INCENTIVES
MANY CONTEXTS
Slide 56
Slide 56 text
Shift in technology
Data representations
Slide 57
Slide 57 text
What does it
mean?
It depends on
the context.
Raw data in the data lake
Simplifies capture
Encourages exploration
Slide 58
Slide 58 text
MANY SCRIPTS
MANY MODELS
MANY APPLICATIONS
MANY CONTEXTS
Slide 59
Slide 59 text
It’s time to establish a bigger context for big data.
Historical context
Because
things change
Behavioral context
Because behavior
determines meaning
Application context
Because truth
is subjective
THE MEANING AND VALUE OF DATA DEPENDS ON CONTEXT
Slide 60
Slide 60 text
APPLICATION CONTEXT
Metadata
Models for interpreting
the data for use
• Data structures
• Semantic structures
• Statistical structures
Theme: services must provide an unopinionated model of context
Slide 61
Slide 61 text
HISTORICAL CONTEXT
Versions
Web logs
Code to extract user/movie rentals
Recommender for movie licensing
Point in time
A promising new
movie is similar to
older hot movies at
time of release!
Trends over time
How does a movie
with these features
fare over time?
Slide 62
Slide 62 text
BEHAVIORAL CONTEXT
Why Dora?!
Lineage & Usage
Slide 63
Slide 63 text
BEHAVIORAL CONTEXT
Lineage & Usage
Data Science
Recommenders
“You should compare
with book sales from
last year.”
Curation Tips
“Logistics staff checks
weather data the 1st
Monday of every
month.”
Proactive
Impact Analysis
“The Twitter analysis
script changed. You
should check the boss’
dashboard!”
Slide 64
Slide 64 text
THE BIG CONTEXT
A NEW WORLD NEEDS NEW SERVICES
Slide 65
Slide 65 text
ABOVEGROUND API TO APPLICATIONS
UNDERGROUND API TO SERVICES
CONTEXT MODEL
COMMON GROUND
Parsing & Featurization
Catalog & Discovery
Wrangling
Analytics & Vis
Reference Data
Data Quality
Reproducibility
Model Serving
Scavenging and Ingestion
Search & Query
Scheduling & Workflow
Versioned Storage
ID & Auth
Slide 66
Slide 66 text
COMMON GROUND
Version-Model-Lineage (VML) Graphs
Model Graphs
Version Graphs
Usage Graphs: Lineage
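A minimal sketch of how version and lineage graphs could support the proactive impact analysis mentioned earlier, assuming a simple in-memory structure. `ContextGraph`, its methods, and the item names are hypothetical and are not the Ground API; the model graph (schemas and other interpretive metadata) is reduced here to a version-to-item mapping.

```python
from collections import defaultdict, deque


class ContextGraph:
    def __init__(self):
        self.parent = {}                 # version graph: version -> previous version
        self.item_of = {}                # version -> logical item it belongs to
        self.derived = defaultdict(set)  # lineage: input version -> output versions

    def add_version(self, version, item, parent=None):
        self.item_of[version] = item
        self.parent[version] = parent

    def add_lineage(self, source, target):
        """Record that `target` was produced using `source`."""
        self.derived[source].add(target)

    def history(self, version):
        """Walk the version graph from a version back to its first ancestor."""
        chain = []
        while version is not None:
            chain.append(version)
            version = self.parent[version]
        return chain

    def downstream(self, version):
        """Breadth-first walk of lineage edges starting at `version`."""
        seen, queue = set(), deque([version])
        while queue:
            for nxt in self.derived.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def impacted_by_change(self, new_version):
        """Items derived from any earlier version of the item that changed."""
        impacted = set()
        for old in self.history(new_version)[1:]:
            impacted |= self.downstream(old)
        return sorted({self.item_of[v] for v in impacted})


g = ContextGraph()
g.add_version("script@1", "twitter_analysis.py")
g.add_version("scores@1", "sentiment_scores")
g.add_version("dash@1", "boss_dashboard")
g.add_lineage("script@1", "scores@1")
g.add_lineage("scores@1", "dash@1")

# The script changes: everything built from the old version may need a check.
g.add_version("script@2", "twitter_analysis.py", parent="script@1")
print(g.impacted_by_change("script@2"))
```

Keeping versions and lineage as separate edge sets lets the same structure answer both "what did this look like at a point in time" and "who depends on this".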
Slide 67
Slide 67 text
RESEARCH OPPORTUNITIES ACROSS THE STACK
Slide 68
Slide 68 text
IN SUM: PEOPLE + DATA + COMPUTATION
➔ Dealing with Data: involves much more than algorithms
➔ Human Component: a huge opportunity for tech innovation
➔ Context is Key: for grounding analysis
68