
People, Computers, and the 
Hot Mess of Real Data

Keynote, KDD 2016

Joe Hellerstein

August 15, 2016

Transcript

  1. People, Computers, and the Hot Mess of Real Data
    Joe Hellerstein

  2. WHO AM I
    ?

  3. THE MISSING THIRD INGREDIENT: PEOPLE
    Research imperative: Dramatically simplify labor-intensive tasks … in the analytic lifecycle.
    2010
    Computing is free.
    Storage is free.
    Data is abundant.
    The remaining bottlenecks lie with people.


  4. A SIDE PROJECT
    dp = datapeople
    http://deepresearch.org


  5. dp (c. 2012)
    Jeff Heer (Stanford)
    Tapan Parikh (Berkeley)
    Maneesh Agrawala (Berkeley)
    Joe Hellerstein (Berkeley)
    Sean Kandel, Diana MacLean, Ravi Parikh
    Kuang Chen, Nicholas Kong, Wesley Willett

  6. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION


  7. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION
    KDD, SIGMOD, SOSP, NIPS, etc.


  8. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION


  9. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION
    Shreddr [Chen et al., DEV 12]
    Wrangler [Kandel et al., CHI 11]
    MADlib [Hellerstein et al., VLDB 12]
    d3 [Bostock et al., InfoVis 11]
    CommentSpace [Willett et al., CHI 11]

  10. THE ANALYTIC LIFECYCLE
    Shreddr
    Wrangler
    MADlib
    d3
    CommentSpace
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION


  11. THREE CHAPTERS
    ➔ Data Acquisition (Shreddr → Captricity)
    ➔ Data Wrangling (Potter’s Wheel → Wrangler → Trifacta)
    ➔ Data Context (Ground)

  12. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION


  13. Data in the First Mile


  14. Extracting value from data without waiting for infrastructure

  15. Shreddr

  16. Shreddr: Columnar Data Entry & Confirmation

  17. Shreddr: Columnar Data Entry & Confirmation
    Select the values that are not: Michael

  18. View Slide

  19. ANALYTICS ENABLEMENT

    Extracting Data from 1M+ Death Claims
    CHALLENGE…
    No easy access to “cause of death” data
    100’s of templates to identify, sort and capture
    UNLOCKED
    Improve fraud detection by leveraging patterns
    found in historical customer data


  20. 20


  21. SOME LESSONS
    ➔ (Problems from the field) × (Ideas from the lab)
    ➔ Apply systems ideas to remove UX bottlenecks
        ➔ Column compression
        ➔ Batch processing & instruction locality
        ➔ Filter pipelines
    ➔ Crowdsourcing: first hints of Human/Machine collaboration
        ➔ Humans as algorithmic agents
        ➔ Challenge: optimize the human work
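
A minimal sketch of the column-compression idea, assuming a hypothetical setup in which machine guesses for one form field are grouped by value so that a worker confirms each distinct value once per batch instead of once per form. The function names and the sample data are illustrative assumptions, not Shreddr's actual interface.

```python
from collections import defaultdict


def compress_column(guesses):
    """Group (form_id, guessed_value) pairs by guessed value.

    Returns a dict mapping each distinct guess to the form ids that share it,
    so one human decision can cover the whole group.
    """
    groups = defaultdict(list)
    for form_id, value in guesses:
        groups[value].append(form_id)
    return dict(groups)


def confirm_batch(groups, target, rejected_ids):
    """Apply one confirmation pass to the group for `target`.

    `rejected_ids` are the forms the worker flagged as NOT matching `target`.
    Everything else in the group is accepted; rejects go back for re-entry.
    """
    rejected = set(rejected_ids)
    accepted = [f for f in groups[target] if f not in rejected]
    retry = [f for f in groups[target] if f in rejected]
    return accepted, retry


# Hypothetical guesses for a "first name" column across four scanned forms.
guesses = [(1, "Michael"), (2, "Michael"), (3, "Mary"), (4, "Michael")]
groups = compress_column(guesses)
accepted, retry = confirm_batch(groups, "Michael", rejected_ids=[4])
```

The worker only marks the exceptions, which is the "select the values that are not" pattern: the cost of a batch scales with the number of mismatches rather than the number of forms.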


  22. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    COLLABORATION
    ACQUISITION


  23. DATA WRANGLING: A USER-CENTRIC TASK
    Designing for Humans, Not Designing for SciFi

  24. Talk to Humans


  25. WHERE DOES THE TIME GO IN ANALYTICS?
    PROCESSING
    ANALYTICS
    80% of the work in any data project is preparing the data.
    Patil, Data Jujitsu, 2012.
    Kandel et al. “Enterprise Data Analysis and Visualization: An Interview Study”, IEEE VAST, 2012.

  26. KANDEL SURVEY
    Interview study of 35 analysts [Kandel et al., VAST 12]
    25 companies: Healthcare; Retail, Marketing; Social networking; Media; Finance, Insurance
    Various titles: Data analyst, Data scientist, Software engineer, Consultant, Chief technical officer

  27. “I spend more than half of my time integrating, cleansing and
    transforming data without doing any actual analysis. Most of
    the time I’m lucky if I get to do any ‘analysis’ at all.”
    Friction
    “Most of the time once you transform the data ... the insights
    can be scarily obvious.”
    Lost potential


  28. “It’s easy to just think you know what you are doing and not look
    at data at every intermediary step.
    An analysis has 30 different steps. It’s tempting to just do this
    then that and then this. You have no idea in which ways you are
    wrong and what data is wrong.”
    Interactivity and Visualization


  29. 29


  30. A PROGRAMMING PROBLEM
    THE DATA TRANSFORMATION PROBLEM
    DATA SOURCE (Complexity): Business System Data, Machine Generated Data, Log Data, …
    DATA TRANSFORMATION
    DATA PRODUCT (Simplicity): Data Visualization, Fraud Detection, Recommendations, …

  31. TRANSFORMATION PROGRAMMING
    Languages: Python, Bash, Ruby, Perl…
    DSLs: DataStep, AJAX, Pandas, dplyr, Wrangle, Ibis…
    Domain Specific Language (DSL)
    Data Output
    write code, compile, run


  32. POTTER’S WHEEL (2001): ENTER THE VISUAL
    ➔ Visual DSL
    ➔ Immediate feedback
    ➔ Ongoing discrepancy detection
    ➔ Data lineage, redo/undo
    [Raman & Hellerstein, VLDB 01]

  33. Lifting from DSL to Visual Language
    Domain Specific Language (DSL)
    Data Output
    write code, compile, run
    Visualization and Interaction
    View Result
    visualize
    interact
    Lift Ground
    compile
    Problem: Remaining burden of specification for users.


  34. My software doesn’t understand
    what I’m trying to do.


  35. I don’t (yet) know
    what I’m trying to do.


  36. HINTS OF INTELLIGENT INTERFACES
    Type-ahead uses context
    and data to predict search
    terms and preview results.


  37. SEARCH QUERY AUTO-COMPLETE
    Search Engine Query
    Textbox Query
    Response Suggestions
    pick
    type
    GUIDE DECIDE
    predict
    The input and output domains are the same: text.
    What about more complex input/output relations?

  38. WRANGLER (2011): ADD INTELLIGENCE
    [Kandel, et al. CHI 11]
    [Guo, et al. UIST11]
    ➔ Automatic inference of transforms
    ➔ Predictive preview of results
    ➔ Interactive history
    ➔ User Studies
    http://vis.stanford.edu/wrangler


  39. TRADITIONAL DATA TRANSFORMATION
    Visualization and Interaction
    Data Transformation Code
    1. User authors a draft transformation script
    2. User tests the script on a small amount of data
    3. User inspects output data to assess effects

  40. PREDICTIVE INTERACTION
    Visualization and Interaction
    Data Transformation Code
    1. User highlights visual features of the data
    2. Algorithms predict a ranked list of scalable transforms
    3. Data previews allow user to choose, adjust and confirm
    GUIDE DECIDE
    Trifacta. Confidential & Proprietary.

  41. PREDICTIVE INTERACTION
    Domain Specific Language (DSL)
    Visualization and Interaction
    Data Output
    write code, compile, run
    View Result
    visualize compile
    Response Preview
    pick
    interact predict
    GUIDE DECIDE
    codegen present
    Lift Ground
    [Heer, Hellerstein, Kandel, CIDR15]
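
The guide/decide loop can be sketched in a few lines, assuming a deliberately tiny transform space: regex extractions inferred from one highlighted substring. The candidate generator, the coverage-based ranking, and the sample rows are all illustrative assumptions, not the inference procedure used by Wrangler or Trifacta.

```python
import re


def candidate_patterns(text, start, end):
    """Propose regexes that could explain the highlighted span text[start:end]."""
    selected = text[start:end]
    candidates = [re.escape(selected)]                 # the literal string itself
    if selected.isdigit():
        candidates.append(r"\d+")                      # any run of digits
        candidates.append(r"\d{%d}" % len(selected))   # digit run of the same length
    elif selected.isalpha():
        candidates.append(r"[A-Za-z]+")                # any run of letters
    return candidates


def first_match(pattern, text):
    """Return the first substring matching `pattern`, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


def rank_transforms(rows, row_index, start, end):
    """GUIDE: the user highlights a span in one row. PREDICT: rank candidates.

    Each result pairs a pattern with a preview of its output on every row,
    ordered by how many rows the pattern covers.
    """
    example = rows[row_index]
    scored = []
    for pattern in candidate_patterns(example, start, end):
        match = re.search(pattern, example)
        # Keep only candidates that reproduce the highlight on the example row.
        if match is None or match.span() != (start, end):
            continue
        preview = [first_match(pattern, row) for row in rows]
        coverage = sum(value is not None for value in preview)
        scored.append((coverage, pattern, preview))
    scored.sort(key=lambda item: item[0], reverse=True)  # stable: ties keep order
    return [(pattern, preview) for _, pattern, preview in scored]


rows = ["order 1234 shipped", "order 87 pending", "order 5678 shipped"]
ranked = rank_transforms(rows, row_index=0, start=6, end=10)  # user highlights "1234"
top_pattern, top_preview = ranked[0]  # DECIDE: the user confirms or picks another
```

The user never writes the regex. The highlight guides the prediction, and the previews let the user decide among ranked alternatives.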


  42. Empowering businesses
    to innovate with data.


  43. Wrangling Web Chat Log Data
    Business Challenge: Understanding web chat interactions to personalize the customer experience
    • In retail banking, web-based self-service has surpassed both in person and call center usage
    • At RBS, 250,000 customer chats per month launched for multiple banking needs
    • Analyzing web chat data can provide valuable information about customer needs and pain points
    Data Challenge: Only 0.01% of web chat logs analyzed due to complexity
    • Large volumes of unstructured, difficult to prep, web chat data being created
    • Only 200 chats manually extracted per month and analyzed for quality assurance
    • Valuable frontline time taken up by manual processing
    • Limited insight into what their customers are speaking to them about
    Trifacta: Providing a self-service solution to wrangle 100% of logs
    • 100% of web chat logs now prepped and analyzed
    • Went from processing 200 logs to 250,000 logs…and now automated, not manual!
    • Have new insight into customer needs

  44. © 2016 Royal Bank of Scotland Group. All rights Reserved
    The classification of this document is PUBLIC.
    “The dashboard is transforming the way I run my business. It is
    improving the customer-centric approach in our chats and it is showing
    in the output that we now see”
    Akshay Vats - Head of Web Chat Operation (India)
    Empowering RBS’s frontline staff


  45. SOME LESSONS
    ➔ Predictive Interaction: Guide and Decide
        ➔ A UX model for AI-assisted, human-driven tasks
    ➔ DSLs at the center
        ➔ A formal “narrow waist”
        ➔ Targetable to multiple runtimes
        ➔ Provides a modest, factored search space for learning & prediction
    ➔ Interactive Profiling
        ➔ Continuous data vis feedback during transformation
        ➔ Data profile qua data interface
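
To illustrate the "narrow waist" point, here is a minimal sketch assuming a hypothetical two-operator DSL (`Filter`, `Project`) with two backends: one runs the script in memory, the other generates SQL text. The operators and function names are assumptions made for illustration and do not correspond to any real Wrangle or Trifacta syntax. The SQL backend also assumes filters only reference columns that are still present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Keep rows where `column` equals `value`."""
    column: str
    value: str


@dataclass(frozen=True)
class Project:
    """Keep only the named columns."""
    columns: tuple


def run_in_memory(script, rows):
    """Runtime 1: interpret the script directly over a list of dicts."""
    for step in script:
        if isinstance(step, Filter):
            rows = [r for r in rows if r[step.column] == step.value]
        elif isinstance(step, Project):
            rows = [{c: r[c] for c in step.columns} for r in rows]
        else:
            raise TypeError(f"unknown step: {step!r}")
    return rows


def to_sql(script, table):
    """Runtime 2: compile the same script to a single SQL SELECT statement."""
    columns, conditions = "*", []
    for step in script:
        if isinstance(step, Filter):
            literal = step.value.replace("'", "''")  # escape single quotes
            conditions.append(f"{step.column} = '{literal}'")
        elif isinstance(step, Project):
            columns = ", ".join(step.columns)
        else:
            raise TypeError(f"unknown step: {step!r}")
    where = " WHERE " + " AND ".join(conditions) if conditions else ""
    return f"SELECT {columns} FROM {table}{where}"


script = [Filter("status", "shipped"), Project(("id",))]
rows = [{"id": 1, "status": "shipped"}, {"id": 2, "status": "pending"}]
local_result = run_in_memory(script, rows)
sql_text = to_sql(script, "orders")
```

Because the script is plain data, the same small operator set can also serve as the search space for a predictor: it ranks candidate steps rather than arbitrary programs.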


  46. ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION


  47. ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION
    CONTEXT


  48. ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    CONTEXT


  49. A broader context for big data
    ground


  50. ground
    A broader context for big data


  51. WHAT CHANGED WITH BIG DATA?
    Shift in technology: Data representations
    Shift in behavior: Data-driven organizations

  52. Shift in behavior: Data-driven organizations

  53. By 2017: marketing spends more on tech than IT.
    Data escapes IT
    GARTNER GROUP

  54. By 2017: marketing spends more on tech than IT.
    Data escapes IT
    GARTNER GROUP
    By 2020: 90% of IT budget controlled outside of IT.

  55. MANY USE CASES
    MANY CONSTITUENCIES
    MANY INCENTIVES
    MANY CONTEXTS


  56. Shift in technology: Data representations

  57. What does it mean?
    It depends on the context.
    Raw data in the data lake
    Simplifies capture
    Encourages exploration

  58. MANY SCRIPTS
    MANY MODELS
    MANY APPLICATIONS
    MANY CONTEXTS


  59. It’s time to establish a bigger context for big data.
    Historical context: Because things change
    Behavioral context: Because behavior determines meaning
    Application context: Because truth is subjective
    THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT

  60. APPLICATION CONTEXT
    Metadata
    Models for interpreting the data for use
    • Data structures
    • Semantic structures
    • Statistical structures
    Theme: services must provide an unopinionated model of context


  61. HISTORICAL CONTEXT
    Versions
    Web logs
    Code to extract user/movie rentals
    Recommender for movie licensing
    Point in time: A promising new movie is similar to older hot movies at time of release!
    Trends over time: How does a movie with these features fare over time?

  62. BEHAVIORAL CONTEXT
    Why Dora?!
    Lineage & Usage


  63. BEHAVIORAL CONTEXT
    Lineage & Usage
    Data Science Recommenders: “You should compare with book sales from last year.”
    Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
    Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”

  64. THE BIG CONTEXT
    A NEW WORLD NEEDS NEW SERVICES

  65. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth

  66. COMMON GROUND
    Version-Model-Lineage (VML) Graphs
    Model Graphs
    Version Graphs
    Usage Graphs: Lineage
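
A minimal sketch of how version and lineage graphs could support the kind of proactive impact analysis mentioned earlier in the talk. The `ContextGraph` class, its methods, and the artifact names are hypothetical and are not Ground's API; model graphs are left out to keep the example short.

```python
from collections import defaultdict, deque


class ContextGraph:
    """Toy store for two of the three graph types: versions and lineage."""

    def __init__(self):
        self.parents = {}                # version graph: version id -> parent id
        self.derived = defaultdict(set)  # lineage graph: source -> derived artifacts

    def add_version(self, version, parent=None):
        """Record that `version` succeeds `parent` (None for a first version)."""
        self.parents[version] = parent

    def add_lineage(self, source, derived):
        """Record that `derived` was produced from `source`."""
        self.derived[source].add(derived)

    def history(self, version):
        """Walk the version graph from `version` back to its first ancestor."""
        chain = []
        while version is not None:
            chain.append(version)
            version = self.parents.get(version)
        return chain

    def impacted(self, changed):
        """Breadth-first walk of lineage edges: everything downstream of a change."""
        seen, queue = set(), deque([changed])
        while queue:
            node = queue.popleft()
            for successor in self.derived.get(node, ()):
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
        return seen


graph = ContextGraph()
graph.add_version("twitter_script@v1")
graph.add_version("twitter_script@v2", parent="twitter_script@v1")
graph.add_lineage("twitter_script", "sentiment_table")
graph.add_lineage("sentiment_table", "boss_dashboard")

script_history = graph.history("twitter_script@v2")
needs_review = graph.impacted("twitter_script")
```

When the script gains a new version, the lineage walk names the downstream artifacts to recheck, including the dashboard.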


  67. ABOVEGROUND API TO APPLICATIONS
    UNDERGROUND API TO SERVICES
    CONTEXT MODEL
    COMMON GROUND
    Parsing & Featurization
    Catalog & Discovery
    Wrangling
    Analytics & Vis
    Reference Data
    Data Quality
    Reproducibility
    Model Serving
    Scavenging and Ingestion
    Search & Query
    Scheduling & Workflow
    Versioned Storage
    ID & Auth
    RESEARCH OPPORTUNITIES ACROSS THE STACK

  68. IN SUM: PEOPLE + DATA + COMPUTATION
    ➔ Dealing with Data: involves much more than algorithms
    ➔ Human Component: a huge opportunity for tech innovation
    ➔ Context is Key: for grounding analysis

  69. @joe_hellerstein
    [email protected]
