People, Computers, and the Hot Mess of Real Data

Keynote, KDD 2016

Joe Hellerstein

August 15, 2016

Transcript

  1. People, Computers, and the Hot Mess of Real Data
    Joe Hellerstein

  2. THE MISSING THIRD INGREDIENT: PEOPLE
    2010
    Computing is free.
    Storage is free.
    Data is abundant.
    The remaining bottlenecks lie with people.
    Research imperative: Dramatically simplify labor-intensive tasks … in the analytic lifecycle.

  3. A SIDE PROJECT
    dp = datapeople
    http://deepresearch.org

  4. dp (c. 2012)
    Jeff Heer (Stanford)
    Tapan Parikh (Berkeley)
    Maneesh Agrawala (Berkeley)
    Joe Hellerstein (Berkeley)
    Sean Kandel
    Diana MacLean
    Ravi Parikh
    Kuang Chen
    Nicholas Kong
    Wesley Willett

  5. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION

  6. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION
    KDD, SIGMOD, SOSP, NIPS, etc.

  7. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION

  8. THE ANALYTIC LIFECYCLE
    ACQUISITION: Shreddr [Chen et al., DEV 12]
    TRANSFORMATION: Wrangler [Kandel et al., CHI 11]
    ANALYSIS: MADlib [Hellerstein et al., VLDB 12]
    VISUALIZATION: d3 [Bostock et al., InfoVis 11]
    DECIDE/DEPLOY: CommentSpace [Willett et al., CHI 11]
    ACQUISITION

  9. THE ANALYTIC LIFECYCLE
    Shreddr
    Wrangler
    MADlib
    d3
    CommentSpace
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION

  10. THREE CHAPTERS
    ➔ Data Acquisition (Shreddr → Captricity)
    ➔ Data Wrangling (Potter’s Wheel → Wrangler → Trifacta)
    ➔ Data Context (Ground)

  11. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION

  12. Data in the First Mile

  13. Extracting value from data without waiting for infrastructure

  14. Shreddr: Columnar Data Entry & Confirmation

  15. Shreddr: Columnar Data Entry & Confirmation
    Select the values that are not: Michael

  16. ANALYTICS ENABLEMENT
    Extracting Data from 1M+ Death Claims
    CHALLENGE…
    No easy access to “cause of death” data
    100s of templates to identify, sort and capture
    UNLOCKED
    Improve fraud detection by leveraging patterns found in historical customer data

  17. SOME LESSONS
    ➔ (Problems from the field) × (Ideas from the lab)
    ➔ Apply systems ideas to remove UX bottlenecks
        ➔ Column compression
        ➔ Batch processing & instruction locality
        ➔ Filter pipelines
    ➔ Crowdsourcing: first hints of Human/Machine collaboration
        ➔ Humans as algorithmic agents
        ➔ Challenge: optimize the human work
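
The systems ideas listed on this slide (column-oriented batching, then a filter step) can be sketched in a few lines of Python. This is only an illustration, not Shreddr's implementation: the `(form_id, field, guess)` record shape, the function names `column_batches` and `confirm`, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict


def column_batches(cells):
    """Group cells by (field, guessed value) so one human check covers many forms.

    `cells` is an iterable of (form_id, field, guess) tuples. This record
    shape is hypothetical and chosen only for illustration.
    """
    batches = defaultdict(list)
    for form_id, field, guess in cells:
        batches[(field, guess)].append(form_id)
    # Largest batches first: each confirmation then clears the most cells.
    return sorted(batches.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def confirm(batch_form_ids, rejected):
    """Filter step: forms the worker flags as 'not this value' go back for re-entry."""
    accepted = [f for f in batch_form_ids if f not in rejected]
    retry = [f for f in batch_form_ids if f in rejected]
    return accepted, retry


# Hypothetical machine guesses for cells cut out of scanned forms.
cells = [
    (1, "first_name", "Michael"),
    (2, "first_name", "Michael"),
    (3, "first_name", "Michael"),
    (4, "first_name", "Michelle"),
    (1, "age", "34"),
]
batches = column_batches(cells)
(field, guess), form_ids = batches[0]
# The worker sees all "Michael" cells at once and rejects form 3.
accepted, retry = confirm(form_ids, rejected={3})
```

Batching by value is what lets a single prompt of the form "select the values that are not X" replace many separate keying tasks; only the rejected cells fall through to a slower entry step.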

  18. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    COLLABORATION
    ACQUISITION

  19. DATA WRANGLING: A USER-CENTRIC TASK
    Designing for Humans, Not Designing for SciFi

  20. Talk to Humans

  21. WHERE DOES THE TIME GO IN ANALYTICS?
    PROCESSING
    ANALYTICS
    80% of the work in any data project is preparing the data.
    Patil, Data Jujitsu, 2012.
    Kandel et al., “Enterprise Data Analysis and Visualization: An Interview Study”, IEEE VAST, 2012.

  22. KANDEL SURVEY
    Interview study of 35 analysts [Kandel et al., VAST12]
    25 companies
        Healthcare
        Retail, Marketing
        Social networking
        Media
        Finance, Insurance
    Various titles
        Data analyst
        Data scientist
        Software engineer
        Consultant
        Chief technical officer

  23. “I spend more than half of my time integrating, cleansing and
    transforming data without doing any actual analysis. Most of
    the time I’m lucky if I get to do any ‘analysis’ at all.”
    Friction
    “Most of the time once you transform the data ... the insights
    can be scarily obvious.”
    Lost potential

  24. “It’s easy to just think you know what you are doing and not look
    at data at every intermediary step.
    An analysis has 30 different steps. It’s tempting to just do this
    then that and then this. You have no idea in which ways you are
    wrong and what data is wrong.”
    Interactivity and Visualization

  25. A PROGRAMMING PROBLEM
    THE DATA TRANSFORMATION PROBLEM
    DATA SOURCE (Complexity): Business System Data, Machine Generated Data, Log Data, …
    DATA TRANSFORMATION
    DATA PRODUCT (Simplicity): Data Visualization, Fraud Detection, Recommendations, …

  26. TRANSFORMATION PROGRAMMING
    Languages: Python, Bash, Ruby, Perl…
    DSLs: DataStep, AJAX, Pandas, dplyr, Wrangle, Ibis…
    Domain Specific Language (DSL)
    Data Output
    write code, compile, run

  27. POTTER’S WHEEL (2001): ENTER THE VISUAL
    ➔ Visual DSL
    ➔ Immediate feedback
    ➔ Ongoing discrepancy detection
    ➔ Data lineage, redo/undo
    [Raman & Hellerstein, VLDB 01]

  28. Lifting from DSL to Visual Language
    Domain Specific Language (DSL)
    Data Output
    write code, compile, run
    Visualization and Interaction
    View Result
    visualize
    interact
    Lift Ground
    compile
    Problem: Remaining burden of specification for users.

  29. My software doesn’t understand
    what I’m trying to do.

  30. I don’t (yet) know
    what I’m trying to do.

  31. HINTS OF INTELLIGENT INTERFACES
    Type-ahead uses context
    and data to predict search
    terms and preview results.

  32. SEARCH QUERY AUTO-COMPLETE
    Search Engine Query
    Textbox Query
    Response Suggestions
    pick
    type
    GUIDE DECIDE
    predict
    The input and output domains are the same: text.
    What about more complex input/output relations?

  33. WRANGLER (2011): ADD INTELLIGENCE
    [Kandel, et al. CHI 11]
    [Guo, et al. UIST11]
    ➔ Automatic inference of transforms
    ➔ Predictive preview of results
    ➔ Interactive history
    ➔ User Studies
    http://vis.stanford.edu/wrangler

  34. TRADITIONAL DATA TRANSFORMATION
    Visualization and Interaction
    Data Transformation Code
    1. User authors a draft transformation script
    2. User tests the script on a small amount of data
    3. User inspects output data to assess effects

  35. PREDICTIVE INTERACTION
    Visualization and Interaction
    Data Transformation Code
    1. User highlights visual features of the data
    2. Algorithms predict a ranked list of scalable transforms
    3. Data previews allow user to choose, adjust and confirm
    GUIDE DECIDE
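
The highlight, predict, and preview loop on this slide can be sketched in Python. This is a minimal illustration, not the algorithm used by Wrangler or Trifacta: the candidate generators, the coverage-based ranking, and the names `candidates`, `guide`, and `preview` are all assumptions made for the example.

```python
import re


def candidates(highlight):
    """Hypothetical candidate generators: map a highlighted substring to
    (description, regex) pairs. A real transform DSL would be far richer."""
    pats = [("extract literal text", re.escape(highlight))]
    if highlight.isdigit():
        pats.append(("extract digit run", r"\d+"))
    if highlight.isalpha():
        pats.append(("extract letter run", r"[A-Za-z]+"))
    return pats


def guide(rows, highlight):
    """Guide: rank candidate transforms by how many rows they match."""
    scored = []
    for desc, pat in candidates(highlight):
        rx = re.compile(pat)
        coverage = sum(1 for r in rows if rx.search(r))
        scored.append((coverage, desc, pat))
    # Stable sort: ties keep the order in which candidates were generated.
    scored.sort(key=lambda s: -s[0])
    return scored


def preview(rows, pattern, n=3):
    """Decide: show the effect on a few rows so the user can confirm or adjust."""
    rx = re.compile(pattern)
    out = []
    for r in rows[:n]:
        m = rx.search(r)
        out.append(m.group(0) if m else None)
    return out


rows = ["order 1042 shipped", "order 77 pending", "order 9 cancelled"]
ranked = guide(rows, highlight="1042")  # the user highlights "1042" in the first row
best_coverage, best_desc, best_pattern = ranked[0]
sample = preview(rows, best_pattern)
```

In this sketch the general "digit run" transform outranks the literal match because it applies to every row, and the preview lets the user check that before committing. The user never writes the pattern; they only choose among ranked suggestions.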

  36. PREDICTIVE INTERACTION
    Domain Specific Language (DSL)
    Visualization and Interaction
    Data Output
    write code, compile, run
    View Result
    visualize compile
    Response Preview
    pick
    interact predict
    GUIDE DECIDE
    codegen present
    Lift Ground
    [Heer, Hellerstein, Kandel, CIDR15]

  37. Empowering businesses
    to innovate with data.

  38. Wrangling Web Chat Log Data
    Business Challenge: Understanding web chat interactions to personalize the customer experience
    • In retail banking, web-based self-service has surpassed both in person and call center usage
    • At RBS, 250,000 customer chats per month launched for multiple banking needs
    • Analyzing web chat data can provide valuable information about customer needs and pain points
    Data Challenge: Only 0.01% of web chat logs analyzed due to complexity
    • Large volumes of unstructured, difficult to prep, web chat data being created
    • Only 200 chats manually extracted per month and analyzed for quality assurance
    • Valuable frontline time taken up by manual processing
    • Limited insight into what their customers are speaking to them about
    Trifacta: Providing a self-service solution to wrangle 100% of logs
    • 100% of web chat logs now prepped and analyzed
    • Went from processing 200 logs to 250,000 logs…and now automated, not manual!
    • Have new insight into customer needs

  39. Empowering RBS’s frontline staff
    “The dashboard is transforming the way I run my business. It is improving the customer-centric approach in our chats and it is showing in the output that we now see”
    Akshay Vats - Head of Web Chat Operation (India)
    © 2016 Royal Bank of Scotland Group. All rights reserved.
    The classification of this document is PUBLIC.

  40. SOME LESSONS
    ➔ Predictive Interaction: Guide and Decide
        ➔ A UX model for AI-assisted, human-driven tasks
    ➔ DSLs at the center
        ➔ A formal “narrow waist”
        ➔ Targetable to multiple runtimes
        ➔ Provides a modest, factored search space for learning & prediction
    ➔ Interactive Profiling
        ➔ Continuous data vis feedback during transformation
        ➔ Data profile qua data interface
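
The "narrow waist" point can be made concrete with a toy example: one small transform script, two runtimes. The Python below is a sketch under stated assumptions and does not reflect any real product's DSL. The operations `Keep` and `Select`, the functions `run_python` and `to_sql`, and the `tickets` table are all hypothetical, and the SQL generator does no escaping.

```python
from dataclasses import dataclass


# A toy transform DSL with two operations. Names and fields are hypothetical.
@dataclass(frozen=True)
class Keep:
    """Keep rows where `column` equals `value`."""
    column: str
    value: str


@dataclass(frozen=True)
class Select:
    """Project a subset of columns."""
    columns: tuple


def run_python(script, rows):
    """Runtime 1: interpret the script over a list of dicts."""
    for op in script:
        if isinstance(op, Keep):
            rows = [r for r in rows if r[op.column] == op.value]
        elif isinstance(op, Select):
            rows = [{c: r[c] for c in op.columns} for r in rows]
    return rows


def to_sql(script, table):
    """Runtime 2: compile the same script to a SQL string.

    Illustration only: values are not escaped, and filters are assumed to
    come before the projection.
    """
    cols, preds = "*", []
    for op in script:
        if isinstance(op, Keep):
            preds.append(f"{op.column} = '{op.value}'")
        elif isinstance(op, Select):
            cols = ", ".join(op.columns)
    where = f" WHERE {' AND '.join(preds)}" if preds else ""
    return f"SELECT {cols} FROM {table}{where}"


script = [Keep("status", "open"), Select(("id", "owner"))]
rows = [
    {"id": 1, "status": "open", "owner": "ana"},
    {"id": 2, "status": "closed", "owner": "bo"},
]
result = run_python(script, rows)
sql = to_sql(script, "tickets")
```

Because the script is plain data drawn from a small set of operations, the same structure that makes it portable across runtimes also keeps the space of possible next steps small, which is what makes prediction over it tractable.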

  41. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION

  42. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    ACQUISITION
    CONTEXT

  43. THE ANALYTIC LIFECYCLE
    ACQUISITION
    TRANSFORMATION
    ANALYSIS
    VISUALIZATION
    DECIDE/DEPLOY
    CONTEXT

  44. A broader context for big data
    ground

  45. ground
    A broader context for big data

  46. WHAT CHANGED WITH BIG DATA?
    Shift in technology: Data representations
    Shift in behavior: Data-driven organizations

  47. Shift in behavior: Data-driven organizations

  48. Data escapes IT
    By 2017: marketing spends more on tech than IT.
    GARTNER GROUP

  49. Data escapes IT
    By 2017: marketing spends more on tech than IT.
    GARTNER GROUP
    By 2020: 90% of IT budget controlled outside of IT.

  50. MANY USE CASES
    MANY CONSTITUENCIES
    MANY INCENTIVES
    MANY CONTEXTS

  51. Shift in technology: Data representations

  52. What does it mean? It depends on the context.
    Raw data in the data lake
    Simplifies capture
    Encourages exploration

  53. MANY SCRIPTS
    MANY MODELS
    MANY APPLICATIONS
    MANY CONTEXTS

  54. It’s time to establish a bigger context for big data.
    Historical context: Because things change
    Behavioral context: Because behavior determines meaning
    Application context: Because truth is subjective
    THE MEANING AND VALUE OF DATA DEPEND ON CONTEXT

  55. APPLICATION CONTEXT
    Metadata
    Models for interpreting the data for use
    • Data structures
    • Semantic structures
    • Statistical structures
    Theme: services must provide an unopinionated model of context

  56. HISTORICAL CONTEXT
    Versions
    Web logs; Code to extract user/movie rentals; Recommender for movie licensing
    Point in time: A promising new movie is similar to older hot movies at time of release!
    Trends over time: How does a movie with these features fare over time?

  57. BEHAVIORAL CONTEXT
    Why Dora?!
    Lineage & Usage

  58. BEHAVIORAL CONTEXT
    Lineage & Usage
    Data Science Recommenders: “You should compare with book sales from last year.”
    Curation Tips: “Logistics staff checks weather data the 1st Monday of every month.”
    Proactive Impact Analysis: “The Twitter analysis script changed. You should check the boss’ dashboard!”

  59. THE BIG CONTEXT
    A NEW WORLD NEEDS NEW SERVICES

  60. ABOVEGROUND API TO APPLICATIONS
    Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
    COMMON GROUND
    CONTEXT MODEL
    UNDERGROUND API TO SERVICES
    Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth

  61. COMMON GROUND
    Version-Model-Lineage (VML) Graphs
    Model Graphs
    Version Graphs
    Usage Graphs: Lineage
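
Two of the three graph kinds named on this slide, version graphs and lineage graphs, can be sketched as a small Python data structure. This is a toy illustration and not Ground's data model or API: the class `ContextGraph`, its methods, and the artifact names are hypothetical, and model graphs are left out entirely.

```python
from collections import defaultdict, deque


class ContextGraph:
    """Toy store for version edges and lineage edges between named artifacts."""

    def __init__(self):
        self.parent = {}                 # version graph: version -> predecessor
        self.derived = defaultdict(set)  # lineage graph: input -> outputs

    def add_version(self, version, parent=None):
        self.parent[version] = parent

    def add_lineage(self, source, output):
        self.derived[source].add(output)

    def history(self, version):
        """Walk the version graph back to the first recorded version."""
        chain = []
        while version is not None:
            chain.append(version)
            version = self.parent.get(version)
        return chain

    def impacted(self, artifact):
        """Breadth-first walk of lineage edges: everything downstream of `artifact`."""
        seen, queue = set(), deque([artifact])
        while queue:
            node = queue.popleft()
            for out in sorted(self.derived.get(node, ())):
                if out not in seen:
                    seen.add(out)
                    queue.append(out)
        return seen


# Hypothetical artifacts, loosely following the talk's impact-analysis example.
g = ContextGraph()
g.add_version("twitter_script@v1")
g.add_version("twitter_script@v2", parent="twitter_script@v1")
g.add_lineage("twitter_script@v2", "sentiment_table")
g.add_lineage("sentiment_table", "boss_dashboard")

versions = g.history("twitter_script@v2")
downstream = g.impacted("twitter_script@v2")
```

With both edge kinds recorded, "the script changed, check the dashboard" becomes a graph query: a new version node appears, and a downstream walk over lineage edges lists what may need review.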

  62. ABOVEGROUND API TO APPLICATIONS
    Parsing & Featurization, Catalog & Discovery, Wrangling, Analytics & Vis, Reference Data, Data Quality, Reproducibility, Model Serving
    COMMON GROUND
    CONTEXT MODEL
    UNDERGROUND API TO SERVICES
    Scavenging and Ingestion, Search & Query, Scheduling & Workflow, Versioned Storage, ID & Auth
    RESEARCH OPPORTUNITIES ACROSS THE STACK

  63. IN SUM: PEOPLE + DATA + COMPUTATION
    ➔ Dealing with Data: involves much more than algorithms
    ➔ Human Component: a huge opportunity for tech innovation
    ➔ Context is Key: for grounding analysis