
Big Data and Predictive Modeling


Web We Can 2015

Olivier Grisel

March 21, 2015

Transcript

  1. Big Data and
    Predictive Modeling
    Olivier Grisel — @ogrisel
    Web We Can
    March 21, 2015


  2. About me


  3. Big Data as a
    buzzword


  4. Triumph of the Nerds: Nate
    Silver Wins in 50 States
    http://mashable.com/2012/11/07/nate-silver-wins/



  6. Nate Silver’s election model,
    Big Data?
    $ git clone gh:jseabold/538model
    $ du -h 538model/data
    188K 538model/data


  7. 15% of the capacity of a 3.5-inch floppy disk


  8. Regionator 3000
    http://labs.data-publica.com/regionator3000/


  9. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/



  11. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10



  13. 120% of the capacity of a 3.5-inch floppy disk


  14. Big Data

    Predictive Analytics


  15. Predictive Analytics

    Descriptive Analytics


  16. Goals of this talk
    • What Big Data actually is or isn’t
    • Introduce predictive modeling concepts
    • Contrast predictive analytics vs descriptive
    analytics


  17. How big is Big Data?


  18. – Wikipedia
    “Big data is a blanket term for any collection of
    data sets so large and complex that it
    becomes difficult to process using on-hand
    database management tools or traditional data
    processing applications.”


  19. Not Big Data
    • Data that fits in a spreadsheet
    • Data that can be analyzed in RAM (< 10 GB)
    • Data operations that can be performed quickly
    by a traditional database, e.g. single node
    PostgreSQL server


  20. Reading the full content of a 1TB
    HDD at 100MB/s:
    2 hours 45 minutes
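
    The figure on this slide is plain arithmetic, sketched below under the assumption of decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a constant throughput. The helper names are made up for illustration. The exact result is 2 h 46 min 40 s, which the slide rounds to 2 hours 45 minutes; the same calculation gives the "more than 10 days" quoted later for 100TB.

```python
def sequential_read_seconds(size_bytes, bytes_per_second):
    """Time needed to stream size_bytes once at a constant throughput."""
    return size_bytes / bytes_per_second


def as_hms(seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


MB = 10 ** 6   # decimal megabyte (assumption)
TB = 10 ** 12  # decimal terabyte (assumption)

one_tb = sequential_read_seconds(1 * TB, 100 * MB)
print(one_tb, as_hms(one_tb))      # 10000.0 (2, 46, 40)

hundred_tb = sequential_read_seconds(100 * TB, 100 * MB)
print(round(hundred_tb / 86400, 1), "days")  # 11.6 days
```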


  21. Canonical Big Data problem:
    indexing the Web
    • Inverted index on terabytes of text data
    • Process each HTML page as a URL + bag of
    words
    • For each word, aggregate the list of page URLs
    • 2 billion HTML pages: 100TB
    >10 days just to read sequentially
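
    The indexing steps listed on this slide can be sketched in memory as follows. This is only a toy illustration: the page URLs and texts are invented, tokenization is a naive whitespace split, and a real Web-scale index would be built with a distributed system rather than a single dict.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of URLs of the pages containing it.

    pages: dict of URL -> page text (each page is treated as a bag of words).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        # Bag of words: word order and repetition are ignored.
        for word in set(text.lower().split()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical pages, for illustration only.
PAGES = {
    "http://a.example/": "Big data talk",
    "http://b.example/": "small data",
}

INDEX = build_inverted_index(PAGES)
print(INDEX["data"])  # ['http://a.example/', 'http://b.example/']
```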



  25. Other Big Data examples
    • GSM location event log from telco
    • Transaction log of a big retail network
    • Raw traffic data on a large website
    • Activity records for a service with tens of millions
    of users


  26. Not Big Data
    • Poll data (~10K data points)
    • Census data (~10M data points)
    • Real estate transaction data (~10M data points)
    • Any dataset publicly available for download


  27. What is Predictive
    Analytics?
    and Machine Learning


  28. Type       # rooms  Surface (m2)  Floor  Public Transports  Sold
      Apartment  3        65            2      Yes                300k
      House      5        110           NA     No                 1.5M
      Duplex     4        95            4      Yes                2.2M


  29. Type       # rooms  Surface (m2)  Floor  Public Transports  Sold
      Apartment  3        65            2      Yes                300k
      House      5        110           NA     No                 1.5M
      Duplex     4        95            4      Yes                2.2M
      (features: columns Type to Public Transports; samples: rows; target: Sold column)


  30. Type       # rooms  Surface (m2)  Floor  Public Transports  Sold
      Apartment  3        65            2      Yes                300k
      House      5        110           NA     No                 1.5M
      Duplex     4        95            4      Yes                2.2M
      Apartment  2        35            3      Yes                ?
      (features: columns Type to Public Transports; samples: rows; target: Sold column)


  31. Predictive Modeling
    • Automated predictions of outcome on new data
    • Alternative to hard-coded rules written by
    experts
    • Extract the structure of historical data
    • Statistical tools to summarize the training data
    into an executable predictive model
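
    As a rough illustration of "summarizing training data into an executable model", the sketch below fits a 1-nearest-neighbor predictor on the three housing samples from the earlier table and queries it for the new apartment. The choice of 1-nearest-neighbor, the class name, and the use of only two numeric features (rooms, surface) are assumptions made for brevity; the talk does not say which algorithm would be used.

```python
from math import sqrt

# Training samples from the housing table: (rooms, surface in m2) -> price.
# Only two numeric features are kept here (an assumption for brevity).
TRAINING = [
    ((3, 65), 300_000),      # Apartment
    ((5, 110), 1_500_000),   # House
    ((4, 95), 2_200_000),    # Duplex
]


class NearestNeighborModel:
    """Hypothetical minimal model: predict the price of the closest sample."""

    def fit(self, samples):
        self.samples = list(samples)
        return self

    def predict(self, features):
        def distance(sample):
            known_features, _ = sample
            return sqrt(sum((a - b) ** 2
                            for a, b in zip(known_features, features)))

        _, price = min(self.samples, key=distance)
        return price


model = NearestNeighborModel().fit(TRAINING)
# New sample from the table: apartment with 2 rooms and 35 m2.
print(model.predict((2, 35)))  # 300000
```

    With so few samples the prediction is not meaningful; the point is only the fit / predict split between historical data and new data.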


  32. Predictive Modeling Data Flow (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model


  33. Predictive Modeling Data Flow (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
    Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label


  34. Descriptive
    vs
    Predictive


  35. Descriptive Statistics
    • Ex: Sales by (day, month, year) x region
    • Graphical visualization: get insights, tell a story to
    explain what’s happening in the data
    • Realm of Business Intelligence: reports & dashboards
    for managers
    • A wrong decision can be very costly
    • Small number of important decisions made by a
    human


  36. Predictive Statistics
    • Ex: Movie recommendations or targeted ads
    • Embedded in a user service to make it more
    useful / attractive / profitable
    • A wrong individual decision is not costly
    • Large number of small automated decisions
    • Humans would not be fast enough to make the
    predictions


  37. Mixed models
    • Predictive modeling to identify interesting
    subsets of the data
    • Ex: Fraud detection, churn forecasting
    • Help human decision makers focus on important
    cases
    • Human expert feedback to improve predictive
    models


  38. Key takeaway points:


  39. • Big Data ≠ Predictive Analytics
    • Predictive Analytics
      • Automated decision making embedded in products
        (e.g. recommenders)
      • Individual bad decisions are typically not costly
    • Descriptive Analytics
      • Business Intelligence: human decision making
      • Individual bad decisions can be very costly


  40. Thank you!
    Questions?
    @Inria @ogrisel http://scikit-learn.org


  41. Bonus tracks


  42. Back to the Regionator
    What if we did not have census data on daily mobility?


  43. Back to the Regionator
    • Use raw daily telco logs
    • Group By (phone, day) to
    extract daily trips
    • Join By GPS coordinates to
    “departement” names
    • Filter out small trips
    • Group By (home, work)
    “departements”
    • Count
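
    The pipeline on this slide can be sketched on a handful of in-memory records. Everything concrete below is an assumption made for illustration: the log layout, the toy coordinate-to-"departement" lookup, taking the first event of the day as home and the farthest event as work, and the 5 km threshold for "small trips". Real telco logs would need a distributed Group By / Join rather than Python dicts.

```python
from collections import Counter, defaultdict
from math import asin, cos, radians, sin, sqrt

# Hypothetical log records: (phone_id, day, hour, latitude, longitude).
LOGS = [
    ("p1", "2015-03-02", 8, 48.86, 2.35),   # Paris
    ("p1", "2015-03-02", 14, 48.80, 2.13),  # Versailles
    ("p2", "2015-03-02", 7, 48.80, 2.13),   # Versailles
    ("p2", "2015-03-02", 13, 48.86, 2.35),  # Paris
    ("p3", "2015-03-02", 8, 48.86, 2.35),   # Paris
    ("p3", "2015-03-02", 12, 48.87, 2.36),  # Paris, roughly 1 km away
    ("p1", "2015-03-03", 8, 48.86, 2.35),
    ("p1", "2015-03-03", 15, 48.80, 2.13),
]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def departement_of(point):
    """Toy stand-in for the join on GPS coordinates (assumption)."""
    return "Yvelines" if point[1] < 2.25 else "Paris"


def commute_counts(logs, min_km=5.0):
    # Group By (phone, day) to extract daily trips.
    by_phone_day = defaultdict(list)
    for phone, day, hour, lat, lon in logs:
        by_phone_day[(phone, day)].append((hour, (lat, lon)))

    counts = Counter()
    for events in by_phone_day.values():
        events.sort()
        home = events[0][1]  # first event of the day (assumption)
        work = max((p for _, p in events),
                   key=lambda p: haversine_km(home, p))
        # Filter out small trips.
        if haversine_km(home, work) < min_km:
            continue
        # Group By (home, work) "departements" and count.
        counts[(departement_of(home), departement_of(work))] += 1
    return counts


print(commute_counts(LOGS))
```

    The output is a few counts per pair of "departements": the raw log is large, the aggregate is tiny.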


  44. Tools for predictive
    analytics


  45. SPSS
    MATLAB



  47. Small data (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
    Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label


  48. Small / Medium data (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
    Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label


  49. Small / Medium data with [image] (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
    Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label



  53. Predictive Analytics on
    Big Data


  54. Big data with [image] (diagram)
    Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
    Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label



  58. BIG DATA


  59. BIG DATA small(er) data



  61. From Big to Small
    • Feature extraction often shrinks data
    • Filter / Join / Group By / Count
    • Machine Learning performed on aggregates
    • Sampling for fast in-memory iterative modeling
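
    One possible way to do the sampling step when the data does not fit in memory is reservoir sampling, sketched below. The talk does not name a sampling method, so this choice (and the function name) is an assumption; it keeps a uniform random sample of k items while reading the stream exactly once.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly from an iterable of unknown length.

    Uses a single pass and O(k) memory (classic "Algorithm R").
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a large log that is only read sequentially.
sample = reservoir_sample(range(100_000), k=100)
print(len(sample))  # 100
```

    The resulting sample is small enough to iterate on models in memory, which is the workflow the slide describes.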


  62. Data size and
    modeling quality


  63. – Peter Norvig, Research Director, Google
    “We don’t have better algorithms. We just have
    more data.”



  67. More data beats
    better models?


  68. http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html


  69. Let’s train a parametric model
    to read handwritten digits from gray level pixels.



  71. model stops improving



  73. Bias vs Variance


  74. high bias


  75. high bias
    high variance


  76. high bias
    high variance
    low variance


  77. Variance solution #1:
    collect more samples


  78. Let’s train a non-parametric model
    to read handwritten digits from gray level pixels.



  80. high variance
    almost no bias
    variance decreasing with #samples


  81. Bias solution #1:
    non-parametric models
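
    The contrast drawn across the last slides can be reproduced on a toy problem. The sketch below is not the handwritten digits experiment from the talk: it assumes a noise-free 1-D target, sin(2πx), a straight line as the parametric model, and 1-nearest-neighbor as the non-parametric one. All function names are hypothetical. Each curve entry is (number of samples, training error, test error).

```python
import math
import random


def target(x):
    return math.sin(2 * math.pi * x)


def make_data(n, rng):
    xs = [rng.random() for _ in range(n)]
    return xs, [target(x) for x in xs]


def fit_line(xs, ys):
    """Parametric model: least-squares straight line (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_1nn(xs, ys):
    """Non-parametric model: memorize the samples, answer with the closest."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(fit, sizes, seed=0):
    rng = random.Random(seed)
    test_x, test_y = make_data(200, rng)
    curve = []
    for n in sizes:
        train_x, train_y = make_data(n, rng)
        model = fit(train_x, train_y)
        curve.append((n, mse(model, train_x, train_y),
                      mse(model, test_x, test_y)))
    return curve


SIZES = [10, 100, 1000]
line_curve = learning_curve(fit_line, SIZES)
nn_curve = learning_curve(fit_1nn, SIZES)
for row in line_curve + nn_curve:
    print(row)
```

    On this toy problem the line's test error stops improving however many samples are added, while the nearest-neighbor test error keeps shrinking with the number of samples.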


  82. Type    # rooms  Surface (m2)  Floor  Public Transp.
      Apart.  3        65            2      Yes
      House   5        110           NA     No
      Duplex  4        95            4      Yes
      (features: columns; samples: rows)


  83. Type    # rooms  Surface (m2)  Floor  Public Transp.  School (km)  Flood plain
      Apart.  3        65            2      Yes             1.0          No
      House   5        110           NA     No              25.0         Yes
      Duplex  4        95            4      Yes             0.5          No
      (features: columns; samples: rows)


  84. Bias solution #2:
    richer features


  85. Data has 2 dimensions:
    # samples and # features



  87. • Parametric e.g. linear model (traditional stats) vs
    Non-parametric e.g. Random Forests, Neural
    Networks (Machine Learning)
    • Understand a model with 10% accuracy vs blindly
    trust a model with 90% accuracy
    • Simple models e.g. F = m a, F = - G m1 m2 /
    r^2 will not become false(r) because of big data
    • New problems can be tackled: computer vision,
    speech recognition, natural language
    understanding


  88. • The (experimental) scientific method, as characterized
    by Karl Popper, is based on the falsifiability of
    formulated hypotheses
    • A theory is considered correct as long as its past
    predictions hold in new experiments
    • Machine learning train-validation-test splits and
    cross-validation are similar in spirit
    • An ML model is just a complex theory: correct as
    long as its predictions still hold
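
    The cross-validation idea mentioned on this slide can be sketched as a k-fold split: each fold in turn plays the role of the "new experiment" on which a model trained on the other folds is checked. The helper below is a hypothetical minimal version written for illustration (no shuffling, contiguous folds), not the API of any particular library.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Folds are contiguous and differ in size by at most one sample.
    """
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


FOLDS = list(k_fold_indices(10, 3))
for train, test in FOLDS:
    print("train:", train, "test:", test)
```

    Every sample is held out exactly once, so each prediction used to judge the model is made on data the model did not see during training.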
