Big Data and Predictive Modeling

Web We Can 2015

Olivier Grisel

March 21, 2015

Transcript

  1. Big Data and Predictive Modeling. Olivier Grisel (@ogrisel), Web We Can, March 21, 2015
  2. About me

  3. Big Data as a buzzword

  4. Triumph of the Nerds: Nate Silver Wins in 50 States

    http://mashable.com/2012/11/07/nate-silver-wins/
  6. Nate Silver’s election model, Big Data?
    $ git clone gh:jseabold/538model
    $ du -h 538model/data
    188K 538model/data
  7. 15% of the capacity of a 3.5-inch floppy disk

  8. Regionator 3000 http://labs.data-publica.com/regionator3000/

  9. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

  11. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

  13. 120% of the capacity of a 3.5-inch floppy disk

  14. Big Data ≠ Predictive Analytics

  15. Predictive Analytics ≠ Descriptive Analytics

  16. Goals of this talk
    • What Big Data actually is or isn’t
    • Introduce predictive modeling concepts
    • Contrast predictive analytics vs descriptive analytics
  17. How big is Big Data?

  18. “Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” (Wikipedia)
  19. Not Big Data
    • Data that fits on a spreadsheet
    • Data that can be analyzed in RAM (< 10 GB)
    • Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
  20. Reading the full content of a 1 TB HDD at 100 MB/s: about 2 hours 45 minutes
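The figure above is plain arithmetic, and a rough sketch makes it easy to redo for other sizes. The constants assume decimal units (1 TB = 10^12 bytes) and a sustained 100 MB/s; the function names are made up for this illustration.

```python
TB = 10 ** 12  # decimal terabyte (assumption: vendor-style units)
MB = 10 ** 6   # decimal megabyte


def sequential_read_seconds(size_bytes, throughput_bytes_per_s):
    """Seconds needed to stream size_bytes once at a sustained throughput."""
    return size_bytes / throughput_bytes_per_s


def hours_and_minutes(seconds):
    """Split a duration into whole hours and remaining whole minutes."""
    return divmod(int(seconds // 60), 60)


# One 1 TB disk at 100 MB/s.
disk_seconds = sequential_read_seconds(1 * TB, 100 * MB)
# The 100 TB web crawl mentioned on the next slide, at the same throughput.
web_days = sequential_read_seconds(100 * TB, 100 * MB) / 86_400

print(hours_and_minutes(disk_seconds))
print(round(web_days, 1))
```

With these assumptions the disk takes 10,000 seconds (2 h 46 min) and the 100 TB crawl takes about 11.6 days on a single sequential reader.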
  21. Canonical Big Data problem: indexing the Web
    • Inverted index on terabytes of text data
    • Process each HTML page as a URL + bag of words
    • For each word, aggregate the list of page URLs
    • 2 billion HTML pages: 100 TB, more than 10 days just to read sequentially at 100 MB/s
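The bullets above can be sketched on a single machine as follows. This is only a toy, in-memory illustration of the inverted index idea, not how a web-scale indexer is built; the tokenizer is deliberately crude and the example URLs and page contents are invented.

```python
import re
from collections import defaultdict


def bag_of_words(html):
    """Crude tokenizer: drop tags, lowercase, keep distinct alphanumeric words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_inverted_index(pages):
    """Map each word to the sorted list of URLs of the pages that contain it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in bag_of_words(html):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical toy corpus (URLs and contents are made up).
pages = {
    "http://example.com/a": "<p>Big data is big</p>",
    "http://example.com/b": "<p>Small data fits in RAM</p>",
}
index = build_inverted_index(pages)
print(index["data"])
```

At the scale described on the slide, the same "emit (word, URL) pairs, then group by word" logic has to be distributed over many machines because no single disk can be read fast enough.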
  25. Other Big Data examples
    • GSM location event log from a telco
    • Transaction log of a big retail network
    • Raw traffic data on a large website
    • Activity records for a service with tens of millions of users
  26. Not Big Data
    • Poll data (~10K data points)
    • Census data (~10M data points)
    • Real estate transaction data (~10M data points)
    • Any dataset publicly available for download
  27. What are Predictive Analytics and Machine Learning?

  28. Type       # rooms  Surface m2  Floor  Public Transports  Sold
      Apartment  3        65          2      Yes                300k
      House      5        110         NA     No                 1.5M
      Duplex     4        95          4      Yes                2.2M
  29. Type       # rooms  Surface m2  Floor  Public Transports  Sold
      Apartment  3        65          2      Yes                300k
      House      5        110         NA     No                 1.5M
      Duplex     4        95          4      Yes                2.2M
      (rows: samples; columns Type to Public Transports: features; column Sold: target)
  30. Type       # rooms  Surface m2  Floor  Public Transports  Sold
      Apartment  3        65          2      Yes                300k
      House      5        110         NA     No                 1.5M
      Duplex     4        95          4      Yes                2.2M
      Apartment  2        35          3      Yes                ?
      (rows: samples; columns Type to Public Transports: features; column Sold: target)
  31. Predictive Modeling
    • Automated predictions of outcomes on new data
    • Alternative to hard-coded rules written by experts
    • Extract the structure of historical data
    • Statistical tools to summarize the training data into an executable predictive model
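To make "summarize the training data into an executable predictive model" concrete, here is a minimal sketch on the toy real estate table from the previous slides. The talk does not say which algorithm to use; this sketch picks 1-nearest-neighbor for brevity, and the feature encoding (dropping Type and Floor, Yes/No as 1/0, no feature scaling) is an assumption of the example.

```python
import math

# (rooms, surface_m2, public_transport) -> sold price, taken from the toy table.
# The encoding is an assumption of this sketch, not something the slides specify.
TRAINING = [
    ((3, 65, 1), 300_000),     # Apartment
    ((5, 110, 0), 1_500_000),  # House
    ((4, 95, 1), 2_200_000),   # Duplex
]


def predict_price(features, training=TRAINING):
    """Return the price of the closest training sample (1-nearest-neighbor)."""
    _, price = min(training, key=lambda sample: math.dist(sample[0], features))
    return price


# The new sample with an unknown target: 2 rooms, 35 m2, public transport nearby.
print(predict_price((2, 35, 1)))
```

The closest known sample is the 65 m2 apartment, so the sketch predicts 300k. With only three samples and unscaled features this is a toy, but the shape is the point: historical samples go in, a callable that answers for new data comes out.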
  32. Predictive Modeling Data Flow
    Training: text docs, images, sounds, transactions → feature vectors + labels → machine learning algorithm → model
  33. Predictive Modeling Data Flow
    Training: text docs, images, sounds, transactions → feature vectors + labels → machine learning algorithm → model
    Prediction: new text doc, image, sound, transaction → feature vector → model → expected label
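The data flow in this diagram can be sketched end to end with a toy text classifier. Everything below is illustrative: the spam/ham documents and labels are invented, and the scoring rule (count how often each word co-occurred with each label) is a deliberately naive stand-in for a real machine learning algorithm.

```python
from collections import Counter


def extract_features(text):
    """Raw document -> bag-of-words feature vector."""
    return Counter(text.lower().split())


class WordCountModel:
    """Toy classifier: score a label by how often its training words reappear."""

    def fit(self, feature_vectors, labels):
        # Training: accumulate word counts per label.
        self.word_counts = {}
        for features, label in zip(feature_vectors, labels):
            self.word_counts.setdefault(label, Counter()).update(features)
        return self

    def predict(self, features):
        # Prediction: pick the label whose word counts overlap most.
        def score(label):
            counts = self.word_counts[label]
            return sum(counts[word] * n for word, n in features.items())

        return max(self.word_counts, key=score)


# Hypothetical training documents and labels.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

model = WordCountModel().fit([extract_features(d) for d in train_docs], train_labels)
print(model.predict(extract_features("buy cheap pills")))
```

The two halves of the diagram map to `fit` (training documents and labels in, model out) and `predict` (new document's feature vector in, expected label out).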
  34. Descriptive vs Predictive

  35. Descriptive Statistics
    • Ex: Sales by (day, month, year) × region
    • Graphical visualization: get insights, tell a story to explain what’s happening in the data
    • Realm of Business Intelligence: reports & dashboards for managers
    • A wrong decision can be very costly
    • Small number of important decisions made by a human
  36. Predictive Statistics
    • Ex: Movie recommendations or targeted ads
    • Embedded in a user service to make it more useful / attractive / profitable
    • A wrong individual decision is not costly
    • Large number of small automated decisions
    • Humans would not be fast enough to make the predictions
  37. Mixed models
    • Predictive modeling to identify interesting subsets of the data
    • Ex: Fraud detection, churn forecasting
    • Help human decision makers focus on important cases
    • Human expert feedback to improve predictive models
  38. Key takeaway points:

  39. • Big Data ≠ Predictive Analytics
    • Predictive Analytics
      • Automated decision making embedded in products (e.g. recommenders)
      • Individual bad decisions are typically not costly
    • Descriptive Analytics
      • Business Intelligence: human decision making
      • Individual bad decisions can be very costly
  40. Thank you! Questions? @Inria @ogrisel http://scikit-learn.org

  41. Bonus tracks

  42. Back to the Regionator: what if we did not have census data on daily mobility?
  43. Back to the Regionator
    • Use raw daily telco logs
    • Group By (phone, day) to extract daily trips
    • Join By GPS coordinates to “departement” names
    • Filter out small trips
    • Group By (home, work) “departements”
    • Count
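The steps above can be sketched in memory on a handful of fake log events. The record layout `(phone, day, hour, lat, lon)`, the rule "home is the first event of the day, work is the farthest point reached", the 5 km threshold, and the rough departement centroids are all assumptions of this sketch. A real pipeline would run the same group-by and join logic on a distributed engine and join against actual departement boundaries.

```python
import math
from collections import Counter, defaultdict

# Rough, illustrative centroids (lat, lon); a real join would use polygons.
DEPARTEMENTS = {"Paris": (48.86, 2.35), "Yvelines": (48.80, 1.85)}


def distance_km(a, b):
    """Equirectangular approximation, adequate for short distances."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    return 6371 * math.hypot(x, lat2 - lat1)


def to_departement(position):
    """Join step: map GPS coordinates to the nearest departement centroid."""
    return min(DEPARTEMENTS, key=lambda name: distance_km(DEPARTEMENTS[name], position))


def commute_counts(events, min_km=5.0):
    # Group By (phone, day) to collect each phone's positions for the day.
    by_phone_day = defaultdict(list)
    for phone, day, hour, lat, lon in events:
        by_phone_day[(phone, day)].append((hour, (lat, lon)))

    flows = Counter()
    for day_events in by_phone_day.values():
        day_events.sort()  # chronological order
        home = day_events[0][1]
        work = max((pos for _, pos in day_events), key=lambda pos: distance_km(home, pos))
        if distance_km(home, work) < min_km:
            continue  # filter out small trips
        # Group By (home, work) departements and count.
        flows[(to_departement(home), to_departement(work))] += 1
    return flows


# Hypothetical events: two long commutes and one short local trip.
events = [
    ("p1", "2014-06-02", 7, 48.80, 1.85),
    ("p1", "2014-06-02", 12, 48.86, 2.35),
    ("p2", "2014-06-02", 8, 48.80, 1.86),
    ("p2", "2014-06-02", 13, 48.86, 2.34),
    ("p3", "2014-06-02", 8, 48.86, 2.35),
    ("p3", "2014-06-02", 12, 48.86, 2.36),
]
flows = commute_counts(events)
print(dict(flows))
```

The output is a small table of (home, work) pairs with counts, which is the kind of aggregate the Regionator map is drawn from.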
  44. Tools for predictive analytics

  45. SPSS MATLAB

  47. Predictive modeling data flow (diagram labeled “Small data”)
    Training: text docs, images, sounds, transactions → feature vectors + labels → machine learning algorithm → model
    Prediction: new text doc, image, sound, transaction → feature vector → model → expected label
  48. Same data flow (diagram labeled “Small / Medium data”)
  49. Same data flow (diagram labeled “Small / Medium data with”, tool names not transcribed)
  53. Predictive Analytics on Big Data

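  54. Predictive modeling data flow (diagram labeled “Big data with”, tool names not transcribed)
    Training: text docs, images, sounds, transactions → feature vectors + labels → machine learning algorithm → model
    Prediction: new text doc, image, sound, transaction → feature vector → model → expected label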
  58. BIG DATA

  59. BIG DATA small(er) data

  61. From Big to Small
    • Feature extraction often shrinks data
    • Filter / Join / Group By / Count
    • Machine Learning performed on aggregates
    • Sampling for fast in-memory iterative modeling
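For the sampling bullet, one common way to get a fixed-size uniform sample out of a dataset too large to hold in memory is reservoir sampling. The talk does not name a sampling method; this is one possible choice, shown as a minimal sketch with a made-up stream standing in for the big dataset.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a large log read sequentially: one pass, constant memory.
sample = reservoir_sample(range(1_000_000), k=1000)
print(len(sample))
```

The sample fits in RAM, so models can be trained and retrained on it quickly while the full data stays on disk.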
  62. Data size and modeling quality

  63. “We don’t have better algorithms. We just have more data.” (Peter Norvig, Research Director, Google)
  67. More data beats better models?

  68. http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html

  69. Let’s train a parametric model to read handwritten digits from gray level pixels.
  71. model stops improving

  73. Bias vs Variance

  74. high bias

  75. high bias, high variance

  76. high bias, high variance, low variance

  77. Variance solution #1: collect more samples

  78. Let’s train a non-parametric model to read handwritten digits from gray level pixels.
  80. high variance, almost no bias, variance decreasing with # samples

  81. Bias solution #1: non-parametric models
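The contrast between the two learning curves can be reproduced on a much smaller problem than handwritten digits. The sketch below swaps in a 1D toy task (noisy samples of sin(x)), a least-squares line as the parametric model, and 1-nearest-neighbor as the non-parametric one. All of these choices are assumptions for illustration and not the models used in the talk.

```python
import math
import random


def make_data(n, rng):
    """Noisy samples of y = sin(x) on [0, 2*pi]."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.1)) for x in xs]


def fit_line(train):
    """Parametric model: least-squares line y = a * x + b (two parameters)."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    b = my - a * mx
    return lambda x: a * x + b


def fit_nearest_neighbor(train):
    """Non-parametric model: return the target of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(model, test):
    """Mean squared error of a model on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)


rng = random.Random(0)
test = make_data(500, rng)
errors = {}
for n in (10, 100, 2000):
    train = make_data(n, rng)
    errors[n] = (mse(fit_line(train), test), mse(fit_nearest_neighbor(train), test))
    print(n, errors[n])
```

The line cannot represent a sine wave, so its test error should plateau well above the noise level however many samples it sees (high bias). The nearest-neighbor error should keep shrinking toward the noise floor as samples are added (variance decreasing with the number of samples).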

  82. Type    # rooms  Surface (m2)  Floor  Public Transp.
      Apart.  3        65            2      Yes
      House   5        110           NA     No
      Duplex  4        95            4      Yes
      (rows: samples; columns: features)
  83. Type    # rooms  Surface (m2)  Floor  Public Transp.  School (km)  Flood plain
      Apart.  3        65            2      Yes             1.0          No
      House   5        110           NA     No              25.0         Yes
      Duplex  4        95            4      Yes             0.5          No
      (rows: samples; columns: features)
  84. Bias solution #2: richer features

  85. Data has 2 dimensions: # samples and # features

  87. • Parametric, e.g. linear models (traditional stats), vs non-parametric, e.g. Random Forests, Neural Networks (Machine Learning)
    • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
    • Simple models, e.g. F = m a, F = -G m1 m2 / r^2, will not become false(r) because of big data
    • New problems can be tackled: computer vision, speech recognition, natural language understanding
  88. • The (experimental) scientific method, as formalized by Karl Popper, is based on the falsifiability of formulated hypotheses
    • A theory is held correct as long as its past predictions hold in new experiments
    • Machine learning train-validation-test splits and cross-validation are similar in spirit
    • An ML model is just a complex theory: correct as long as its predictions still hold
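The cross-validation idea can be sketched in a few lines: hold out part of the data, fit on the rest, and check whether the model's predictions survive on samples it has never seen. The helper names and the trivial "predict the mean" model below are assumptions made for this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(fit, score, xs, ys, k=5):
    """Fit on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return scores


def fit_mean(xs, ys):
    """Hypothetical baseline model: always predict the mean training target."""
    return sum(ys) / len(ys)


def mean_squared_error(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)


xs = list(range(20))
ys = [2 * x for x in xs]
scores = cross_validate(fit_mean, mean_squared_error, xs, ys, k=5)
print(len(scores))
```

Each held-out fold plays the role of a new experiment: a model whose error on unseen folds is much worse than on its training data has been falsified in the sense of the slide.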