Big Data, Predictive Modeling and tools

CCIP 2015

Olivier Grisel

October 08, 2015

Transcript

  1. Big Data, Predictive Modeling & Tools Olivier Grisel — @ogrisel

    CCIP October 2015
  2. About me

  3. Big Data as a buzzword

  4. Regionator 3000 http://labs.data-publica.com/regionator3000/

  5. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

  6. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

  7. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

  8. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

  9. 120% of the capacity of a 3.5-inch floppy disk

  10. Big Data ≠ Predictive Analytics

  11. Outline

    • What Big Data actually is or isn’t
    • Predictive modeling concepts & tools
    • Big Data architecture & tools
  12. How big is Big Data?

  13. Not Big Data

    • Data that fits in a spreadsheet
    • Data that can be analyzed in RAM (< 10 GB)
    • Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
  14. Reading the full content of a 1 TB HDD at 100 MB/s:

    about 2 hours 45 minutes
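The read-time figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput; the function names are illustrative and not taken from the talk.

```python
# Assumption: decimal units, as commonly used for disk capacities.
TB = 10**12  # bytes
MB = 10**6   # bytes


def sequential_read_seconds(size_bytes, throughput_bytes_per_s):
    """Time needed to stream size_bytes once at a constant throughput."""
    return size_bytes / throughput_bytes_per_s


def as_hours_minutes(seconds):
    """Split a duration into whole hours and whole minutes (seconds dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    return hours, remainder // 60


read_seconds = sequential_read_seconds(1 * TB, 100 * MB)
hours, minutes = as_hours_minutes(read_seconds)
print(f"{read_seconds:.0f} s = {hours} h {minutes} min")
```

Under these assumptions the result is 10,000 seconds, which the slide rounds to 2 hours 45 minutes.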
  15. Canonical Big Data problem: indexing the Web

    • For each word, aggregate the list of page URLs
    • 2 billion HTML pages: 100 TB, more than 10 days just to read sequentially
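The "for each word, aggregate the list of page URLs" step describes an inverted index. Here is a minimal single-machine sketch of that aggregation; the example pages, URLs and the naive tokenizer are assumptions made for illustration, and a real web-scale index would run this as a distributed group-by rather than in one process.

```python
import re
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of URLs whose text contains it.

    pages: dict mapping URL -> page text (hypothetical input format).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        # Naive tokenizer (assumption): lowercase runs of letters and digits.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical toy corpus.
pages = {
    "http://example.org/a": "Big data is a buzzword",
    "http://example.org/b": "Predictive modeling needs data",
}
index = build_inverted_index(pages)
print(index["data"])
```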
  16. None
  17. None
  18. Other Big Data examples

    • GSM location event log from a telco
    • Transaction log of a big retail network
    • Activity records for a service with tens of millions of users
  19. What are Predictive Analytics and Machine Learning?

  20. Samples (rows), features (columns) and target (Sold):

    Type       # rooms  Surface (m2)  Floor  Public Transports  Sold (target)
    Apartment  3        65            2      Yes                300k
    House      5        110           NA     No                 1.5M
    Duplex     4        95            4      Yes                2.2M
  21. Samples (rows), features (columns) and target (Sold), with a new sample to predict:

    Type       # rooms  Surface (m2)  Floor  Public Transports  Sold (target)
    Apartment  3        65            2      Yes                300k
    House      5        110           NA     No                 1.5M
    Duplex     4        95            4      Yes                2.2M
    Apartment  2        35            3      Yes                ?
  22. Predictive Modeling

    • Automated predictions of outcome on new data
    • Alternative to hard-coded rules written by experts
    • Extract the structure of historical data
    • Statistical tools to summarize the training data into an executable predictive model
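To make "summarize the training data into an executable predictive model" concrete, here is a deliberately tiny sketch: a one-parameter least-squares fit of price against surface, using the three houses from the earlier table. The pairing of each price with its row follows the order on the slide, and using surface as the only feature is an assumption made to keep the example short; it is not the model used in the talk.

```python
def fit_price_per_m2(surfaces, prices):
    """Least-squares fit of price = w * surface (line through the origin).

    Returns a predict(surface) function: the whole training set is
    summarized by the single learned parameter w.
    """
    w = sum(s * p for s, p in zip(surfaces, prices)) / sum(s * s for s in surfaces)

    def predict(surface):
        return w * surface

    predict.price_per_m2 = w  # exposed for inspection
    return predict


# Training samples from the slide: surface in m2 and sale price.
surfaces = [65, 110, 95]
prices = [300_000, 1_500_000, 2_200_000]

model = fit_price_per_m2(surfaces, prices)
# New sample from the slide: a 35 m2 apartment with unknown price.
predicted_price = model(35)
print(round(predicted_price))
```

With only three samples and one feature the prediction is of course unreliable; the point is only the fit-then-predict workflow.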
  23. Predictive Modeling Data Flow: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model
  24. Predictive Modeling Data Flow: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  25. Predictive Models

    • Ex: Movie recommendations or targeted ads
    • Embedded in a user service to make it more useful / attractive / profitable
    • A wrong individual decision is not costly
    • Large number of small automated decisions
    • Humans would not be fast enough to make the predictions
  26. Tools for predictive analytics

  27. SPSS MATLAB

  28. SPSS MATLAB

  29. Small data: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  30. Small / Medium data: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  31. Small / Medium data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  32. Small / Medium data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  33. None
  34. None
  35. Big data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  36. Big data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  37. Big data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  38. Big data with [image]: training text docs, images, sounds, transactions → feature vectors + labels → Machine Learning Algorithm → Model; new text doc, image, sound, transaction → feature vector → Model → expected label
  39. Big Data architecture(s) & tools

  40. • Distributed Event Log & Queues

    • Distributed Event Stream Processing
    • Distributed Storage
    • Distributed Batch Processing
    • On-line Transaction Processing & App Views
    • Analytical Database
    • Predictive Models
  41. None
  42. None
  43. Thank you! Questions? @Inria @ogrisel http://scikit-learn.org

  44. Bonus tracks

  45. Back to the Regionator

    What if we did not have census data on daily mobility?
  46. Back to the Regionator

    • Use raw daily telco logs
    • Group By (phone, day) to extract daily trips
    • Join By GPS coordinates to “departement” names
    • Filter out small trips
    • Group By (home, work) “departements”
    • Count
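The pipeline above can be sketched on a single machine as follows. Everything concrete here is hypothetical: the event tuple layout, the two centroids used as a stand-in for a real GPS-to-departement join, the planar distance in degrees, the "first position of the day is home, farthest position is work" heuristic and the trip threshold. On real telco logs each step would be a distributed operation.

```python
from collections import Counter, defaultdict
from math import hypot

# Hypothetical centroids (lat, lon); a real join would use departement polygons.
DEPARTEMENT_CENTROIDS = {"Paris": (48.86, 2.35), "Yvelines": (48.80, 1.85)}


def distance(a, b):
    """Crude planar distance in degrees (assumption: good enough for a toy)."""
    return hypot(a[0] - b[0], a[1] - b[1])


def departement_of(position):
    """Toy join: name of the nearest centroid."""
    return min(DEPARTEMENT_CENTROIDS,
               key=lambda name: distance(position, DEPARTEMENT_CENTROIDS[name]))


def count_commutes(events, min_trip=0.1):
    """events: iterable of (phone, day, hour, lat, lon) tuples."""
    # Group By (phone, day) to extract daily trips.
    by_phone_day = defaultdict(list)
    for phone, day, hour, lat, lon in events:
        by_phone_day[(phone, day)].append((hour, (lat, lon)))

    counts = Counter()
    for observations in by_phone_day.values():
        observations.sort()  # chronological order within the day
        home = observations[0][1]
        work = max((pos for _, pos in observations),
                   key=lambda pos: distance(home, pos))
        # Join By GPS coordinates to departement names.
        home_dep, work_dep = departement_of(home), departement_of(work)
        # Filter out small trips.
        if distance(home, work) < min_trip:
            continue
        # Group By (home, work) departements and Count.
        counts[(home_dep, work_dep)] += 1
    return counts


events = [
    ("p1", "2015-10-05", 7, 48.80, 1.86),
    ("p1", "2015-10-05", 12, 48.86, 2.34),
    ("p1", "2015-10-05", 20, 48.80, 1.86),
    ("p2", "2015-10-05", 8, 48.86, 2.35),
    ("p2", "2015-10-05", 13, 48.87, 2.36),  # small trip, filtered out
    ("p1", "2015-10-06", 7, 48.80, 1.86),
    ("p1", "2015-10-06", 12, 48.86, 2.34),
]
commute_counts = count_commutes(events)
print(dict(commute_counts))
```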
  47. Predictive Analytics on Big Data

  48. None
  49. BIG DATA

  50. BIG DATA small(er) data

  51. BIG DATA small(er) data

  52. From Big to Small

    • Feature extraction often shrinks data
    • Filter / Join / Group By / Count
    • Machine Learning performed on aggregates
    • Sampling for fast in-memory iterative modeling
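One common way to get a uniform sample out of a dataset too large to load is reservoir sampling, which needs a single sequential pass and memory for only k items. The talk does not say which sampling scheme is meant, so the sketch below (the classic "Algorithm R") is one possible choice, not the author's method.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# The range stands in for a large on-disk log read record by record.
sample = reservoir_sample(range(100_000), k=1000)
print(len(sample))
```

The resulting sample fits in RAM, so models can be iterated on quickly before scaling up.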
  53. Data size and modeling quality

  54. Peter Norvig, Research Director, Google:

    “We don’t have better algorithms. We just have more data.”
  55. None
  56. None
  57. None
  58. More data beats better models?

  59. http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html

  60. Let’s train a parametric model to read handwritten digits from gray-level pixels.
  61. None
  62. model stops improving

  63. None
  64. Bias vs Variance

  65. high bias

  66. high bias / high variance

  67. high bias / high variance / low variance

  68. Variance solution #1: collect more samples

  69. Let’s train a non-parametric model to read handwritten digits from gray-level pixels.
  70. None
  71. high variance / almost no bias / variance decreasing with # samples

  72. Bias solution #1: non-parametric models

  73. Samples (rows) and features (columns):

    Type    # rooms  Surface (m2)  Floor  Public Transp.
    Apart.  3        65            2      Yes
    House   5        110           NA     No
    Duplex  4        95            4      Yes
  74. Samples (rows) and features (columns), with two more features:

    Type    # rooms  Surface (m2)  Floor  Public Transp.  School (km)  Flood plain
    Apart.  3        65            2      Yes             1.0          No
    House   5        110           NA     No              25.0         Yes
    Duplex  4        95            4      Yes             0.5          No
  75. Data has 2 dimensions: # samples and # features

  76. Bias solution #2: richer features

  77. None
  78. • Parametric, e.g. linear models (traditional stats), vs non-parametric, e.g. Random Forests, Neural Networks (Machine Learning)

    • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
    • Simple models, e.g. F = m a or F = - G m1 m2 / r^2, will not become false(r) because of big data
    • New problems can be tackled: computer vision, speech recognition, natural language understanding
  79. • The (experimental) scientific method, as formulated by Karl Popper, is based on the falsifiability of formulated hypotheses

    • A theory is held to be correct as long as its past predictions hold in new experiments
    • Machine learning train-validation-test splits and cross-validation are similar in spirit
    • An ML model is just a complex theory: correct as long as its predictions still hold
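The cross-validation idea mentioned above can be sketched in plain Python: the data is split into k folds, and each fold in turn is held out as the "new experiment" against which a model fitted on the rest is checked. The helper names, the toy data and the trivial mean predictor are assumptions for illustration; in practice samples would usually be shuffled first, and a library routine would be used instead.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def fit_mean(xs, ys):
    """Trivial baseline model: always predict the mean training target."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def cross_val_errors(xs, ys, fit, n_folds):
    """Mean absolute error of the fitted model on each held-out fold."""
    errors = []
    for train, test in k_fold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(sum(abs(model(xs[i]) - ys[i]) for i in test) / len(test))
    return errors


# Hypothetical toy data: y = 2 * x.
xs = list(range(9))
ys = [2 * x for x in xs]
errors = cross_val_errors(xs, ys, fit_mean, n_folds=3)
folds = list(k_fold_indices(10, 3))
print(errors)
```

A model whose held-out errors stay low across folds has, in Popper's terms, survived several attempts at falsification.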