Predictive Analytics

Presentation given at USI 2014, Paris. Contrasts Big Data vs Analytics and explores how the two interact.

Video of the talk in French: https://www.youtube.com/watch?v=R8QLyBXlEYg

Olivier Grisel

June 16, 2014

Transcript

  1. Predictive Analytics Olivier Grisel — @ogrisel — June 16-17 2014

  2. About me

  3. Big Data as a buzzword

  4. Triumph of the Nerds: Nate Silver Wins in 50 States

    http://mashable.com/2012/11/07/nate-silver-wins/
  6. Nate Silver’s election model, Big Data?

    $ git clone gh:jseabold/538model
    $ du -h 538model/data
    188K 538model/data
  7. 15% of the capacity of a 3.5-inch floppy disk

  8. Regionator 3000 http://labs.data-publica.com/regionator3000/

  9. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

  11. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

  13. 120% of the capacity of a 3.5-inch floppy disk

  14. Big Data ≠ Predictive Analytics

  15. Goals of this talk

    • What Big Data actually is or isn’t
    • Introduce predictive analytics concepts & tools
    • Study the impact of data size on analytics
  16. How big is Big Data?

  17. “Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” – Wikipedia
  18. Not Big Data

    • Data that fits in a spreadsheet
    • Data that can be analyzed in RAM (< 100 GB)
    • Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
  19. Reading the full content of a 1 TB HDD at 100 MB/s: 2 hours 45 minutes
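
The read-time figure on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not specify; the function name is hypothetical.

```python
def sequential_read_time(size_bytes, throughput_bytes_per_s):
    """Return (hours, minutes, seconds) needed to read size_bytes sequentially."""
    total_seconds = size_bytes // throughput_bytes_per_s
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds

TB = 10 ** 12  # assumption: decimal terabyte, as used by disk vendors
MB = 10 ** 6   # assumption: decimal megabyte

hdd_read_time = sequential_read_time(1 * TB, 100 * MB)
print(hdd_read_time)  # (2, 46, 40)
```

Under these assumptions the result is 10,000 seconds, which the slide rounds to 2 hours 45 minutes.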
  20. Canonical Big Data problem: indexing the Web

    • Inverted index on terabytes of text data
    • Process each HTML page as a URL + bag of words
    • For each word, aggregate the list of page URLs
    • 2 billion HTML pages: 100 TB, > 10 days just to read sequentially
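
The indexing steps listed on this slide can be sketched on a single machine as follows. This is only a toy illustration: the corpus, the URLs and the function name are hypothetical, tokenization is a plain whitespace split, and a real web-scale index would be built with a distributed system rather than an in-memory dictionary.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the sorted list of URLs whose page contains it.

    `pages` is an iterable of (url, text) pairs; each text is treated as a
    bag of words (lowercased, split on whitespace).
    """
    index = defaultdict(set)
    for url, text in pages:
        for word in set(text.lower().split()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}

# Hypothetical toy corpus standing in for crawled HTML pages.
pages = [
    ("http://a.example/", "big data is big"),
    ("http://b.example/", "predictive analytics on small data"),
]
index = build_inverted_index(pages)
print(index["data"])  # ['http://a.example/', 'http://b.example/']
```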
  23. Non-traditional architectures

    • Hadoop: HDFS / MapReduce, Pig, Hive
    • Sharded, replicated NoSQL: BigTable, DynamoDB, Cassandra, HBase, ElasticSearch
    • Distributed event stream processing: Kafka, Storm
    • Next-gen cluster processing / distributed analytical DB: YARN / Tez, Spark, Impala, PrestoDB, Redshift…
  24. How heavy is a copy of the web?

  25. 1 TB ≈ 1 kg when stored on a Hadoop node

  26. All HTML pages ≈ 100 TB ≈ 100 kg

  28. All the web ≈ 10 PB ≈ 10 tonnes

  29. Other Big Data examples

    • GSM location event log from a telco
    • Transaction log of a big retail network
    • Raw traffic data on a large website or app
    • Intra-day tick data from a stock exchange
  30. Not Big Data

    • Polls data (~10K data points)
    • Census data (~10M data points)
    • Real estate transactions data (~10M data points)
    • Open / High / Low / Close (OHLC) stock prices (~10K data points)
    • Any dataset publicly available for download
  31. What is Predictive Analytics (and Machine Learning)?

  32. What is Predictive Analytics?

    • Make predictions of outcome on new data
    • Alternative to hard-coded rules written by experts
    • Extract the structure of historical data
    • Statistical tools to summarize the training data into an executable predictive model
  33. Type       # rooms  Surface (m2)  Floor  Public transport
      Apartment  3        65            2      Yes
      House      5        110           NA     No
      Duplex     4        95            4      Yes

  34. Same table, with the columns marked as features and the rows as samples.

  35. Same table, with an added target column:

      Type       # rooms  Surface (m2)  Floor  Public transport  Sold
      Apartment  3        65            2      Yes               300k
      House      5        110           NA     No                1.5M
      Duplex     4        95            4      Yes               2.2M

  36. Same table, with a new sample appended: Apartment, 2 rooms, 35 m2, floor 3, public transport: Yes.

  37. Same table; the target of the new sample is unknown:

      Type       # rooms  Surface (m2)  Floor  Public transport  Sold
      Apartment  3        65            2      Yes               300k
      House      5        110           NA     No                1.5M
      Duplex     4        95            4      Yes               2.2M
      Apartment  2        35            3      Yes               ?
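
To make the features / samples / target vocabulary concrete, here is a minimal sketch that fills in the "?" for the new sample. The slides do not name an algorithm at this point; a 1-nearest-neighbor rule is swapped in purely for illustration, and using only the two numeric features (rooms and surface) is an assumption made to keep the example short. With three training samples the prediction itself is of course meaningless.

```python
import math

# Training samples from the slide: one dict per sample, one key per feature.
# Assumption: only "rooms" and "surface_m2" are used; the other features
# ("Floor", "Public transport") are left out of this sketch.
samples = [
    {"type": "Apartment", "rooms": 3, "surface_m2": 65},
    {"type": "House", "rooms": 5, "surface_m2": 110},
    {"type": "Duplex", "rooms": 4, "surface_m2": 95},
]
targets = ["300k", "1.5M", "2.2M"]  # "Sold" column, kept as written on the slide

def to_feature_vector(sample):
    """Turn one sample into a numeric feature vector."""
    return (sample["rooms"], sample["surface_m2"])

def predict_nearest(new_sample, samples, targets):
    """Return the target of the training sample closest to new_sample."""
    new_vec = to_feature_vector(new_sample)
    distances = [math.dist(new_vec, to_feature_vector(s)) for s in samples]
    return targets[distances.index(min(distances))]

new_sample = {"type": "Apartment", "rooms": 2, "surface_m2": 35}
prediction = predict_nearest(new_sample, samples, targets)
print(prediction)  # 300k
```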
  38. Applications in Business

    • Forecast sales, customer churn, traffic, prices
    • Predict CTR and optimal bid price for online ads
    • Build computer vision systems for robots in industry and agriculture
    • Detect network anomalies, fraud and spam
    • Recommend products, movies, music
  39. Applications in Science

    • Decode the activity of the brain recorded via fMRI / EEG / MEG
    • Decode gene expression data to model regulatory networks
    • Predict the distance of each star in the sky
    • Identify the Higgs boson in proton-proton collisions
  40. Predictive Modeling Data Flow (diagram): training text docs, images, sounds, transactions

  41. Predictive Modeling Data Flow (diagram): training text docs, images, sounds, transactions; labels

  42. Predictive Modeling Data Flow (diagram): training text docs, images, sounds, transactions → feature vectors; feature vectors + labels → machine learning algorithm

  43. Predictive Modeling Data Flow (diagram): training text docs, images, sounds, transactions → feature vectors; feature vectors + labels → machine learning algorithm → model

  44. Predictive Modeling Data Flow (diagram): training text docs, images, sounds, transactions → feature vectors; feature vectors + labels → machine learning algorithm → model; new text doc, image, sound, transaction → feature vector → model → expected label
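
The data flow in these diagrams (raw items → feature vectors, feature vectors + labels → algorithm → model, new item → feature vector → model → expected label) can be sketched end to end for text documents. Everything below is an illustrative assumption rather than the method shown in the talk: the bag-of-words features, the toy "sum word counts per label" learner, and the example documents are all hypothetical.

```python
from collections import Counter, defaultdict

def extract_features(text):
    """Turn a raw text document into a bag-of-words feature vector."""
    return Counter(text.lower().split())

def train(documents, labels):
    """Summarize the training data as one aggregated word count per label."""
    model = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        model[label].update(extract_features(doc))
    return model

def predict(model, document):
    """Return the label whose word counts overlap most with the document."""
    features = extract_features(document)

    def score(label):
        return sum(count * model[label][word] for word, count in features.items())

    return max(model, key=score)

# Hypothetical training set: text documents with their labels.
train_docs = ["cheap pills buy now", "meeting agenda for monday", "buy cheap watches"]
train_labels = ["spam", "ham", "spam"]

model = train(train_docs, train_labels)
expected_label = predict(model, "buy pills now")
print(expected_label)  # spam
```

The same three steps (extract features, fit on labeled history, predict on a new feature vector) apply to images, sounds or transactions; only the feature extraction changes.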
  45. Tools for predictive analytics

  46. SPSS MATLAB

  48. Small data (diagram): training text docs, images, sounds, transactions → feature vectors; feature vectors + labels → machine learning algorithm → model; new text doc, image, sound, transaction → feature vector → model → expected label

  49. Small / Medium data (same diagram)

  50. Small / Medium data with [image] (same diagram)

  51. Small / Medium data with [image] (same diagram)
  54. Predictive Analytics on Big Data

  55. Big data with [image] (diagram): training text docs, images, sounds, transactions → feature vectors; feature vectors + labels → machine learning algorithm → model; new text doc, image, sound, transaction → feature vector → model → expected label

  56. Big data with [image] (same diagram)

  57. Big data with [image] (same diagram)
  59. BIG DATA

  60. BIG DATA small(er) data

  62. From Big to Small

    • Feature extraction often shrinks data
    • Filter / Join / Group By / Count
    • Machine Learning performed on a small aggregate
    • Sampling for fast in-memory iterative modeling
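
One way to implement the sampling step is reservoir sampling, which draws a fixed-size uniform sample in a single pass over data too large to hold in memory. The slides do not say which sampling scheme is meant; this technique is swapped in as an example, and the function name and the fixed seed are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

# Stand-in for a large raw dataset read record by record.
sample = reservoir_sample(range(100_000), k=1000)
print(len(sample))  # 1000
```

The resulting sample fits in RAM, so models can be refit on it quickly while iterating.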
  63. Back to the Regionator: what if we did not have census data on daily mobility?
  64. Back to the Regionator

    • Use raw daily telco logs
    • Group By (phone, day) to extract daily trips
    • Join GPS coordinates to “département” names
    • Filter out small trips
    • Group By (home, work) “départements”
    • Count
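
The pipeline on this slide can be sketched in memory as follows. The record layout, the planar kilometre coordinates, the `departement_of` lookup and the rules "first position of the day is home" and "farthest position is work" are all hypothetical simplifications; real telco logs would need a proper geographic join and a distributed engine.

```python
from collections import Counter, defaultdict
import math

# Hypothetical raw log: (phone_id, day, hour, x_km, y_km).
events = [
    ("p1", "2014-06-02", 7, 0.0, 0.0), ("p1", "2014-06-02", 10, 40.0, 0.0),
    ("p2", "2014-06-02", 8, 1.0, 1.0), ("p2", "2014-06-02", 11, 42.0, 2.0),
    ("p3", "2014-06-02", 8, 2.0, 0.0), ("p3", "2014-06-02", 12, 3.0, 0.0),
]

def departement_of(x_km, y_km):
    """Hypothetical stand-in for the join from coordinates to a département."""
    return "A" if x_km < 20 else "B"

def count_commutes(events, min_trip_km=5.0):
    # Group By (phone, day) to collect each phone's positions for each day.
    by_phone_day = defaultdict(list)
    for phone, day, hour, x, y in events:
        by_phone_day[(phone, day)].append((hour, x, y))

    counts = Counter()
    for positions in by_phone_day.values():
        positions.sort()  # chronological order
        _, hx, hy = positions[0]  # assumption: first position of the day is home
        # Assumption: the farthest position reached that day is the workplace.
        _, wx, wy = max(positions, key=lambda p: math.dist((hx, hy), p[1:]))
        if math.dist((hx, hy), (wx, wy)) < min_trip_km:
            continue  # Filter out small trips
        # Join to département names, then Group By (home, work) and Count.
        counts[(departement_of(hx, hy), departement_of(wx, wy))] += 1
    return counts

commutes = count_commutes(events)
print(commutes)  # Counter({('A', 'B'): 2})
```

The output is a small table of (home, work) counts, far smaller than the raw event log it was derived from.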
  65. Data size and modeling quality

  66. “We don’t have better algorithms. We just have more data.” – Peter Norvig, Research Director, Google
  70. More data beats better models?

  71. http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html

  72. Let’s train a parametric model to read handwritten digits from gray level pixels.
  74. model stops improving

  76. Bias vs Variance

  77. high bias

  78. high bias high variance

  79. high bias high variance low variance

  80. Variance solution #1: collect more samples
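
For reference, the standard bias-variance decomposition of the expected squared error makes the two terms explicit. The notation is introduced here and is not taken from the slides: the data is assumed to follow y = f(x) + ε with zero-mean noise of variance σ², and \hat{f} is the model fitted on a random training set, over which the expectations are taken.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Collecting more samples typically shrinks the variance term, while the bias term depends on the model family and the features and is left largely unchanged.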

  81. Let’s train a non-parametric model to read handwritten digits from gray level pixels.
  83. High variance, almost no bias; variance decreasing with # samples

  84. Bias solution #1: non-parametric models

  85. Type    # rooms  Surface (m2)  Floor  Public Transp.
      Apart.  3        65            2      Yes
      House   5        110           NA     No
      Duplex  4        95            4      Yes

      (columns: features; rows: samples)

  86. Type    # rooms  Surface (m2)  Floor  Public Transp.  School (km)  Flood plain
      Apart.  3        65            2      Yes             1.0          No
      House   5        110           NA     No              25.0         Yes
      Duplex  4        95            4      Yes             0.5          No

      (columns: features; rows: samples)
  87. Bias solution #2: richer features

  88. Data has 2 dimensions: # samples and # features

  89. Key takeaway points:

  90. Key takeaway points

    • Big Data ≠ Predictive Analytics
    • Predictive models are often built from small aggregate data (with sampling), much smaller than the raw data
    • Modeling requires interactive / fast iterations
    • More data generally helps build better models, but not always: noise or an inadequate representation can limit the gain
    • 2 dimensions: # samples & # features
  91. Thank you! Questions? @Inria @ogrisel http://scikit-learn.org

  92. Bonus track

  93. Bonus track

    • Parametric, e.g. linear model (traditional stats) vs non-parametric, e.g. Random Forests, Neural Networks (Machine Learning)
    • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
    • Simple models, e.g. F = m a, F = -G m1 m2 / r^2, will not become false(r) because of big data
    • New problems can be tackled: computer vision, speech recognition, natural language understanding
  94. Bonus track

    • The (experimental) scientific method as formulated by Karl Popper is based on the falsifiability of formulated hypotheses
    • A theory is correct as long as its past predictions hold in new experiments
    • Machine learning train-validation-test splits and cross-validation are similar in spirit
    • An ML model is just a complex theory: correct as long as its predictions still hold
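
As an illustration of the cross-validation idea mentioned here, the sketch below generates k-fold train / validation index splits: the model is fitted on the train indices and its predictions are then checked on samples it has never seen. The function name is hypothetical, and shuffling is omitted for brevity; in practice the samples are usually shuffled first unless their order matters.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_splits(n_samples=10, n_folds=3))
for train, validation in splits:
    print(validation)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

Each sample lands in exactly one validation fold, so every prediction that is scored is a prediction on data the model did not see during fitting.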