
Predictive Analytics


Presentation given at USI 2014, Paris. Contrasts Big Data vs Analytics and explores how the two interact.

Video of the talk in French: https://www.youtube.com/watch?v=R8QLyBXlEYg

Olivier Grisel

June 16, 2014


Transcript

  1. Predictive Analytics
    Olivier Grisel — @ogrisel — June 16-17 2014


  2. Big Data as a
    buzzword


  3. Triumph of the Nerds: Nate
    Silver Wins in 50 States
    http://mashable.com/2012/11/07/nate-silver-wins/


  5. Nate Silver’s election model,
    Big Data?
    $ git clone gh:jseabold/538model
    $ du -h 538model/data
    188K 538model/data


  6. 15% of the capacity of a 3.5-inch floppy disk


  7. Regionator 3000
    http://labs.data-publica.com/regionator3000/


  8. http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/


  10. http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10


  12. 120% of the capacity of a 3.5-inch floppy disk


  13. Big Data

    Predictive Analytics


  14. Goals of this talk
    • What Big Data actually is or isn’t
    • Introduce predictive analytics concepts & tools
    • Study the impact of data size on analytics


  15. How big is Big Data?


  16. “Big data is a blanket term for any collection of
    data sets so large and complex that it
    becomes difficult to process using on-hand
    database management tools or traditional data
    processing applications.”
    – Wikipedia


  17. Not Big Data
    • Data that fits on a spreadsheet
    • Data that can be analyzed in RAM (< 100 GB)
    • Data operations that can be performed quickly
    by a traditional database, e.g. single node
    PostgreSQL server


  18. Reading the full content of a 1TB
    HDD at 100MB/s:
    2 hours 45 minutes
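
A quick back-of-the-envelope check of that figure, as a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant throughput; the helper name `sequential_read_seconds` is made up for this example.

```python
from datetime import timedelta

TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte


def sequential_read_seconds(size_bytes, throughput_bytes_per_s):
    """Time needed to read size_bytes once at a constant throughput."""
    return size_bytes / throughput_bytes_per_s


one_tb = sequential_read_seconds(1 * TB, 100 * MB)
print(timedelta(seconds=one_tb))  # 2:46:40, i.e. roughly 2 hours 45 minutes

# The same arithmetic for a 100 TB collection on a single disk.
hundred_tb_days = sequential_read_seconds(100 * TB, 100 * MB) / 86400
print(round(hundred_tb_days, 1))  # 11.6 (days)
```

A single disk is therefore the bottleneck long before any computation starts, which is the usual argument for spreading reads over many nodes.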


  19. Canonical Big Data problem:
    indexing the Web
    • Inverted index on terabytes of text data
    • Process each HTML page as a URL + bag of
    words
    • For each word, aggregate the list of page URLs
    • 2 billion HTML pages: 100TB
    >10 days just to read sequentially
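
The indexing steps listed above can be sketched on a single machine as follows. This is only an in-memory illustration with a made-up toy corpus (the URLs and texts are hypothetical); at web scale the same per-page and per-word steps would be distributed across a cluster.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of URLs of the pages containing it.

    pages: iterable of (url, text) pairs.
    """
    index = defaultdict(set)
    for url, text in pages:
        # Bag of words: lowercase, split on whitespace, ignore word order.
        for word in set(text.lower().split()):
            index[word].add(url)
    # For each word, aggregate the list of page URLs.
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical toy corpus standing in for crawled HTML pages.
pages = [
    ("http://a.example/", "big data is a buzzword"),
    ("http://b.example/", "predictive analytics needs data"),
]
index = build_inverted_index(pages)
print(index["data"])      # ['http://a.example/', 'http://b.example/']
print(index["buzzword"])  # ['http://a.example/']
```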


  20. Non-traditional architectures
    • Hadoop: HDFS / MapReduce, Pig, Hive
    • Sharded, replicated NoSQL:
    BigTable, DynamoDB, Cassandra, HBase, ElasticSearch
    • Distributed event stream processing
    Kafka, Storm
    • Next gen cluster processing / distributed analytical DB
    YARN / Tez, Spark, Impala, PrestoDB, Redshift…


  21. How heavy is a copy
    of the web?


  22. 1 TB ≈ 1 kg
    when stored on a Hadoop node


  23. All HTML pages ≈ 100 TB
    ≈ 100 kg


  24. All the web ≈ 10 PB
    ≈ 10 tonnes


  25. Other Big Data examples
    • GSM location event log from telco
    • Transaction log of a big retail network
    • Raw traffic data on a large website or app
    • Intra-day tick data from a stock exchange


  26. Not Big Data
    • Poll data (~10K data points)
    • Census data (~10M data points)
    • Real estate transaction data (~10M data points)
    • Open / High / Low / Close (OHLC) stock prices
    (~10K data points)
    • Any dataset publicly available for download


  27. What is Predictive
    Analytics?
    and Machine Learning


  28. • Predict outcomes on new data
    • Alternative to hard-coded rules written by
    experts
    • Extract the structure of historical data
    • Statistical tools to summarize the training data
    into an executable predictive model


  29. Type       # rooms  Surface (m2)  Floor  Public transport
      Apartment  3        65            2      Yes
      House      5        110           NA     No
      Duplex     4        95            4      Yes


  30. Type       # rooms  Surface (m2)  Floor  Public transport
      Apartment  3        65            2      Yes
      House      5        110           NA     No
      Duplex     4        95            4      Yes
      (rows: samples; columns: features)


  31. Type       # rooms  Surface (m2)  Floor  Public transport  Sold
      Apartment  3        65            2      Yes               300k
      House      5        110           NA     No                1.5M
      Duplex     4        95            4      Yes               2.2M
      (rows: samples; first five columns: features; Sold: target)


  32. Type       # rooms  Surface (m2)  Floor  Public transport  Sold
      Apartment  3        65            2      Yes               300k
      House      5        110           NA     No                1.5M
      Duplex     4        95            4      Yes               2.2M
      (rows: samples; first five columns: features; Sold: target)

      Apartment  2        35            3      Yes


  33. Type       # rooms  Surface (m2)  Floor  Public transport  Sold
      Apartment  3        65            2      Yes               300k
      House      5        110           NA     No                1.5M
      Duplex     4        95            4      Yes               2.2M
      (rows: samples; first five columns: features; Sold: target)

      Apartment  2        35            3      Yes               ?
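
To make the samples / features / target vocabulary concrete, here is a minimal sketch that encodes the three rows of the table as feature vectors and fills in the "?" with a 1-nearest-neighbor lookup. The choice of nearest neighbor, the encoding of the "NA" floor as 0, and all function names are assumptions of this example, not something the slides prescribe; three samples are of course far too few for a real model.

```python
# Training samples from the table:
# (type, rooms, surface_m2, floor, public_transport)
# Assumption of this sketch: the "NA" floor of the house is encoded as 0.
samples = [
    ("Apartment", 3, 65, 2, True),
    ("House", 5, 110, 0, False),
    ("Duplex", 4, 95, 4, True),
]
targets = [300_000, 1_500_000, 2_200_000]  # the "Sold" column

TYPES = ["Apartment", "House", "Duplex"]


def to_feature_vector(sample):
    """Encode one row as numbers: one-hot type followed by numeric columns."""
    kind, rooms, surface, floor, transport = sample
    one_hot = [1.0 if kind == t else 0.0 for t in TYPES]
    return one_hot + [float(rooms), float(surface), float(floor), float(transport)]


def predict_nearest(new_sample):
    """Return the target of the closest training sample (1-nearest neighbor)."""
    x = to_feature_vector(new_sample)

    def squared_distance(sample):
        return sum((a - b) ** 2 for a, b in zip(x, to_feature_vector(sample)))

    best = min(range(len(samples)), key=lambda i: squared_distance(samples[i]))
    return targets[best]


print(predict_nearest(("Apartment", 2, 35, 3, True)))  # 300000
```

Because the surface column has the largest numeric range, it dominates the distance here; a real pipeline would normally rescale the features first.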


  34. Applications in Business
    • Forecast sales, customer churn, traffic, prices
    • Predict CTR and optimal bid price for online ads
    • Build computer vision systems for robots in
    industry and agriculture
    • Detect network anomalies, fraud and spam
    • Recommend products, movies, music


  35. Applications in Science
    • Decode the activity of the brain recorded via
    fMRI / EEG / MEG
    • Decode gene expression data to model
    regulatory networks
    • Predict the distance of each star in the sky
    • Identify the Higgs boson in proton-proton
    collisions


  36. Predictive Modeling Data Flow
    Training: text docs, images, sounds, transactions


  37. Predictive Modeling Data Flow
    Training: text docs, images, sounds, transactions
    Labels


  38. Predictive Modeling Data Flow
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm


  39. Predictive Modeling Data Flow
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model


  40. Predictive Modeling Data Flow
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New (text doc, image, sound, transaction) → Feature vector → Model → Expected Label
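
The data flow of this diagram can be sketched end to end on text documents. Everything below is illustrative: the documents, the labels and the deliberately naive word-overlap "algorithm" are invented for this example and are not taken from the talk.

```python
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extraction: turn a raw text document into a bag-of-words vector."""
    return Counter(text.lower().split())


def train(documents, labels):
    """Toy learning algorithm: sum the feature vectors of each label."""
    model = defaultdict(Counter)
    for text, label in zip(documents, labels):
        model[label].update(extract_features(text))
    return model


def predict(model, text):
    """Apply the model: pick the label whose word profile overlaps most."""
    features = extract_features(text)

    def score(label):
        profile = model[label]
        total = sum(profile.values())
        return sum(count * profile[word] / total for word, count in features.items())

    return max(model, key=score)


# Hypothetical training documents and labels.
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday project meeting notes",
]
train_labels = ["spam", "ham", "spam", "ham"]

model = train(train_docs, train_labels)
print(predict(model, "buy pills now"))          # spam
print(predict(model, "notes for the meeting"))  # ham
```

The training path (documents and labels in, model out) runs once; the prediction path reuses the same feature extraction on each new document.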


  41. Tools for predictive
    analytics


  42. Small data
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New (text doc, image, sound, transaction) → Feature vector → Model → Expected Label


  43. Small / Medium data
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New (text doc, image, sound, transaction) → Feature vector → Model → Expected Label


  44. Small / Medium data with [tool logos not captured in the transcript]
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New (text doc, image, sound, transaction) → Feature vector → Model → Expected Label


  46. Predictive Analytics on
    Big Data


  47. Big data with [tool logos not captured in the transcript]
    Training (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New (text doc, image, sound, transaction) → Feature vector → Model → Expected Label


  50. BIG DATA small(er) data


  52. From Big to Small
    • Feature extraction often shrinks data
    • Filter / Join / Group By / Count
    • Machine Learning performed on a small
    aggregate
    • Sampling for fast in-memory iterative modeling
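
One common way to get a small in-memory sample out of data too large to load is reservoir sampling, sketched below. The slide only says "sampling"; the specific algorithm, the function name and the synthetic event stream are assumptions of this example.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a kept item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical "big" event stream: a generator, never fully held in memory.
events = (("user%d" % (n % 1000), n) for n in range(100_000))
sample = reservoir_sample(events, k=1000)
print(len(sample))  # 1000
```

The sample fits in RAM whatever the length of the stream, so model fitting can be repeated quickly while iterating on features and parameters.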


  53. Back to the Regionator
    What if we did not have census data on daily mobility?


  54. Back to the Regionator
    • Use raw daily telco logs
    • Group By (phone, day) to
    extract daily trips
    • Join By GPS coordinates to
    “departement” names
    • Filter out small trips
    • Group By (home, work)
    “departements”
    • Count
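
The steps above can be sketched in plain Python on a handful of made-up log records. The record layout, the rule used to guess home and work locations, the distance threshold and the toy `departement` lookup are all assumptions of this example; a real pipeline would run the same group / join / filter / count steps on a cluster and use actual administrative boundaries.

```python
from collections import Counter, defaultdict
import math

# Hypothetical raw telco log records: (phone_id, day, hour, lat, lon).
log = [
    ("p1", "2014-06-02", 7, 48.85, 2.35), ("p1", "2014-06-02", 12, 48.80, 2.13),
    ("p2", "2014-06-02", 8, 48.86, 2.34), ("p2", "2014-06-02", 13, 48.81, 2.12),
    ("p3", "2014-06-02", 7, 48.85, 2.35), ("p3", "2014-06-02", 12, 48.851, 2.351),
]


def departement(lat, lon):
    """Toy stand-in for a spatial join against real departement boundaries."""
    return "Paris" if lon > 2.25 else "Yvelines"


# Group By (phone, day) to extract daily trips.
events = defaultdict(list)
for phone, day, hour, lat, lon in log:
    events[(phone, day)].append((hour, lat, lon))

trips = []
for day_events in events.values():
    day_events.sort()
    home = day_events[0][1:]  # assumption: first event of the day is at home
    work = min(day_events, key=lambda e: abs(e[0] - 12))[1:]  # assumption: noon is at work
    trips.append((home, work))

# Join GPS coordinates to departement names.
named = [(h, w, departement(*h), departement(*w)) for h, w in trips]

# Filter out small trips (threshold in degrees, chosen arbitrarily here).
MIN_TRIP_DEGREES = 0.05
long_trips = [t for t in named
              if math.hypot(t[0][0] - t[1][0], t[0][1] - t[1][1]) >= MIN_TRIP_DEGREES]

# Group By (home, work) departements and Count.
counts = Counter((home_dep, work_dep) for _, _, home_dep, work_dep in long_trips)
print(counts)  # Counter({('Paris', 'Yvelines'): 2})
```

The output is a tiny table of (home, work) pairs with counts, no matter how large the raw log was.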


  55. Data size and
    modeling quality


  56. “We don’t have better algorithms. We just have
    more data.”
    – Peter Norvig, Research Director, Google


  57. More data beats
    better models?


  58. http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html


  59. Let’s train a parametric model
    to read handwritten digits from gray level pixels.


  60. model stops improving


  61. Bias vs Variance


  62. high bias
    high variance


  63. high bias
    high variance
    low variance


  64. Variance solution #1:
    collect more samples


  65. Let’s train a non-parametric model
    to read handwritten digits from gray level pixels.


  66. high variance, almost no bias
    variance decreasing with # samples


  67. Bias solution #1:
    non-parametric models
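
The contrast between the two learning curves can be reproduced on a toy problem. The sketch below does not use the handwritten digits from the slides; it substitutes a 1-D, noise-free target y = sin(x), a least-squares line as the parametric model and a 1-nearest-neighbor predictor as the non-parametric one. All names and sizes are assumptions of this example.

```python
import math
import random


def make_data(n, rng):
    """Sample n points from y = sin(x) on [0, 2*pi] (noise-free toy problem)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return xs, [math.sin(x) for x in xs]


def fit_linear(xs, ys):
    """Parametric model: least-squares line y = a * x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return lambda x: a * x + (my - a * mx)


def fit_nearest_neighbor(xs, ys):
    """Non-parametric model: predict the target of the closest training point."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
test_x, test_y = make_data(200, rng)
errors = {}
for n in (10, 100, 1000):
    train_x, train_y = make_data(n, rng)
    errors[n] = (
        mse(fit_linear(train_x, train_y), test_x, test_y),
        mse(fit_nearest_neighbor(train_x, train_y), test_x, test_y),
    )
    print(n, "linear: %.3f" % errors[n][0], "nearest neighbor: %.5f" % errors[n][1])
```

The line cannot represent a sine wave, so its test error plateaus however many samples are added (bias), while the nearest-neighbor error keeps shrinking as the training set grows.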


  68. Type    # rooms  Surface (m2)  Floor  Public transp.
      Apart.  3        65            2      Yes
      House   5        110           NA     No
      Duplex  4        95            4      Yes
      (rows: samples; columns: features)


  69. Type    # rooms  Surface (m2)  Floor  Public transp.  School (km)  Flood plain
      Apart.  3        65            2      Yes             1.0          No
      House   5        110           NA     No              25.0         Yes
      Duplex  4        95            4      Yes             0.5          No
      (rows: samples; columns: features)


  70. Bias solution #2:
    richer features


  71. Data has 2 dimensions:
    # samples and # features


  72. Key takeaway points:


  73. • Big Data ≠ Predictive Analytics
    • Predictive models are often built from small
    aggregate data (with sampling) << raw data
    • Modeling requires interactive / fast iterations
    • More data generally helps build better models
    but not always: noise or inadequate representation
    • 2 dimensions: # samples & # features


  74. Thank you!
    Questions?
    @Inria @ogrisel http://scikit-learn.org


  75. • Parametric e.g. linear model (traditional stats) vs
    Non-parametric e.g. Random Forests, Neural
    Networks (Machine Learning)
    • Understand a model with 10% accuracy vs blindly
    trust a model with 90% accuracy
    • Simple models e.g. F = m a, F = - G m1 m2 /
    r^2 will not become false(r) because of big data
    • New problems can be tackled: computer vision,
    speech recognition, natural language
    understanding


  76. • The (experimental) scientific method, as described
    by Karl Popper, is based on the falsifiability of
    formulated hypotheses
    • A theory is considered correct as long as its past
    predictions hold in new experiments
    • Machine learning train-validation-test splits and
    cross-validation are similar in spirit
    • An ML model is just a complex theory: correct as
    long as its predictions still hold
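
As an illustration of the cross-validation idea mentioned above, here is a minimal k-fold splitter. The function name and parameters are made up for this sketch; libraries such as scikit-learn ship their own, more complete implementations.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_splits(n_samples=10, k=5))
for train, validation in splits:
    # Each "experiment" evaluates on samples that were never used for fitting.
    assert not set(train) & set(validation)
print(len(splits))  # 5
```

Each fold plays the role of a new experiment: the model is fit on the other folds, and its predictions are then checked against held-out data that could falsify it.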
