Slide 1

Slide 1 text

Predictive Analytics Olivier Grisel — @ogrisel — June 16-17 2014

Slide 2

Slide 2 text

About me

Slide 3

Slide 3 text

Big Data as a buzzword

Slide 4

Slide 4 text

Triumph of the Nerds: Nate Silver Wins in 50 States http://mashable.com/2012/11/07/nate-silver-wins/

Slide 5

Slide 5 text

Triumph of the Nerds: Nate Silver Wins in 50 States http://mashable.com/2012/11/07/nate-silver-wins/

Slide 6

Slide 6 text

Nate Silver’s election model, Big Data?
$ git clone gh:jseabold/538model
$ du -h 538model/data
188K 538model/data

Slide 7

Slide 7 text

15% of the capacity of a 3.5-inch floppy disk

Slide 8

Slide 8 text

Regionator 3000 http://labs.data-publica.com/regionator3000/

Slide 9

Slide 9 text

http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

Slide 10

Slide 10 text

http://transports.blog.lemonde.fr/2014/06/05/regionator-la-carte-de-france-dessinee-par-les-trajets-quotidiens/

Slide 11

Slide 11 text

http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

Slide 12

Slide 12 text

http://www.insee.fr/fr/themes/detail.asp?reg_id=99&ref_id=mobilite-professionnelle-10

Slide 13

Slide 13 text

120% of the capacity of a 3.5-inch floppy disk

Slide 14

Slide 14 text

Big Data ≠ Predictive Analytics

Slide 15

Slide 15 text

Goals of this talk
• What Big Data actually is or isn’t
• Introduce predictive analytics concepts & tools
• Study the impact of data size on analytics

Slide 16

Slide 16 text

How big is Big Data?

Slide 17

Slide 17 text

“Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
– Wikipedia

Slide 18

Slide 18 text

Not Big Data
• Data that fits in a spreadsheet
• Data that can be analyzed in RAM (< 100 GB)
• Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server

Slide 19

Slide 19 text

Reading the full content of a 1 TB HDD at 100 MB/s: about 2 hours 45 minutes
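
The figure above is plain arithmetic, and the same calculation gives the "more than 10 days" estimate for 100 TB on the next slide. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a sustained sequential throughput from a single disk; the function name is illustrative:

```python
def sequential_read_seconds(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time needed to read `size_bytes` once, front to back, from a single disk."""
    return size_bytes / throughput_bytes_per_s


MB = 10**6
TB = 10**12

one_tb = sequential_read_seconds(1 * TB, 100 * MB)
hundred_tb = sequential_read_seconds(100 * TB, 100 * MB)

print(f"1 TB:   {one_tb / 3600:.2f} hours")
print(f"100 TB: {hundred_tb / 86400:.1f} days")
```

Real disks rarely sustain their peak rate over a full scan, so these numbers are best read as lower bounds.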

Slide 20

Slide 20 text

Canonical Big Data problem: indexing the Web
• Inverted index on terabytes of text data
• Process each HTML page as a URL + bag of words
• For each word, aggregate the list of page URLs
• 2 billion HTML pages: 100 TB, more than 10 days just to read sequentially
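
The indexing steps listed above can be sketched on a single machine as follows. This is a toy illustration, not a web-scale implementation: the two example pages and the function name are made up, and tokenization is a bare whitespace split.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages:
        # Bag of words: lowercase, split on whitespace, ignore order and counts.
        for word in set(text.lower().split()):
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical two-page "web"; real pages would be HTML stripped of markup.
pages = [
    ("http://example.com/a", "big data is a buzzword"),
    ("http://example.com/b", "predictive analytics on small data"),
]
index = build_inverted_index(pages)
print(index["data"])
```

The per-page word extraction and the per-word aggregation are independent steps, which is why this problem maps naturally onto a distributed map / group-by computation once the data no longer fits on one node.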

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Non-traditional architectures
• Hadoop: HDFS / MapReduce, Pig, Hive
• Sharded, replicated NoSQL: BigTable, DynamoDB, Cassandra, HBase, ElasticSearch
• Distributed event stream processing: Kafka, Storm
• Next-gen cluster processing / distributed analytical DB: YARN / Tez, Spark, Impala, PrestoDB, Redshift…

Slide 24

Slide 24 text

How heavy is a copy of the web?

Slide 25

Slide 25 text

1 TB ≈ 1 kg when stored on a Hadoop node

Slide 26

Slide 26 text

All HTML pages ≈ 100 TB ≈ 100 kg

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

All the web ≈ 10 PB ≈ 10 tonnes

Slide 29

Slide 29 text

Other Big Data examples
• GSM location event log from a telco
• Transaction log of a big retail network
• Raw traffic data on a large website or app
• Intra-day tick data from a stock exchange

Slide 30

Slide 30 text

Not Big Data
• Polls data (~10K data points)
• Census data (~10M data points)
• Real estate transactions data (~10M data points)
• Open / High / Low / Close (OHLC) stock prices (~10K data points)
• Any dataset publicly available for download

Slide 31

Slide 31 text

What is Predictive Analytics (and Machine Learning)?

Slide 32

Slide 32 text

• Make predictions of outcomes on new data
• Alternative to hard-coded rules written by experts
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model

Slide 33

Slide 33 text

Type       # rooms  Surface (m2)  Floor  Public transport
Apartment  3        65            2      Yes
House      5        110           NA     No
Duplex     4        95            4      Yes

Slide 34

Slide 34 text

Type       # rooms  Surface (m2)  Floor  Public transport
Apartment  3        65            2      Yes
House      5        110           NA     No
Duplex     4        95            4      Yes
(rows = samples; columns = features)

Slide 35

Slide 35 text

Type       # rooms  Surface (m2)  Floor  Public transport  Sold
Apartment  3        65            2      Yes               300k
House      5        110           NA     No                1.5M
Duplex     4        95            4      Yes               2.2M
(rows = samples; first five columns = features; Sold = target)

Slide 36

Slide 36 text

Type       # rooms  Surface (m2)  Floor  Public transport  Sold
Apartment  3        65            2      Yes               300k
House      5        110           NA     No                1.5M
Duplex     4        95            4      Yes               2.2M
(rows = samples; first five columns = features; Sold = target)

New sample:
Apartment  2        35            3      Yes

Slide 37

Slide 37 text

Type       # rooms  Surface (m2)  Floor  Public transport  Sold
Apartment  3        65            2      Yes               300k
House      5        110           NA     No                1.5M
Duplex     4        95            4      Yes               2.2M
(rows = samples; first five columns = features; Sold = target)

New sample:
Apartment  2        35            3      Yes               ?
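
To make the features / samples / target vocabulary concrete, here is a minimal sketch that encodes the table rows as numeric feature vectors and fills in the "?" by copying the price of the closest known sample. The nearest-neighbor lookup is only a stand-in for a real learning algorithm, the encoding choices (one-hot type, a "floor missing" flag, unscaled values) are assumptions, and three samples are far too few for a meaningful prediction.

```python
import math

TYPES = ["Apartment", "House", "Duplex"]


def to_feature_vector(row):
    """Encode one table row (type, rooms, surface, floor, transport) as numbers."""
    kind, rooms, surface, floor, transport = row
    one_hot = [1.0 if kind == t else 0.0 for t in TYPES]
    # "NA" floors become 0 plus an explicit "missing" indicator.
    floor_missing = 1.0 if floor is None else 0.0
    floor_value = 0.0 if floor is None else float(floor)
    return one_hot + [float(rooms), float(surface), floor_value,
                      floor_missing, 1.0 if transport else 0.0]


# Samples and targets taken from the table above.
samples = [
    ("Apartment", 3, 65, 2, True),
    ("House", 5, 110, None, False),
    ("Duplex", 4, 95, 4, True),
]
targets = [300_000, 1_500_000, 2_200_000]


def predict_nearest(new_row):
    """Return the target of the training sample closest in feature space."""
    new_vec = to_feature_vector(new_row)
    distances = [math.dist(new_vec, to_feature_vector(s)) for s in samples]
    return targets[distances.index(min(distances))]


print(predict_nearest(("Apartment", 2, 35, 3, True)))
```

Because the features are not rescaled, the surface dominates the distance; a real pipeline would normalize each column first.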

Slide 38

Slide 38 text

Applications in Business
• Forecast sales, customer churn, traffic, prices
• Predict CTR and optimal bid price for online ads
• Build computer vision systems for robots in industry and agriculture
• Detect network anomalies, fraud and spam
• Recommend products, movies, music

Slide 39

Slide 39 text

Applications in Science
• Decode the activity of the brain recorded via fMRI / EEG / MEG
• Decode gene expression data to model regulatory networks
• Predict the distance of each star in the sky
• Identify the Higgs boson in proton-proton collisions

Slide 40

Slide 40 text

Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions

Slide 41

Slide 41 text

Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions + Labels

Slide 42

Slide 42 text

Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm

Slide 43

Slide 43 text

Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model

Slide 44

Slide 44 text

Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label
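
The two phases of this data flow can be sketched in a few lines. The classifier below is a deliberately naive toy written for illustration (its class name, the training documents, and the labels are all made up); the point is only the shape of the flow: raw inputs become feature vectors, a fitting step turns feature vectors plus labels into a model, and the model maps a new feature vector to an expected label. scikit-learn, linked at the end of the talk, exposes the same fit / predict split.

```python
from collections import Counter, defaultdict


def extract_features(text):
    """Feature vector for a text document: word -> count (bag of words)."""
    return Counter(text.lower().split())


class WordOverlapClassifier:
    """Toy model: remembers word counts per label, predicts the best-matching label."""

    def fit(self, feature_vectors, labels):
        self.word_counts_ = defaultdict(Counter)
        for features, label in zip(feature_vectors, labels):
            self.word_counts_[label].update(features)
        return self

    def predict(self, features):
        def score(label):
            counts = self.word_counts_[label]
            return sum(counts[word] * n for word, n in features.items())
        return max(sorted(self.word_counts_), key=score)


# Training phase: documents + labels -> feature vectors -> model.
train_docs = ["cheap pills buy now", "meeting agenda for monday",
              "buy cheap watches", "monday project meeting notes"]
train_labels = ["spam", "ham", "spam", "ham"]
model = WordOverlapClassifier().fit(
    [extract_features(doc) for doc in train_docs], train_labels)

# Prediction phase: new document -> feature vector -> expected label.
print(model.predict(extract_features("buy pills now")))
```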

Slide 45

Slide 45 text

Tools for predictive analytics

Slide 46

Slide 46 text

SPSS MATLAB

Slide 47

Slide 47 text

SPSS MATLAB

Slide 48

Slide 48 text

Small data
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 49

Slide 49 text

Small / Medium data
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 50

Slide 50 text

Small / Medium data with
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 51

Slide 51 text

Small / Medium data with
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Predictive Analytics on Big Data

Slide 55

Slide 55 text

Big data with
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 56

Slide 56 text

Big data with
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 57

Slide 57 text

Big data with
Training: text docs, images, sounds, transactions → Feature vectors; Feature vectors + Labels → Machine Learning Algorithm → Model
Prediction: new text doc, image, sound, transaction → Feature vector → Model → Expected Label

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

BIG DATA

Slide 60

Slide 60 text

BIG DATA small(er) data

Slide 61

Slide 61 text

BIG DATA small(er) data

Slide 62

Slide 62 text

From Big to Small
• Feature extraction often shrinks data
• Filter / Join / Group By / Count
• Machine Learning performed on a small aggregate
• Sampling for fast in-memory iterative modeling
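
One way to implement the sampling bullet when the raw data cannot be loaded at once is reservoir sampling, which keeps a uniform random subset in a single pass. This is one possible technique, not necessarily the one used in the talk; the function name, the `seed` parameter, and the stand-in event stream are illustrative.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


events = range(100_000)  # stand-in for a log too large to hold in memory
subset = reservoir_sample(events, k=1000)
print(len(subset))
```

The resulting subset is small enough for fast in-memory iteration, while every event in the stream had the same chance of being selected.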

Slide 63

Slide 63 text

Back to the Regionator
What if we did not have census data on daily mobility?

Slide 64

Slide 64 text

Back to the Regionator
• Use raw daily telco logs
• Group By (phone, day) to extract daily trips
• Join By GPS coordinates to “département” names
• Filter out small trips
• Group By (home, work) “départements”
• Count
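
The pipeline above can be sketched in memory on a handful of records. Everything concrete here is hypothetical: the log layout, the tower lookup table with made-up planar coordinates in km, the rule that the first event of the day is "home" and the event closest to noon is "work", and the 5 km threshold. On real telco logs each step would run on a distributed engine, but the output has the same shape: a small table of (home, work) counts.

```python
import math
from collections import Counter, defaultdict

# Hypothetical lookup table: cell tower -> (x_km, y_km, département).
TOWERS = {
    "t1": (0.0, 0.0, "Paris"),
    "t2": (1.0, 0.5, "Paris"),
    "t3": (12.0, 3.0, "Hauts-de-Seine"),
}

# Hypothetical raw log: (phone, day, hour, tower).
LOG = [
    ("p1", "2014-06-02", 7, "t1"), ("p1", "2014-06-02", 12, "t3"),
    ("p2", "2014-06-02", 8, "t1"), ("p2", "2014-06-02", 13, "t2"),
    ("p3", "2014-06-02", 6, "t2"), ("p3", "2014-06-02", 11, "t3"),
]


def commute_counts(log, towers, min_km=5.0):
    # Group By (phone, day) to collect each phone's events for one day.
    events = defaultdict(list)
    for phone, day, hour, tower in log:
        events[(phone, day)].append((hour, tower))

    counts = Counter()
    for day_events in events.values():
        # Assumption: first event of the day is "home", closest to noon is "work".
        home = min(day_events)[1]
        work = min(day_events, key=lambda e: abs(e[0] - 12))[1]
        # Join: tower -> coordinates and département name.
        hx, hy, home_dep = towers[home]
        wx, wy, work_dep = towers[work]
        # Filter out small trips.
        if math.dist((hx, hy), (wx, wy)) < min_km:
            continue
        # Group By (home, work) départements and Count.
        counts[(home_dep, work_dep)] += 1
    return counts


print(commute_counts(LOG, TOWERS))
```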

Slide 65

Slide 65 text

Data size and modeling quality

Slide 66

Slide 66 text

“We don’t have better algorithms. We just have more data.”
– Peter Norvig, Research Director, Google

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

More data beats better models?

Slide 71

Slide 71 text

http://technocalifornia.blogspot.fr/2012/07/more-data-or-better-models.html

Slide 72

Slide 72 text

Let’s train a parametric model to read handwritten digits from gray level pixels.

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

model stops improving

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Bias vs Variance

Slide 77

Slide 77 text

high bias

Slide 78

Slide 78 text

high bias high variance

Slide 79

Slide 79 text

high bias high variance low variance

Slide 80

Slide 80 text

Variance solution #1: collect more samples

Slide 81

Slide 81 text

Let’s train a non-parametric model to read handwritten digits from gray level pixels.

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

high variance, almost no bias: variance decreasing with # samples

Slide 84

Slide 84 text

Bias solution #1: non-parametric models

Slide 85

Slide 85 text

Type     # rooms  Surface (m2)  Floor  Public Transp.
Apart.   3        65            2      Yes
House    5        110           NA     No
Duplex   4        95            4      Yes
(rows = samples; columns = features)

Slide 86

Slide 86 text

Type     # rooms  Surface (m2)  Floor  Public Transp.  School (km)  Flood plain
Apart.   3        65            2      Yes             1.0          No
House    5        110           NA     No              25.0         Yes
Duplex   4        95            4      Yes             0.5          No
(rows = samples; columns = features)

Slide 87

Slide 87 text

Bias solution #2: richer features

Slide 88

Slide 88 text

Data has 2 dimensions: # samples and # features

Slide 89

Slide 89 text

Key takeaway points:

Slide 90

Slide 90 text

• Big Data ≠ Predictive Analytics
• Predictive models are often built from small aggregate data (with sampling), much smaller than the raw data
• Modeling requires interactive / fast iterations
• More data generally helps build better models, but not always: noise or an inadequate representation can limit the gain
• 2 dimensions: # samples & # features

Slide 91

Slide 91 text

Thank you! Questions?
@Inria @ogrisel http://scikit-learn.org

Slide 92

Slide 92 text

Bonus track

Slide 93

Slide 93 text

• Parametric, e.g. linear models (traditional stats), vs non-parametric, e.g. Random Forests, Neural Networks (Machine Learning)
• Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
• Simple models, e.g. F = m a, F = -G m1 m2 / r^2, will not become false(r) because of big data
• New problems can be tackled: computer vision, speech recognition, natural language understanding

Slide 94

Slide 94 text

• The (experimental) scientific method, as formalized by Karl Popper, is based on the falsifiability of formulated hypotheses
• A theory is considered correct as long as its past predictions hold in new experiments
• Machine learning train-validation-test splits and cross-validation are similar in spirit
• An ML model is just a complex theory: correct as long as its predictions still hold
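
As an illustration of the cross-validation mentioned above, here is a minimal sketch of a k-fold split: every sample is held out exactly once, so each fold acts as a "new experiment" that can falsify the model fitted on the remaining folds. The function name and the `seed` parameter are illustrative and do not refer to any particular library's API.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "samples used for fitting")
```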