collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
• Data that can be analyzed in RAM (< 100 GB)
• Data operations that can be performed quickly by a traditional database, e.g. a single-node PostgreSQL server
on terabytes of text data
• Process each HTML page as a URL + bag of words
• For each word, aggregate the list of page URLs
• 2 billion HTML pages: ~100 TB, >10 days just to read sequentially
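A minimal single-machine sketch of the word-to-URL aggregation described above; the tiny `pages` dict is made up, and a real 2-billion-page crawl would need a distributed map/reduce job rather than one Python process:

```python
from collections import defaultdict

# Hypothetical tiny corpus: each page is a URL plus its bag of words.
pages = {
    "http://example.com/a": ["big", "data", "tools"],
    "http://example.com/b": ["data", "processing"],
}

# For each word, aggregate the set of page URLs that contain it.
inverted_index = defaultdict(set)
for url, words in pages.items():
    for word in set(words):  # deduplicate words within a page
        inverted_index[word].add(url)

print(sorted(inverted_index["data"]))
# ['http://example.com/a', 'http://example.com/b']
```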
Census data (~10M data points)
• Real estate transactions data (~10M data points)
• Open / High / Low / Close (OHLC) stock prices (~10K data points)
• Any dataset publicly available for download
to hard-coded rules written by experts
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model
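As a hedged illustration of summarizing training data into an executable predictive model (rather than hand-coding rules), a minimal scikit-learn sketch on made-up toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up historical data: one row per example, one column per feature.
X_train = np.array([[0.1, 2.0], [0.4, 1.5], [0.9, 0.3], [1.2, 0.1]])
y_train = np.array([0, 0, 1, 1])

# Summarize the training data into an executable predictive model,
# instead of hand-coding a rule such as "if x[0] > 0.5 then class 1".
model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[0.2, 1.8], [1.0, 0.2]]))  # e.g. [0 1]
```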
• Predict CTR and optimal bid price for online ads
• Build computer vision systems for robots in industry and agriculture
• Detect network anomalies, fraud, and spam
• Recommend products, movies, music
recorded via fMRI / EEG / MEG
• Decode gene expression data to model regulatory networks
• Predict the distance to each star in the sky
• Identify the Higgs boson in proton-proton collisions
• Group By (phone, day) to extract daily trips
• Join GPS coordinates to “departement” names
• Filter out small trips
• Group By (home, work) “departements”
• Count
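A rough pandas sketch of this pipeline, assuming a hypothetical `pings` DataFrame of GPS pings and a stand-in coordinate-to-departement lookup; the column names and the first-ping/last-ping heuristic are illustrative, and data at national scale would normally run on a distributed engine:

```python
import pandas as pd

# Hypothetical input: one row per GPS ping (column names are illustrative).
pings = pd.DataFrame({
    "phone": ["a", "a", "b", "b"],
    "day":   ["2024-01-01"] * 4,
    "lat":   [48.85, 48.80, 45.76, 45.70],
    "lon":   [2.35, 2.30, 4.84, 4.80],
})

def lookup_departement(lat, lon):
    # Stand-in for a real spatial join of coordinates to "departement" names.
    return "75-Paris" if lat > 48 else "69-Rhone"

pings["departement"] = [lookup_departement(la, lo)
                        for la, lo in zip(pings["lat"], pings["lon"])]

# Group By (phone, day) to extract daily trips:
# here the first ping of the day is taken as "home", the last as "work".
trips = pings.groupby(["phone", "day"]).agg(
    home=("departement", "first"),
    work=("departement", "last"),
    n_pings=("departement", "size"),
).reset_index()

# Filter out small trips (too few pings to be a plausible commute).
trips = trips[trips["n_pings"] >= 2]

# Group By (home, work) departements and count the commuting flows.
flows = trips.groupby(["home", "work"]).size().rename("count").reset_index()
print(flows)
```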
often built from small aggregate data (with sampling), much smaller than the raw data
• Modeling requires interactive / fast iterations
• More data generally helps build better models, but not always: noise or an inadequate representation
• 2 dimensions: # samples & # features
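A small sketch of that sampling step and of the samples x features view of a dataset (shapes are made up):

```python
import numpy as np

# Made-up "raw data": 100k samples x 20 features (the 2 dimensions of a dataset).
rng = np.random.default_rng(0)
n_raw_samples, n_features = 100_000, 20
X_raw = rng.normal(size=(n_raw_samples, n_features))

# Sample / aggregate the raw data down to a small training set that fits in RAM
# and allows fast, interactive modeling iterations.
idx = rng.choice(n_raw_samples, size=10_000, replace=False)
X_train = X_raw[idx]
print(X_raw.shape, "->", X_train.shape)  # (100000, 20) -> (10000, 20)
```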
Random Forests, Neural Networks (Machine Learning)
• Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy
• Simple models, e.g. F = m a or F = - G m1 m2 / r^2, will not become false(r) because of big data
• New problems can be tackled: computer vision, speech recognition, natural language understanding
based on the falsifiability of formulated hypotheses
• A theory is correct as long as its past predictions hold in new experiments
• Machine learning train-validation-test splits and cross-validation are similar in spirit
• An ML model is just a complex theory: correct as long as its predictions still hold
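To make the analogy concrete, a minimal scikit-learn sketch of a held-out test split plus cross-validation on made-up data; the only point is that the model, like a theory, is judged on predictions it has not seen:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Made-up data: 200 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Held-out test set: the "new experiments" the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation: repeatedly re-split the training data to check that
# the "theory" keeps holding on data it was not fit on.
print("cv accuracy:",
      cross_val_score(LogisticRegression(), X_train, y_train, cv=5).mean())
```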