Big Data, Predictive Modeling and tools

Big Data, Predictive Modeling & Tools Olivier Grisel — @ogrisel
CCIP October 2015

About me

Big Data as a buzzword

Regionator 3000 http://labs.data-publica.com/regionator3000/

http://transports.blog.lemonde.fr/2014/06/05/regionator-la- carte-de-france-dessinee-par-les-trajets-quotidiens/

http://www.insee.fr/fr/themes/detail.asp? reg_id=99&ref_id=mobilite-professionnelle-10

120% of the capacity of a 3’5 ﬂoppy disk

Big Data ≠ Predictive Analytics

Outline • What Big Data actually is or isn’t •
Predictive modeling concepts & tools • Big Data architecture & tools

How big is Big Data?

Not Big Data • Data that ﬁts on a spreadsheet
• Data that can be analyzed in RAM (< 10 GB) • Data operations that can be performed quickly by a traditional database, e.g. single node PostgreSQL server

Reading the full content of a 1TB HDD at 100MB/s:
2 hours 45 minutes

Canonical Big Data problem: indexing the Web • For each
word, aggregate the list of page URLs • 2 billion HTML pages: 100TB >10 days just to read sequentially

Other Big Data examples • GSM location event log from
telco • Transaction log of a big retail network • Activity records for a service with 10s of millions of users

What is Predictive Analytics? and Machine Learning

Type # rooms Surface m2 Floor Public Transports Apartment 3
65 2 Yes House 5 110 NA No Duplex 4 95 4 Yes features samples Sold 300k 1.5M 2.2M target

Type # rooms Surface m2 Floor Public Transports Apartment 3
65 2 Yes House 5 110 NA No Duplex 4 95 4 Yes features samples Sold 300k 1.5M 2.2M target Apartment 2 35 3 Yes ?

Predictive Modeling • Automated predictions of outcome on new data
• Alternative to hard-coded rules written by experts • Extract the structure of historical data • Statistical tools to summarize the training data into an executable predictive model

Training text docs images sounds transactions Labels Machine Learning Algorithm
Model Predictive Modeling Data Flow Feature vectors

New text doc image sound transaction Model Expected Label Predictive
Modeling Data Flow Feature vector Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors

Predictive Models • Ex: Movie recommendations or targeted ads •
Embedded in a user service to make it more useful / attractive / proﬁtable • A wrong individual decision is not costly • Large number of small automated decisions • Humans would not be fast enough to make the predictions

Tools for predictive analytics

SPSS MATLAB

New text doc image sound transaction Model Expected Label Small
data Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors Feature vector

/ Medium data Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors Feature vector

/ Medium data with Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors Feature vector

Model Expected Label Big data with Machine Learning Algorithm New
text doc image sound transaction Training text docs images sounds transactions Labels Feature vectors Feature vector

Big Data architecture(s) & tools

Distributed Event Log & Queues Distributed Event Stream Processing Distributed
Storage Distributed Batch Processing On-line Transaction Processing & App Views Analytical Database Predictive Models

Thank you! Questions? @Inria @ogrisel http://scikit-learn.org

Bonus tracks

Back to the Regionator What if we did not have
census data on daily mobility?

Back to the Regionator • Use raw daily telco logs
• Group By (phone, day) to extract daily trips • Join By GPS coordinates to “departement” names • Filter out small trips • Group By (home, work) “departements” • Count

Predictive Analytics on Big Data

BIG DATA

BIG DATA small(er) data

From Big to Small • Feature extraction often shrinks data
• Filter / Join / Group By / Count • Machine Learning performed on aggregates • Sampling for fast in-memory iterative modeling

Data size and modeling quality

– Peter Norvig, Research Director, Google “We don’t have better
algorithms. We just have more data.”

More data beats better models?

http://technocalifornia.blogspot.fr/2012/07/more-data-or-better- models.html

Let’s train a parametric model to read handwritten digits from
gray level pixels.

model stops improving

Bias vs Variance

high bias

high bias high variance

high bias high variance low variance

Variance solution #1: collect more samples

Let’s train a non-parametric model to read handwritten digits from
gray level pixels.

high variance almost no bias variance decreasing with #samples

Bias solution #1: non-parametric models

Type # rooms Surface (m2) Floor Public Transp. Apart. 3
65 2 Yes House 5 110 NA No Duplex 4 95 4 Yes features samples

Type # rooms Surface (m2) Floor Public Transp. School (km)
Flood plain Apart. 3 65 2 Yes 1.0 No House 5 110 NA No 25.0 Yes Duplex 4 95 4 Yes 0.5 No features samples

Data has 2 dimensions: # samples and # features

Bias solution #2: richer features

• Parametric e.g. linear model (traditional stats) vs Non-parametric e.g.
Random Forests, Neural Networks (Machine Learning) • Understand a model with 10% accuracy vs blindly trust a model with 90% accuracy • Simple models e.g. F = m a, F = - G (m1 + m2) / r^2 will not become false(r) because of big data • New problems can be tackled: computer vision, speech recognition, natural language understanding

• the (experimental) scientiﬁc method introduced by Karl Popper is
based on the falsiﬁability of formulated hypotheses • theory is correct as long as past predictions hold in new experiments • machine learning train-validation-test splits and cross-validation is similar in spirit • ml model is just a complex theory: correct as long as its predictions still hold

Big Data, Predictive Modeling and tools

Big Data, Predictive Modeling and tools

More Decks by Olivier Grisel

Other Decks in Technology

Featured

Transcript