Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
How to use scikit-learn to solve machine learni...
Search
Olivier Grisel
April 22, 2015
Technology
0
1k
How to use scikit-learn to solve machine learning problems
AutoML Hackathon - Paris - April 2015
Olivier Grisel
April 22, 2015
Tweet
Share
More Decks by Olivier Grisel
See All by Olivier Grisel
Intro to scikit-learn
ogrisel
5
690
An Intro to Deep Learning
ogrisel
1
270
Predictive Modeling and Deep Learning
ogrisel
2
350
Intro to scikit-learn and what's new in 0.17
ogrisel
1
350
Big Data, Predictive Modeling and tools
ogrisel
2
270
Recent Developments in Deep Learning
ogrisel
3
670
Documentation
ogrisel
2
230
Build and test wheel packages on Linux, OSX and Windows
ogrisel
2
330
Big Data and Predictive Modeling
ogrisel
3
230
Other Decks in Technology
See All in Technology
マーケットプレイス版Oracle WebCenter Content For OCI
oracle4engineer
PRO
3
960
さくらのIaaS基盤のモニタリングとOpenTelemetry/OSC Hokkaido 2025
fujiwara3
3
440
Operating Operator
shhnjk
1
580
改めてAWS WAFを振り返る~業務で使うためのポイント~
masakiokuda
2
250
開発生産性を組織全体の「生産性」へ! 部門間連携の壁を越える実践的ステップ
sudo5in5k
2
7k
KubeCon + CloudNativeCon Japan 2025 Recap by CA
ponkio_o
PRO
0
300
Delta airlines Customer®️ USA Contact Numbers: Complete 2025 Support Guide
deltahelp
0
680
United airlines®️ USA Contact Numbers: Complete 2025 Support Guide
unitedflyhelp
0
310
事業成長の裏側:エンジニア組織と開発生産性の進化 / 20250703 Rinto Ikenoue
shift_evolve
PRO
2
21k
生成AI時代の開発組織・技術・プロセス 〜 ログラスの挑戦と考察 〜
itohiro73
1
460
ゼロからはじめる採用広報
yutadayo
3
910
Backlog ユーザー棚卸しRTA、多分これが一番早いと思います
__allllllllez__
1
150
Featured
See All Featured
Being A Developer After 40
akosma
90
590k
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
6
300
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
30
2.1k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
233
17k
Designing for humans not robots
tammielis
253
25k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
281
13k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
161
15k
Optimizing for Happiness
mojombo
379
70k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
26
2.9k
Transcript
How to use scikit-learn to solve machine learning problems AutoML
Hackathon April 2015
Outline • Machine Learning refresher • scikit-learn • Demo: interactive
predictive modeling on Census Data with IPython notebook / pandas / scikit-learn • Combining models with Pipeline and parameter search
Predictive modeling ~= machine learning • Make predictions of outcome
on new data • Extract the structure of historical data • Statistical tools to summarize the training data into a executable predictive model • Alternative to hard-coded rules written by experts
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train)
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train) Apartment 2 33 TRUE House 4 210 TRUE samples (test) ? ?
Training text docs images sounds transactions Labels Machine Learning Algorithm
Model Predictive Modeling Data Flow Feature vectors
New text doc image sound transaction Model Expected Label Predictive
Modeling Data Flow Feature vector Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors
Inventory forecasting & trends detection Predictive modeling in the wild
Personalized radios Fraud detection Virality and readers engagement Predictive maintenance Personality matching
• Library of Machine Learning algorithms • Focus on established
methods (e.g. ESL-II) • Open Source (BSD) • Simple fit / predict / transform API • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test) accuracy_score(y_test, y_pred)
Support Vector Machine from sklearn.svm import SVC model = SVC(kernel="rbf",
C=1.0, gamma=1e-4) model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Linear Classifier from sklearn.linear_model import SGDClassifier model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Random Forests from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(n_estimators=200) model.fit(X_train,
y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
None
None
Demo time! http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/ master/sklearn_demos/Income%20classification.ipynb https://github.com/ogrisel/notebooks
Combining Models from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train)
Pipeline from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA from
sklearn.svm import SVC from sklearn.pipeline import make_pipeline pipeline = make_pipeline( StandardScaler(), RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), ) pipeline.fit(X_train, y_train)
Scoring manually stacked models scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train)
pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train) X_test_scaled = scaler.transform(X_test) X_test_pca = pca.transform(X_test_scaled) y_pred = svm.predict(X_test_pca) accuracy_score(y_test, y_pred)
Scoring a pipeline pipeline = make_pipeline( RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), )
pipeline.fit(X_train, y_train) y_pred = pipeline.predict(X_test) accuracy_score(y_test, y_pred)
Parameter search import numpy as np from sklearn.grid_search import RandomizedSearchCV
params = { 'randomizedpca__n_components': [5, 10, 20], 'svc__C': np.logspace(-3, 3, 7), 'svc__gamma': np.logspace(-6, 0, 7), } search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5) search.fit(X_train, y_train) # search.best_params_, search.grid_scores_
Thank you! • http://scikit-learn.org • https://github.com/scikit-learn/scikit-learn @ogrisel