How to use scikit-learn to solve machine learning problems
Olivier Grisel
April 22, 2015
AutoML Hackathon - Paris - April 2015
Transcript

How to use scikit-learn to solve machine learning problems
AutoML Hackathon, April 2015

Outline
• Machine Learning refresher
• scikit-learn
• Demo: interactive predictive modeling on Census Data with IPython notebook / pandas / scikit-learn
• Combining models with Pipeline and parameter search

Predictive modeling ~= machine learning
• Make predictions of outcomes on new data
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model
• Alternative to hard-coded rules written by experts

                 features                                             target
                 type        # rooms  surface     public trans       sold
                 (category)  (int)    (float m2)  (boolean)          (float k€)
samples (train)  Apartment   3        50          TRUE               450
                 House       5        254         FALSE              430
                 Duplex      4        68          TRUE               712
                 Apartment   2        32          TRUE               234
samples (test)   Apartment   2        33          TRUE               ?
                 House       4        210         TRUE               ?
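
The table above can be turned into numeric feature vectors before a model is fitted. The following is a minimal sketch, not code from the talk: the column names, the one-hot encoding via pandas, and the choice of RandomForestRegressor are all assumptions made for illustration, and four training samples are far too few for a meaningful model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Training samples copied from the table (prices in k€).
# Column names are hypothetical; the slide only shows header labels.
train = pd.DataFrame({
    "type": ["Apartment", "House", "Duplex", "Apartment"],
    "rooms": [3, 5, 4, 2],
    "surface": [50.0, 254.0, 68.0, 32.0],
    "public_trans": [True, False, True, True],
    "sold": [450.0, 430.0, 712.0, 234.0],
})

# Test samples: same features, target unknown.
test = pd.DataFrame({
    "type": ["Apartment", "House"],
    "rooms": [2, 4],
    "surface": [33.0, 210.0],
    "public_trans": [True, True],
})

feature_columns = ["type", "rooms", "surface", "public_trans"]

# One-hot encode the categorical column on train and test together
# so that both end up with the same set of columns.
all_features = pd.get_dummies(
    pd.concat([train[feature_columns], test[feature_columns]], ignore_index=True),
    columns=["type"],
).astype(float)

X_train = all_features.iloc[:len(train)]
X_test = all_features.iloc[len(train):]
y_train = train["sold"]

# Any regressor with fit / predict would do here; this one is an arbitrary choice.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predicted_prices = model.predict(X_test)
```

Encoding train and test together is a shortcut that works for a toy table; in a real project the encoder would be fitted on the training data only.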

Predictive Modeling Data Flow
[Diagram] Training: text docs, images, sounds, transactions -> Feature vectors + Labels -> Machine Learning Algorithm -> Model
[Diagram] Prediction: new text doc, image, sound, transaction -> Feature vector -> Model -> Expected Label
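
As an illustration of this data flow for text documents, here is a small sketch. The toy documents, the labels, and the choice of TfidfVectorizer with LogisticRegression are assumptions for the example and do not come from the slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Training phase: raw documents plus labels (toy data, made up for this sketch).
train_docs = [
    "cheap pills buy now",
    "limited offer buy cheap watches",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Documents -> feature vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Feature vectors + labels -> fitted model.
model = LogisticRegression()
model.fit(X_train, train_labels)

# Prediction phase: a new document goes through the SAME fitted vectorizer,
# then through the model to obtain the expected label.
new_doc = ["buy cheap pills"]
X_new = vectorizer.transform(new_doc)
expected_label = model.predict(X_new)[0]
```

The key point is that the feature extraction fitted on the training documents is reused unchanged on new documents, so both live in the same feature space.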

Predictive modeling in the wild
• Inventory forecasting & trends detection
• Personalized radios
• Fraud detection
• Virality and readers engagement
• Predictive maintenance
• Personality matching

scikit-learn
• Library of Machine Learning algorithms
• Focus on established methods (e.g. ESL-II)
• Open Source (BSD)
• Simple fit / predict / transform API
• Python / NumPy / SciPy / Cython
• Model Assessment, Selection & Ensembles

[Diagram] Train data + Train labels -> Model -> Fitted model; Test data -> Predicted labels; Predicted labels + Test labels -> Evaluation

    model = ModelClass(**hyperparams)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy_score(y_test, y_pred)
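
The generic fit / predict / evaluate pattern can be made concrete as follows. This is a sketch under assumptions: the synthetic dataset, the train/test split, and the use of LogisticRegression in place of the placeholder ModelClass are illustrative choices, not part of the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# ModelClass(**hyperparams) from the slide, with a concrete estimator.
hyperparams = {"C": 1.0, "max_iter": 1000}
model = LogisticRegression(**hyperparams)

model.fit(X_train, y_train)        # learn from the training data and labels
y_pred = model.predict(X_test)     # predict labels for held-out data
accuracy = accuracy_score(y_test, y_pred)  # compare against the true test labels
```

Because every scikit-learn estimator exposes the same methods, swapping LogisticRegression for another classifier leaves the remaining lines unchanged.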

Support Vector Machine

    from sklearn.svm import SVC

    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)

Linear Classifier

    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)

Random Forests

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
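
Since the three classifiers share the same API, they can be compared in a single loop. The sketch below reuses the hyperparameters shown on the slides, but the synthetic dataset and the dictionary of models are assumptions for illustration; the hyperparameters are not tuned for this data, so some scores may be poor.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same estimators and hyperparameters as on the slides;
# random_state is added only to make the sketch reproducible.
models = {
    "svm": SVC(kernel="rbf", C=1.0, gamma=1e-4),
    "linear": SGDClassifier(alpha=1e-4, penalty="elasticnet", random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    scores[name] = f1_score(y_test, y_predicted)
```

The loop body never needs to know which kind of model it is handling, which is what makes automated model selection practical.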

Demo time!
http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb
https://github.com/ogrisel/notebooks

Combining Models

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)

    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)

Pipeline

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(
        StandardScaler(),
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
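
The slide code targets the scikit-learn of 2015. RandomizedPCA was later deprecated and removed from the library; in current releases the equivalent is PCA with svd_solver="randomized". A sketch of the same pipeline for a recent scikit-learn, assuming a synthetic dataset in place of the talk's data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic training data standing in for the talk's dataset.
X_train, y_train = make_classification(
    n_samples=300, n_features=30, random_state=0
)

pipeline = make_pipeline(
    StandardScaler(),
    # Replacement for RandomizedPCA(n_components=10) in recent versions.
    PCA(n_components=10, svd_solver="randomized", random_state=0),
    SVC(C=0.1, gamma=1e-3),
)
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
```

The step names matter later: they become the prefixes used to address hyperparameters during a parameter search.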

Scoring manually stacked models

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)

    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)

    X_test_scaled = scaler.transform(X_test)
    X_test_pca = pca.transform(X_test_scaled)
    y_pred = svm.predict(X_test_pca)

    accuracy_score(y_test, y_pred)

Scoring a pipeline

    pipeline = make_pipeline(
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    accuracy_score(y_test, y_pred)
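
To check that a pipeline does the same work as manually chained steps, both versions can be run side by side. This sketch makes several assumptions: synthetic data, PCA(svd_solver="randomized") standing in for RandomizedPCA, a fixed random_state so both runs are reproducible, and a StandardScaler step in the pipeline so that it mirrors the manual version exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def make_pca():
    # Hypothetical helper: builds identically configured PCA steps.
    return PCA(n_components=10, svd_solver="randomized", random_state=0)


# Manual stacking: every intermediate transform is applied by hand,
# once for the training data and again for the test data.
scaler = StandardScaler()
pca = make_pca()
svm = SVC(C=0.1, gamma=1e-3)
svm.fit(pca.fit_transform(scaler.fit_transform(X_train)), y_train)
y_pred_manual = svm.predict(pca.transform(scaler.transform(X_test)))

# Pipeline: the same steps, chained automatically by fit and predict.
pipeline = make_pipeline(StandardScaler(), make_pca(), SVC(C=0.1, gamma=1e-3))
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)

manual_accuracy = accuracy_score(y_test, y_pred_manual)
pipeline_accuracy = accuracy_score(y_test, y_pred_pipeline)
```

The pipeline version removes the risk of forgetting a transform step at prediction time or of calling fit_transform on test data by mistake.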

Parameter search

    import numpy as np
    from sklearn.grid_search import RandomizedSearchCV

    params = {
        'randomizedpca__n_components': [5, 10, 20],
        'svc__C': np.logspace(-3, 3, 7),
        'svc__gamma': np.logspace(-6, 0, 7),
    }
    search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5)
    search.fit(X_train, y_train)
    # search.best_params_, search.grid_scores_
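
In recent scikit-learn releases the sklearn.grid_search module no longer exists; RandomizedSearchCV lives in sklearn.model_selection, and the per-candidate results are exposed as cv_results_ rather than grid_scores_. A sketch of the same search under those assumptions, with synthetic data and PCA(svd_solver="randomized") replacing RandomizedPCA (so the step prefix becomes pca__):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic training data standing in for the talk's dataset.
X_train, y_train = make_classification(
    n_samples=300, n_features=30, random_state=0
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(svd_solver="randomized", random_state=0),
    SVC(),
)

# Keys follow the "<step name>__<parameter name>" convention.
params = {
    "pca__n_components": [5, 10, 20],
    "svc__C": np.logspace(-3, 3, 7),
    "svc__gamma": np.logspace(-6, 0, 7),
}

# Sample 30 of the 147 possible combinations, each scored with 5-fold CV.
search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5, random_state=0)
search.fit(X_train, y_train)

best_params = search.best_params_
mean_scores = search.cv_results_["mean_test_score"]
```

Searching over the whole pipeline means preprocessing is refitted inside each cross-validation fold, which avoids leaking information from the validation folds into the scaler or the PCA.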

Thank you!
• http://scikit-learn.org
• https://github.com/scikit-learn/scikit-learn
@ogrisel