Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
How to use scikit-learn to solve machine learni...
Search
Olivier Grisel
April 22, 2015
Technology
0
1k
How to use scikit-learn to solve machine learning problems
AutoML Hackathon - Paris - April 2015
Olivier Grisel
April 22, 2015
Tweet
Share
More Decks by Olivier Grisel
See All by Olivier Grisel
Intro to scikit-learn
ogrisel
5
680
An Intro to Deep Learning
ogrisel
1
250
Predictive Modeling and Deep Learning
ogrisel
2
350
Intro to scikit-learn and what's new in 0.17
ogrisel
1
340
Big Data, Predictive Modeling and tools
ogrisel
2
260
Recent Developments in Deep Learning
ogrisel
3
670
Documentation
ogrisel
2
230
Build and test wheel packages on Linux, OSX and Windows
ogrisel
2
330
Big Data and Predictive Modeling
ogrisel
3
230
Other Decks in Technology
See All in Technology
QAはソフトウェアエンジニアリングを学んで実践するのが大事なの
ymty
1
370
フルカイテン株式会社 エンジニア向け採用資料
fullkaiten
0
7.2k
(非公式) AWS Summit Japan と 海浜幕張 の歩き方 2025年版
coosuke
PRO
0
140
Amazon Q Developer for GitHubとAmplify Hosting でサクッとデジタル名刺を作ってみた
kmiya84377
0
2.6k
讓測試不再 BB! 從 BDD 到 CI/CD, 不靠人力也能 MVP
line_developers_tw
PRO
0
120
API の仕様から紐解く「MCP 入門」 ~MCP の「コンテキスト」って何だ?~
cdataj
0
140
doda開発 生成AI元年宣言!自家製AIエージェントから始める生産性改革 / doda Development Declaration of the First Year of Generated AI! Productivity Reforms Starting with Home-grown AI Agents
techtekt
0
120
開発効率と信頼性を両立する Ubieのプラットフォームエンジニアリング
teru0x1
0
130
活きてなかったデータを活かしてみた話 / Shirokane Kougyou vol 19
sansan_randd
1
220
IAMのマニアックな話 2025を執筆して、 見えてきたAWSアカウント管理の現在
nrinetcom
PRO
3
510
AWS と定理証明 〜ポリシー言語 Cedar 開発の舞台裏〜 #fp_matsuri / FP Matsuri 2025
ytaka23
9
2.4k
Introduction to Sansan for Engineers / エンジニア向け会社紹介
sansan33
PRO
5
38k
Featured
See All Featured
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
20
1.3k
Visualization
eitanlees
146
16k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.3k
Writing Fast Ruby
sferik
628
61k
GraphQLとの向き合い方2022年版
quramy
46
14k
The Language of Interfaces
destraynor
158
25k
Documentation Writing (for coders)
carmenintech
71
4.9k
Typedesign – Prime Four
hannesfritz
42
2.7k
Making Projects Easy
brettharned
116
6.2k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.5k
Docker and Python
trallard
44
3.4k
How to train your dragon (web standard)
notwaldorf
92
6.1k
Transcript
How to use scikit-learn to solve machine learning problems AutoML
Hackathon April 2015
Outline • Machine Learning refresher • scikit-learn • Demo: interactive
predictive modeling on Census Data with IPython notebook / pandas / scikit-learn • Combining models with Pipeline and parameter search
Predictive modeling ~= machine learning • Make predictions of outcome
on new data • Extract the structure of historical data • Statistical tools to summarize the training data into a executable predictive model • Alternative to hard-coded rules written by experts
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train)
type (category) # rooms (int) surface (float m2) public trans
(boolean) Apartment 3 50 TRUE House 5 254 FALSE Duplex 4 68 TRUE Apartment 2 32 TRUE sold (float k€) 450 430 712 234 features target samples (train) Apartment 2 33 TRUE House 4 210 TRUE samples (test) ? ?
Training text docs images sounds transactions Labels Machine Learning Algorithm
Model Predictive Modeling Data Flow Feature vectors
New text doc image sound transaction Model Expected Label Predictive
Modeling Data Flow Feature vector Training text docs images sounds transactions Labels Machine Learning Algorithm Feature vectors
Inventory forecasting & trends detection Predictive modeling in the wild
Personalized radios Fraud detection Virality and readers engagement Predictive maintenance Personality matching
• Library of Machine Learning algorithms • Focus on established
methods (e.g. ESL-II) • Open Source (BSD) • Simple fit / predict / transform API • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test)
Train data Train labels Model Fitted model Test data Predicted
labels Test labels Evaluation model = ModelClass(**hyperparams) model.fit(X_train, y_train) y_pred = model.predict(X_test) accuracy_score(y_test, y_pred)
Support Vector Machine from sklearn.svm import SVC model = SVC(kernel="rbf",
C=1.0, gamma=1e-4) model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Linear Classifier from sklearn.linear_model import SGDClassifier model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
model.fit(X_train, y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
Random Forests from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier(n_estimators=200) model.fit(X_train,
y_train) y_predicted = model.predict(X_test) from sklearn.metrics import f1_score f1_score(y_test, y_predicted)
None
None
Demo time! http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/ master/sklearn_demos/Income%20classification.ipynb https://github.com/ogrisel/notebooks
Combining Models from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train)
Pipeline from sklearn.preprocessing import StandardScaler from sklearn.decomposition import RandomizedPCA from
sklearn.svm import SVC from sklearn.pipeline import make_pipeline pipeline = make_pipeline( StandardScaler(), RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), ) pipeline.fit(X_train, y_train)
Scoring manually stacked models scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train)
pca = RandomizedPCA(n_components=10) X_train_pca = pca.fit_transform(X_train_scaled) svm = SVC(C=0.1, gamma=1e-3) svm.fit(X_train_pca, y_train) X_test_scaled = scaler.transform(X_test) X_test_pca = pca.transform(X_test_scaled) y_pred = svm.predict(X_test_pca) accuracy_score(y_test, y_pred)
Scoring a pipeline pipeline = make_pipeline( RandomizedPCA(n_components=10), SVC(C=0.1, gamma=1e-3), )
pipeline.fit(X_train, y_train) y_pred = pipeline.predict(X_test) accuracy_score(y_test, y_pred)
Parameter search import numpy as np from sklearn.grid_search import RandomizedSearchCV
params = { 'randomizedpca__n_components': [5, 10, 20], 'svc__C': np.logspace(-3, 3, 7), 'svc__gamma': np.logspace(-6, 0, 7), } search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5) search.fit(X_train, y_train) # search.best_params_, search.grid_scores_
Thank you! • http://scikit-learn.org • https://github.com/scikit-learn/scikit-learn @ogrisel