Intro to scikit-learn
Olivier Grisel
August 27, 2017
EuroScipy 2017
Transcript
Intro to scikit-learn
EuroScipy 2017 - Olivier Grisel - Tim Head
Outline
• Machine Learning refresher
• scikit-learn
• Hands on: interactive predictive modeling on Census Data with Jupyter notebook / pandas / scikit-learn
• Hands on: parameter tuning with scikit-optimize
Predictive modeling ~= machine learning
• Make predictions of outcomes on new data
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model
• Alternative to hard-coded rules written by experts
                 type        # rooms  surface     public trans  sold
                 (category)  (int)    (float m2)  (boolean)     (float k€)
samples (train)  Apartment   3        50          TRUE          450
                 House       5        254         FALSE         430
                 Duplex      4        68          TRUE          712
                 Apartment   2        32          TRUE          234
samples (test)   Apartment   2        33          TRUE          ?
                 House       4        210         TRUE          ?

features: type, # rooms, surface, public trans
target: sold
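The table above maps directly onto a pandas DataFrame plus a scikit-learn estimator. The following is a minimal sketch under stated assumptions: the column names (`rooms`, `surface`, `public_trans`, `sold`), the one-hot encoding of `type` and the choice of `LinearRegression` are illustrative and do not come from the slides. With only four training samples the predictions carry no real meaning.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Training samples copied from the table; column names are illustrative.
train = pd.DataFrame({
    "type": ["Apartment", "House", "Duplex", "Apartment"],
    "rooms": [3, 5, 4, 2],
    "surface": [50.0, 254.0, 68.0, 32.0],
    "public_trans": [True, False, True, True],
    "sold": [450.0, 430.0, 712.0, 234.0],  # target, in k€
})

# Test samples: same features, target unknown.
test = pd.DataFrame({
    "type": ["Apartment", "House"],
    "rooms": [2, 4],
    "surface": [33.0, 210.0],
    "public_trans": [True, True],
})

feature_names = ["type", "rooms", "surface", "public_trans"]

# One-hot encode the categorical column. Train and test are encoded together
# so that both end up with exactly the same columns.
all_features = pd.get_dummies(
    pd.concat([train[feature_names], test], ignore_index=True),
    columns=["type"],
).astype(float)

X_train = all_features.iloc[:len(train)]
X_test = all_features.iloc[len(train):]
y_train = train["sold"]

# Assumption: a plain linear model, chosen only to show the workflow.
model = LinearRegression().fit(X_train, y_train)
predicted_prices = model.predict(X_test)
print(predicted_prices)
```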
Predictive Modeling Data Flow
Training: text docs, images, sounds, transactions -> feature vectors + labels -> machine learning algorithm -> model
Prediction: new text doc, image, sound, transaction -> feature vector -> model -> expected label
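The data flow can be sketched for the text case as follows. The toy documents, the spam/ham labels and the choice of `CountVectorizer` with `LogisticRegression` are assumptions made for illustration; the slides do not prescribe a particular feature extractor or algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Training text docs and their labels (hypothetical toy data).
train_docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "project status and agenda",
]
train_labels = ["spam", "ham", "spam", "ham"]

# Text docs -> feature vectors (word counts).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Feature vectors + labels -> machine learning algorithm -> model.
model = LogisticRegression().fit(X_train, train_labels)

# New text doc -> feature vector -> model -> expected label.
# The new document must go through the *same* fitted vectorizer.
new_doc = ["buy cheap pills"]
X_new = vectorizer.transform(new_doc)
expected_label = model.predict(X_new)[0]
print(expected_label)
```

Reusing the fitted vectorizer at prediction time keeps the feature vector of the new document in the same column layout as the training matrix.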
Predictive modeling in the wild
• Inventory forecasting & trends detection
• Personalized radios
• Fraud detection
• Virality and readers engagement
• Predictive maintenance
• Personality matching
• Library of Machine Learning algorithms
• Focus on established methods (e.g. ESL-II)
• Open Source (BSD)
• Simple fit / predict / transform API
• Python / NumPy / SciPy / Cython
• Model Assessment, Selection & Ensembles
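As a rough illustration of the fit / predict / transform API listed above, the sketch below uses one transformer and one predictor. The synthetic data and the choice of `StandardScaler` and `LogisticRegression` are assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 3 features, label depends on the first feature.
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = (X[:, 0] > 5.0).astype(int)

# Transformer: fit() learns statistics (per-feature mean and standard
# deviation), transform() applies them to data.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit() learns from features and labels, predict() outputs labels.
clf = LogisticRegression()
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled)

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Every estimator exposes `fit`; transformers add `transform` and predictors add `predict`, which is what lets the same objects be chained later in a pipeline.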
Train data + Train labels -> Model -> Fitted model
Test data -> Fitted model -> Predicted labels
Predicted labels + Test labels -> Evaluation

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)          # train data + train labels -> fitted model
    y_pred = model.predict(X_test)       # test data -> predicted labels
    accuracy_score(y_test, y_pred)       # predicted labels vs. test labels
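The slide code assumes `X_train`, `X_test`, `y_train` and `y_test` already exist. A self-contained sketch of the same train / predict / evaluate loop might look like this; the iris dataset, the 75/25 split and `max_iter=1000` are assumptions added so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: iris stands in for whatever dataset is being modeled.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples so evaluation uses data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_iter is raised from its default only to avoid convergence warnings.
model = LogisticRegression(C=1, max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)
```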
Support Vector Machine

    from sklearn.svm import SVC
    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
Linear Classifier

    from sklearn.linear_model import SGDClassifier
    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
Random Forests

    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
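The three slides above differ only in the estimator, so the shared API allows the comparison to be written as a loop. This is a sketch on synthetic binary data from `make_classification`, which is an assumption; the hyperparameters are the ones shown on the slides and are not tuned for this data, so the scores are not meaningful in themselves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: a small synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same hyperparameters as on the slides; random_state added for repeatability.
models = {
    "SVC": SVC(kernel="rbf", C=1.0, gamma=1e-4),
    "SGDClassifier": SGDClassifier(
        alpha=1e-4, penalty="elasticnet", random_state=0
    ),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=200, random_state=0
    ),
}

# Every estimator is trained and scored through the same fit / predict calls.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    scores[name] = f1_score(y_test, y_predicted)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```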
Workshop time! https://github.com/ogrisel/euroscipy_2017_sklearn
Combining Models

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    # Each step is fitted on the output of the previous one
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    pca = PCA(n_components=10, svd_solver="randomized")
    X_train_pca = pca.fit_transform(X_train_scaled)

    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)
Pipeline

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    # One estimator that scales, projects and classifies in sequence
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=10, svd_solver="randomized"),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
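Scoring manually stacked models

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    pca = PCA(n_components=10, svd_solver="randomized")
    X_train_pca = pca.fit_transform(X_train_scaled)
    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)

    # The test data must go through the same fitted steps: transform only, no refit
    X_test_scaled = scaler.transform(X_test)
    X_test_pca = pca.transform(X_test_scaled)
    y_pred = svm.predict(X_test_pca)
    accuracy_score(y_test, y_pred)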
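The manual chaining above is the bookkeeping that a pipeline takes over. The sketch below checks that both routes give the same predictions. It rests on several assumptions: the digits dataset, a fixed `random_state`, and `PCA(svd_solver="randomized")` standing in for `RandomizedPCA`, which current scikit-learn releases no longer ship.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the digits dataset (64 features) as example data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Route 1: manual chaining, fitting each step on the previous step's output.
scaler = StandardScaler()
pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
svm = SVC(C=0.1, gamma=1e-3)
X_train_pca = pca.fit_transform(scaler.fit_transform(X_train))
svm.fit(X_train_pca, y_train)
y_pred_manual = svm.predict(pca.transform(scaler.transform(X_test)))

# Route 2: the same steps wrapped in a single pipeline estimator.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, svd_solver="randomized", random_state=0),
    SVC(C=0.1, gamma=1e-3),
)
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)

same_predictions = np.array_equal(y_pred_manual, y_pred_pipeline)
print(same_predictions)
```

The pipeline version leaves no room to forget a `transform` call on the test data or to refit a step on it by mistake.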
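Scoring a pipeline

    from sklearn.decomposition import PCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score

    pipeline = make_pipeline(
        PCA(n_components=10, svd_solver="randomized"),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy_score(y_test, y_pred)

Parameter search

    import numpy as np
    from sklearn.model_selection import RandomizedSearchCV

    # Keys have the form "<step name>__<parameter>"; make_pipeline names
    # each step after its lowercased class name.
    params = {
        'pca__n_components': [5, 10, 20],
        'svc__C': np.logspace(-3, 3, 7),
        'svc__gamma': np.logspace(-6, 0, 7),
    }
    search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5)
    search.fit(X_train, y_train)
    # search.best_params_, search.cv_results_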
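A self-contained variant of the parameter search, with the results inspected afterwards, could look like the sketch below. The digits dataset, the reduced budget (`n_iter=8`, `cv=3`, chosen only to keep the run short) and the fixed `random_state` are assumptions. It imports from `sklearn.model_selection` and uses `PCA(svd_solver="randomized")`, since current scikit-learn releases provide neither `sklearn.grid_search` nor `RandomizedPCA`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumption: the digits dataset as example data.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    PCA(svd_solver="randomized", random_state=0),
    SVC(),
)

# Parameter names are "<step name>__<parameter>".
params = {
    "pca__n_components": [5, 10, 20],
    "svc__C": np.logspace(-3, 3, 7),
    "svc__gamma": np.logspace(-6, 0, 7),
}

# Sample 8 random parameter combinations, each scored by 3-fold cross-validation.
search = RandomizedSearchCV(
    pipeline, params, n_iter=8, cv=3, random_state=0
)
search.fit(X_train, y_train)

# Best combination found, and its cross-validated score.
print(search.best_params_, search.best_score_)

# Per-candidate results, best first.
order = np.argsort(search.cv_results_["rank_test_score"])
for i in order[:3]:
    print(search.cv_results_["mean_test_score"][i],
          search.cv_results_["params"][i])

# With the default refit=True the best pipeline is refitted on all training data.
test_score = search.score(X_test, y_test)
print(test_score)
```

Scoring on the held-out test set is kept separate from the cross-validation scores, because the latter were used to pick the parameters.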
Thank you!
• http://scikit-learn.org
• https://github.com/scikit-learn/scikit-learn
@ogrisel