Slide 1

Slide 1 text

Learning scikit-learn: An Introduction to Machine Learning in Python
Sebastian Raschka
PyData Chicago 2016
Chicago, The University of Illinois • August 26, 2016

Slide 2

Slide 2 text

Sebastian Raschka, Learning scikit-learn
Links & Info
Contact:
o  E-mail: [email protected]
o  Website: http://sebastianraschka.com
o  Twitter: @rasbt
o  GitHub: rasbt
Tutorial Material on GitHub:
https://github.com/rasbt/pydata-chicago2016-ml-tutorial

Slide 3

Slide 3 text

Let’s Not Stress
This is an introductory tutorial, and we are here to learn.
Please ask questions!

Slide 4

Slide 4 text

What can Machine Learning do for us?
Image credits:
o  https://commons.wikimedia.org/wiki/File:Google_self_driving_car_at_the_Googleplex.jpg (Photo by Michael Shick, CC BY-SA 4.0)
o  https://flic.kr/p/5BLW6G [CC BY 2.0]

Slide 5

Slide 5 text

What is Machine Learning?
[Diagram: Inputs (observations) and Outputs (labels) → Computer → Program]
[Example: Emails and Spam/Non-Spam Labels → Classification Algorithm → Spam Filter]

Slide 6

Slide 6 text

3 Types of Learning
o  Supervised
o  Unsupervised
o  Reinforcement

Slide 7

Slide 7 text

Working with Labeled Data
Supervised Learning
o  Regression: predict y (“output”) from x (“input”)
o  Classification: predict the class (“?”) of a point from x1 (“input”) and x2 (“input”)

Slide 8

Slide 8 text

Working with Unlabeled Data
Unsupervised Learning
o  Clustering
o  Compression

Slide 9

Slide 9 text

Topics
1.  Introduction to Machine Learning
2.  Linear Regression
3.  Introduction to Classification
4.  Feature Preprocessing & scikit-learn Pipelines
5.  Dimensionality Reduction: Feature Selection & Extraction
6.  Model Evaluation & Hyperparameter Tuning

Slide 10

Slide 10 text

Simple Linear Regression
[Figure: y (response variable) plotted against x (explanatory variable), with a fitted line and a sample point (xi, yi)]
ŷ = w0 + w1 x
w0 (intercept)
w1 (slope) = Δy / Δx
vertical offset |ŷ − y|
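
The line ŷ = w0 + w1 x can be fit by minimizing the squared vertical offsets. The sketch below is a minimal illustration using the closed-form least-squares solution in NumPy; the function name and the toy data are assumptions made for this example and are not taken from the tutorial notebooks.

```python
import numpy as np


def fit_simple_linear_regression(x, y):
    """Return (w0, w1) that minimize the sum of squared vertical offsets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x.
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Hypothetical toy data lying exactly on y = 1 + 2x.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 1.0 + 2.0 * x_toy

w0, w1 = fit_simple_linear_regression(x_toy, y_toy)
y_hat = w0 + w1 * x_toy          # predictions on the training points
offsets = np.abs(y_hat - y_toy)  # vertical offsets |ŷ − y|
```

Because the toy points lie exactly on a line, the recovered intercept and slope match the generating values and all vertical offsets vanish; with noisy data the offsets would be nonzero.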

Slide 11

Slide 11 text

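Data Representation
$$
X = \begin{bmatrix}
x_{0,0} & x_{0,1} & \cdots & x_{0,m} \\
x_{1,0} & x_{1,1} & \cdots & x_{1,m} \\
x_{2,0} & x_{2,1} & \cdots & x_{2,m} \\
x_{3,0} & x_{3,1} & \cdots & x_{3,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,0} & x_{n,1} & \cdots & x_{n,m}
\end{bmatrix},
\qquad
y = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}
$$
Rows of X are samples; columns of X are features (x0, x1, …, xm).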

Slide 12

Slide 12 text

“Basic” Supervised Learning Workflow
[Figure: four-step workflow]
1.  Data and Labels are split into Training Data/Training Labels and Test Data/Test Labels.
2.  Learning Algorithm + Hyperparameter Values + Training Data/Training Labels → Model
3.  Model + Test Data → Prediction; Prediction vs. Test Labels → Performance
4.  Learning Algorithm + Hyperparameter Values + Data/Labels → Final Model

Slide 13

Slide 13 text

Code Examples
Jupyter Notebook

Slide 14

Slide 14 text

Topics
1.  Introduction to Machine Learning
2.  Linear Regression
3.  Introduction to Classification
4.  Feature Preprocessing & scikit-learn Pipelines
5.  Dimensionality Reduction: Feature Selection & Extraction
6.  Model Evaluation & Hyperparameter Tuning

Slide 15

Slide 15 text

Scikit-learn API
    class SupervisedEstimator(...):
        def __init__(self, hyperparam, ...):
            ...
        def fit(self, X, y):
            ...
            return self
        def predict(self, X):
            ...
            return y_pred
        def score(self, X, y):
            ...
            return score
        ...
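
To make the fit/predict/score interface concrete, here is a deliberately trivial estimator that follows the same method layout. `MajorityClassClassifier` is a hypothetical class written for this illustration; it is not part of scikit-learn and takes no hyperparameters.

```python
import numpy as np


class MajorityClassClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def __init__(self):
        self.majority_ = None  # learned in fit()

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self  # returning self allows chaining, e.g. Est().fit(X, y)

    def predict(self, X):
        # One prediction per row of X, all equal to the majority label.
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # Accuracy: fraction of predictions that match the true labels.
        return float(np.mean(self.predict(X) == np.asarray(y)))


X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1, 1, 1, 0])

clf = MajorityClassClassifier().fit(X_toy, y_toy)
y_pred = clf.predict(X_toy)
accuracy = clf.score(X_toy, y_toy)
```

The same three calls (`fit`, `predict`, `score`) work unchanged with scikit-learn's supervised estimators, which is the point of the shared interface.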

Slide 16

Slide 16 text

Iris Dataset
[Images: Iris-Setosa, Iris-Versicolor, Iris-Virginica]

Slide 17

Slide 17 text

Iris Dataset
X (samples as rows, features as columns) and y (class labels):
sample   sepal length [cm]   sepal width [cm]   petal length [cm]   petal width [cm]   y
1        5.1                 3.5                1.4                 0.2                setosa
2        4.9                 3.0                1.4                 0.2                setosa
50       6.4                 3.5                4.5                 1.2                versicolor
...
150      5.9                 3.0                5.0                 1.8                virginica
[Figure labels: sepal, petal]

Slide 18

Slide 18 text

Note about Non-Stratified Splits
o  training set → 38 x Setosa, 28 x Versicolor, 34 x Virginica
o  test set → 12 x Setosa, 22 x Versicolor, 16 x Virginica
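
A stratified split avoids the class imbalance shown above by preserving class proportions in both subsets. The following is a minimal sketch using scikit-learn's `train_test_split` with its `stratify` argument; the 100/50 split size mirrors the counts on the slide, while the `random_state` value is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class

# stratify=y keeps the class proportions (roughly) equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
```

Since 50 test samples cannot be divided evenly among three classes, the per-class counts differ by at most one rather than being exactly equal.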

Slide 19

Slide 19 text

Linear Regression Recap
[Diagram: input values (1, x1, x2, …, xm) and weight coefficients (w0 for the bias unit, w1, w2, …, wm) → net input function Σ (z) → activation function (a) → unit step function → predicted output ŷ]

Slide 20

Slide 20 text

Logistic Regression, a Generalized Linear Model
[Diagram: input values (1, x1, x2, …, xm) and weight coefficients (w0 for the bias unit, w1, w2, …, wm) → net input function Σ (z) → activation function (a), which gives the predicted probability → unit step function → predicted class label ŷ]
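
The forward pass in the diagram can be sketched in a few lines of NumPy. In logistic regression the activation function is the logistic sigmoid; the weight values, the toy inputs, and the 0.5 threshold below are assumptions chosen for illustration, not fitted values.

```python
import numpy as np


def net_input(X, w):
    # z = w0 * 1 + w1 * x1 + ... + wm * xm, with w[0] as the bias weight.
    return w[0] + np.dot(X, w[1:])


def sigmoid(z):
    # Logistic sigmoid: maps the net input z to a value in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, w):
    # Activation a: predicted probability of the positive class.
    return sigmoid(net_input(X, w))


def predict_label(X, w, threshold=0.5):
    # Unit step function applied to the predicted probability.
    return np.where(predict_proba(X, w) >= threshold, 1, 0)


w = np.array([-1.0, 2.0, 0.5])  # hypothetical weights: w0, w1, w2
X_toy = np.array([[0.0, 2.0],   # z =  0
                  [2.0, 0.0],   # z =  3
                  [-1.0, 0.0]]) # z = -3

proba = predict_proba(X_toy, w)
labels = predict_label(X_toy, w)
```

A net input of zero maps to a probability of exactly 0.5, which is why thresholding the probability at 0.5 is equivalent to thresholding z at 0.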

Slide 21

Slide 21 text

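A “Lazy Learner:” K-Nearest Neighbors Classifier
[Figure: feature space with axes x1 and x2; the neighbors of the query point “?” belong to three classes, with counts 3 ×, 1 ×, and 1 ×; the predicted class for “?” is the majority class among the neighbors]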
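
The majority vote in the figure can be written from scratch in a few lines. This is a minimal sketch assuming Euclidean distance and k = 5; the function name and the toy points are made up for this example, arranged so that the five nearest neighbors split 3/1/1 as in the figure.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical training set: five points near the origin, four far away.
X_train = np.array([
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0],  # class 0 (3 neighbors)
    [0.0, -1.5],                          # class 1 (1 neighbor)
    [1.5, 1.5],                           # class 2 (1 neighbor)
    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0],  # class 1, far away
])
y_train = np.array([0, 0, 0, 1, 2, 1, 1, 1, 1])
x_query = np.array([0.0, 0.0])

pred_k5 = knn_predict(X_train, y_train, x_query, k=5)
pred_k9 = knn_predict(X_train, y_train, x_query, k=9)
```

There is no training step beyond storing the data, which is what "lazy learner" refers to. The choice of k matters: with k = 9 the distant class-1 points outvote the nearby class-0 points.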

Slide 22

Slide 22 text

Code Examples
Jupyter Notebook

Slide 23

Slide 23 text

Topics
1.  Introduction to Machine Learning
2.  Linear Regression
3.  Introduction to Classification
4.  Feature Preprocessing & scikit-learn Pipelines
5.  Dimensionality Reduction: Feature Selection & Extraction
6.  Model Evaluation & Hyperparameter Tuning

Slide 24

Slide 24 text

Categorical Variables
color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1

Slide 25

Slide 25 text

Encoding Categorical Variables
color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1
Ordinal feature (size) mapped to integers:
size
0
2
1
Nominal feature (color) one-hot encoded:
red   blue   green
1     0      0
0     1      0
0     0      1
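
The two encodings in the table can be reproduced in plain Python. This is a minimal sketch: the dictionary `size_mapping` and the helper `encode` are names invented for this example, and the size order M < L < XL is the assumption behind the integers 0, 1, 2.

```python
# The three rows from the table above (prices as floats, without the "$").
rows = [
    {"color": "red", "size": "M", "price": 10.49, "class label": 0},
    {"color": "blue", "size": "XL", "price": 15.00, "class label": 1},
    {"color": "green", "size": "L", "price": 12.99, "class label": 1},
]

# Ordinal feature: sizes have a natural order, so integers are meaningful.
size_mapping = {"M": 0, "L": 1, "XL": 2}

# Nominal feature: colors have no order, so each gets its own 0/1 column.
colors = ["red", "blue", "green"]


def encode(row):
    """Return [red, blue, green, size, price] for one table row."""
    one_hot = [1 if row["color"] == c else 0 for c in colors]
    return one_hot + [size_mapping[row["size"]], row["price"]]


encoded = [encode(row) for row in rows]
```

Mapping colors to integers instead would impose an arbitrary order (for example red < blue < green) that a learning algorithm could misread as meaningful, which is why one-hot encoding is used for nominal features.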

Slide 26

Slide 26 text

Feature Normalization
feature   minmax (min-max scaling)   z-score (z-score standardization)
1.0       0.0                        -1.46385
2.0       0.2                        -0.87831
3.0       0.4                        -0.29277
4.0       0.6                         0.29277
5.0       0.8                         0.87831
6.0       1.0                         1.46385
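
Both columns of the table can be reproduced with NumPy. The sketch below assumes the z-scores were computed with the population standard deviation (NumPy's default, `ddof=0`), which matches the values shown.

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Min-max scaling: rescale to the range [0, 1].
minmax = (feature - feature.min()) / (feature.max() - feature.min())

# Z-score standardization: zero mean and unit (population) standard deviation.
zscore = (feature - feature.mean()) / feature.std()
```

Using the sample standard deviation (`ddof=1`) instead would give slightly smaller magnitudes than the table.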

Slide 27

Slide 27 text

Scikit-learn API
    class UnsupervisedEstimator(...):
        def __init__(self, ...):
            ...
        def fit(self, X):
            ...
            return self
        def transform(self, X):
            ...
            return X_transf
        def predict(self, X):
            ...
            return pred
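
As an illustration of the fit/transform half of this interface, here is a small from-scratch standardizer. `ToyStandardizer` is a hypothetical class written for this example (scikit-learn ships its own `StandardScaler`); the point is that the parameters are learned from the training data only and then reused on new data.

```python
import numpy as np


class ToyStandardizer:
    """Hypothetical transformer: learns mean and std in fit(), applies them in transform()."""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)  # per-feature mean of the training data
        self.std_ = X.std(axis=0)    # per-feature population standard deviation
        return self

    def transform(self, X):
        # Reuse the training statistics; never re-estimate them on new data.
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_


X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = ToyStandardizer().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

The transformed test set is generally not zero-mean or unit-variance, because it is scaled with the training statistics.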

Slide 28

Slide 28 text

Scikit-learn Pipelines
[Figure: a Pipeline chaining Scaling → Dimensionality Reduction → Learning Algorithm]
o  fit (Training data + Class labels): Scaling (fit & transform) → Dimensionality Reduction (fit & transform) → Learning Algorithm (fit) → Model
o  predict (Test data): Scaling (transform) → Dimensionality Reduction (transform) → Model (predict) → Class labels
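
The pipeline in the figure can be sketched with scikit-learn's `make_pipeline`. The specific components (`StandardScaler`, `PCA` with two components, `LogisticRegression`), the Iris data, and the split parameters are assumptions chosen to match the three boxes in the figure, not necessarily what the tutorial notebook uses.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Scaling -> dimensionality reduction -> learning algorithm.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# fit: each transformer is fit and applied to the training data in turn,
# then the final estimator is fit on the transformed training data.
pipe.fit(X_train, y_train)

# predict/score: the test data is only transformed with the parameters
# learned from the training data, then passed to the fitted model.
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
```

Wrapping the steps in one object makes it hard to leak test-set statistics into preprocessing by accident, since `fit` is only ever called on the training data.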

Slide 29

Slide 29 text

Code Examples
Jupyter Notebook

Slide 30

Slide 30 text

Topics
1.  Introduction to Machine Learning
2.  Linear Regression
3.  Introduction to Classification
4.  Feature Preprocessing & scikit-learn Pipelines
5.  Dimensionality Reduction: Feature Selection & Extraction
6.  Model Evaluation & Hyperparameter Tuning

Slide 31

Slide 31 text

Dimensionality Reduction – why?
[Figure: plots of feature values; all axes in cm]

Slide 32

Slide 32 text

Dimensionality Reduction – why?
o  predictive performance
o  storage & speed
o  visualization & interpretability

Slide 33

Slide 33 text

Recursive Feature Elimination
available features: [ f1 f2 f3 f4 ]
[ w1 w2 w3 w4 ] → fit model, remove lowest weight, repeat
[ w1 w2 w4 ] → fit model, remove lowest weight, repeat
[ w1 w4 ] → fit model, remove lowest weight, repeat
[ w4 ]
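
scikit-learn provides this procedure as `sklearn.feature_selection.RFE`. The sketch below is one possible setup: the choice of logistic regression as the underlying model, the standardization step, and the Iris data are assumptions for illustration. Features are standardized first so that the coefficient magnitudes are comparable across features.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# step=1 removes one feature per round; stop when a single feature is left.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=1,
    step=1,
)
rfe.fit(X_std, y)

ranking = rfe.ranking_.tolist()  # 1 = kept until the end, 4 = eliminated first
support = rfe.support_.tolist()  # True for the single remaining feature
```

With four features and one removal per round, `ranking_` is a permutation of 1 to 4 that records the elimination order.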

Slide 34

Slide 34 text

Sequential Feature Selection
available features: [ f1 f2 f3 f4 ]
[ f1 ] [ f2 ] [ f3 ] [ f4 ] → fit model, pick best, repeat
[ f3 f1 ] [ f3 f2 ] [ f3 f4 ] → fit model, pick best, repeat
[ f3 f2 f1 ] [ f3 f2 f4 ]
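
The greedy forward search above can be sketched as a short loop. This is a from-scratch illustration, not a library implementation: `forward_selection` is a hypothetical helper, and scoring candidates by mean 5-fold cross-validation accuracy with a k-nearest neighbors classifier is an assumption made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_selection(estimator, X, y, n_features, cv=5):
    """Greedily add the feature that most improves the cross-validated score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for f in remaining:
            candidate = selected + [f]
            # Fit and evaluate a model on the candidate feature subset.
            scores[f] = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
        best = max(scores, key=scores.get)  # pick best ...
        selected.append(best)
        remaining.remove(best)              # ... and repeat
    return selected


X, y = load_iris(return_X_y=True)
selected = forward_selection(KNeighborsClassifier(n_neighbors=5), X, y, n_features=2)
```

Unlike recursive feature elimination, this approach only needs a performance score, so it works with models that expose no weights or feature importances.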

Slide 35

Slide 35 text

Principal Component Analysis
[Figure: data in the original feature space with axes x1 and x2, and the principal component axes PC1 and PC2]
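
The principal component axes in the figure are the eigenvectors of the data's covariance matrix. The sketch below computes them from scratch with NumPy on synthetic two-feature data; the data-generating parameters and the random seed are assumptions for illustration.

```python
import numpy as np

# Hypothetical correlated 2D data: x2 is roughly 0.5 * x1 plus small noise.
rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix (eigh: symmetric matrices).
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance: column 0 is PC1, column 1 is PC2.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_variance_ratio = eigvals / eigvals.sum()

# 4. Project the data onto the principal component axes.
X_pca = X_centered @ eigvecs
```

Keeping only the first column of `X_pca` would reduce the data to one dimension while retaining most of its variance; in the new coordinate system the two components are uncorrelated.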

Slide 36

Slide 36 text

Code Examples
Jupyter Notebook

Slide 37

Slide 37 text

Topics
1.  Introduction to Machine Learning
2.  Linear Regression
3.  Introduction to Classification
4.  Feature Preprocessing & scikit-learn Pipelines
5.  Dimensionality Reduction: Feature Selection & Extraction
6.  Model Evaluation & Hyperparameter Tuning

Slide 38

Slide 38 text

“Basic” Supervised Learning Workflow
[Figure: the four-step workflow from Slide 12, repeated: split into training and test sets, fit a Model, estimate Performance on the test set, fit the Final Model on all Data/Labels]

Slide 39

Slide 39 text

Bias and Variance

Slide 40

Slide 40 text

Learning Curves

Slide 41

Slide 41 text

Repeated Holdout

Slide 42

Slide 42 text

Holdout and Hyperparameter Tuning I
[Figure: workflow steps 1 to 3]
o  Data and Labels are split into Training Data/Labels, Validation Data/Labels, and Test Data/Labels.
o  Learning Algorithm + Training Data/Labels, once per set of Hyperparameter values → one Model each
o  Each Model + Validation Data → Prediction; Prediction vs. Validation Labels → Performance
o  Best Performance → Best Model and Best Hyperparameter values

Slide 43

Slide 43 text

Holdout and Hyperparameter Tuning II
[Figure: workflow steps 4 to 6]
o  Learning Algorithm + Best Hyperparameter Values + Training Data/Labels and Validation Data/Labels → Model
o  Model + Test Data → Prediction; Prediction vs. Test Labels → Performance
o  Learning Algorithm + Best Hyperparameter Values + Data/Labels → Final Model
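
The three-way holdout procedure can be sketched end to end with scikit-learn. Everything specific here is an assumption for illustration: the Iris data, the 60/20/20 split, the k-nearest neighbors classifier, and the candidate values for its `n_neighbors` hyperparameter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split into training (90), validation (30), and test (30) sets.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

# Fit one model per hyperparameter value; compare them on the validation set.
candidates = [1, 3, 5, 7]
valid_scores = {}
for k in candidates:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    valid_scores[k] = model.score(X_valid, y_valid)
best_k = max(valid_scores, key=valid_scores.get)

# Refit on training + validation data with the best value, then use the
# untouched test set to estimate generalization performance.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_temp, y_temp)
test_accuracy = model.score(X_test, y_test)

# Final model for deployment: fit on all available data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
```

The test set is used exactly once, after the hyperparameter has been chosen; reusing it for model selection would make the performance estimate optimistic.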

Slide 44

Slide 44 text

K-fold Cross-Validation
[Figure: K Iterations (K-Folds), 1st to 5th; in each iteration one fold serves as the Validation Fold and the others as Training Folds. Learning Algorithm + Hyperparameter Values + Training Fold Data/Labels → Model; Model + Validation Fold Data → Prediction; Prediction vs. Validation Fold Labels → Performance. The iterations yield Performance 1 to Performance 5.]
$\text{Performance} = \frac{1}{10} \sum_{i=1}^{10} \text{Performance}_i$
This work by Sebastian Raschka is licensed under a Creative Commons Attribution 4.0 International License.
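
The fold-by-fold loop can be written out explicitly with scikit-learn's `KFold`. This sketch uses K = 5 to match the five iterations listed in the figure; the classifier, the Iris data, and the shuffling seed are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Shuffle before splitting because the Iris samples are sorted by class.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, valid_idx in kfold.split(X):
    # Fit on the training folds, evaluate on the held-out validation fold.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[valid_idx], y[valid_idx]))

# The cross-validation estimate is the average over the K folds.
cv_performance = np.mean(fold_scores)
```

Every sample is used for validation exactly once, so the averaged estimate depends less on one particular split than a single holdout estimate does.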

Slide 45

Slide 45 text

K-fold Cross-Validation Workflow I
[Figure: workflow steps 1 to 3]
o  Data and Labels are split into Training Data/Training Labels and Test Data/Test Labels.
o  Learning Algorithm + Training Data/Labels, once per set of Hyperparameter values → one Model each
o  Learning Algorithm + Best Hyperparameter Values + Training Data/Labels → Model

Slide 46

Slide 46 text

K-fold Cross-Validation Workflow II
[Figure: workflow steps 3 to 5]
o  Learning Algorithm + Best Hyperparameter Values + Training Data/Labels → Model
o  Model + Test Data → Prediction; Prediction vs. Test Labels → Performance
o  Learning Algorithm + Best Hyperparameter Values + Data/Labels → Final Model
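
scikit-learn's `GridSearchCV` covers most of this workflow in one object. The sketch below is one way to set it up; the classifier, the hyperparameter grid, the 5-fold setting, and the split parameters are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that plays no part in hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Evaluate each hyperparameter value with 5-fold cross-validation on the
# training set. With the default refit=True, the best setting is then
# refit on the complete training set.
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
best_params = search.best_params_

# Estimate generalization performance on the held-out test set.
test_accuracy = search.score(X_test, y_test)

# Final model: the best hyperparameters, fit on all available data.
final_model = KNeighborsClassifier(**best_params).fit(X, y)
```

Compared with a single validation set, cross-validating each candidate uses the training data more efficiently, at the cost of fitting K models per hyperparameter setting.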

Slide 47

Slide 47 text

Code Examples
Jupyter Notebook

Slide 48

Slide 48 text

Performance Metrics
http://scikit-learn.org/stable/modules/model_evaluation.html

Slide 49

Slide 49 text

Further Resources
Documentation: http://scikit-learn.org
Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn

Slide 50

Slide 50 text

Further Resources
Andreas’ “cheat sheet”
http://scikit-learn.org/stable/tutorial/machine_learning_map/

Slide 51

Slide 51 text

Further Resources
Great “math-free,” practical guide to machine learning with scikit-learn
By Andreas Mueller (scikit-learn core developer) and Sarah Guido
http://shop.oreilly.com/product/0636920030515.do
Estimated release: October 20, 2016

Slide 52

Slide 52 text

Further Resources
My favorite machine learning “math & theory” books
o  http://statweb.stanford.edu/~tibs/ElemStatLearn/ (free PDF)
o  http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html

Slide 53

Slide 53 text

Further Resources
My own book. Some math, “from-scratch” code, and practical scikit-learn examples
GitHub repository: https://github.com/rasbt/python-machine-learning-book
Amazon link: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/

Slide 54

Slide 54 text

Thanks for attending!
Contact:
o  E-mail: [email protected]
o  Website: http://sebastianraschka.com
o  Twitter: @rasbt
o  GitHub: rasbt
Link to the material:
https://github.com/rasbt/pydata-chicago2016-ml-tutorial