Learning scikit-learn -- An Introduction to Machine Learning in Python @ PyData Chicago 2016

This tutorial will teach you the basics of scikit-learn. I will give you a brief overview of the basic concepts of classification and regression analysis and show how to build powerful predictive models from labeled data. The accompanying GitHub repository can be found at https://github.com/rasbt/pydata-chicago2016-ml-tutorial.

Sebastian Raschka

August 24, 2016

Transcript

  1. Learning scikit-learn -- An Introduction to Machine Learning in Python

    Sebastian Raschka. PyData Chicago 2016 (Chicago, The University of Illinois), August 26, 2016.
  2. Sebastian Raschka, Learning scikit-learn: Links & Info

    Contact:
    o E-mail: [email protected]
    o Website: http://sebastianraschka.com
    o Twitter: @rasbt
    o GitHub: rasbt
    Tutorial material on GitHub: https://github.com/rasbt/pydata-chicago2016-ml-tutorial
  3. Let's Not Stress

    This is an introductory tutorial, and we are here to learn. Please ask questions!
  4. What can Machine Learning do for us?

    Image credits: https://commons.wikimedia.org/wiki/File:Google_self_driving_car_at_the_Googleplex.jpg (photo by Michael Shick, CC BY-SA 4.0); https://flic.kr/p/5BLW6G (CC BY 2.0)
  5. What is Machine Learning?

    Inputs (observations) + Outputs (labels) → Computer → Program
    Emails + Spam/Non-Spam Labels → Classification Algorithm → Spam Filter
  6. Working with Labeled Data

    Supervised learning.
    [Figure, regression panel: y ("output") plotted against x ("input"); "?" marks a value to predict.]
    [Figure, classification panel: x2 ("input") plotted against x1 ("input"); "?" marks a point whose class is to be predicted.]
  7. Topics

    1. Introduction to Machine Learning
    2. Linear Regression
    3. Introduction to Classification
    4. Feature Preprocessing & scikit-learn Pipelines
    5. Dimensionality Reduction: Feature Selection & Extraction
    6. Model Evaluation & Hyperparameter Tuning
  8. Simple Linear Regression

    [Figure: scatter plot of points (xi, yi) with a fitted line; x is the explanatory variable, y is the response variable.]
    Model: ŷ = w0 + w1 x, where w0 is the intercept and w1 = Δy / Δx is the slope.
    Vertical offset: |ŷ − y|
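
The line ŷ = w0 + w1 x can be fitted by ordinary least squares, which minimizes the squared vertical offsets. The following is a minimal sketch using the closed-form solution in NumPy; the toy data are hypothetical and not taken from the tutorial.

```python
import numpy as np

# Hypothetical toy data: points lying close to the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Closed-form least-squares estimates for y_hat = w0 + w1 * x.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
w0 = y.mean() - w1 * x.mean()                                               # intercept

y_hat = w0 + w1 * x          # predictions on the training points
offsets = np.abs(y_hat - y)  # vertical offsets |y_hat - y|

print(f"intercept w0 = {w0:.2f}, slope w1 = {w1:.2f}")
```

In scikit-learn, `LinearRegression().fit(x.reshape(-1, 1), y)` should give the same slope and intercept through its `coef_` and `intercept_` attributes.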
  9. Data Representation

    Feature matrix X: samples are rows, features (x0, x1, ..., xm) are columns.

        X = [ x0,0  x0,1  ...  x0,m ]
            [ x1,0  x1,1  ...  x1,m ]
            [ x2,0  x2,1  ...  x2,m ]
            [ x3,0  x3,1  ...  x3,m ]
            [  .     .           .  ]
            [ xn,0  xn,1  ...  xn,m ]

    Target vector y, one entry per sample:

        y = [ y0, y1, y2, y3, ..., yn ]
  10. "Basic" Supervised Learning Workflow

    [Figure: four-step workflow]
    1. Data + Labels are split into Training Data + Training Labels and Test Data + Test Labels.
    2. Learning Algorithm + Hyperparameter Values + Training Data + Training Labels → Model
    3. Model + Test Data → Prediction; Prediction + Test Labels → Performance
    4. Learning Algorithm + Hyperparameter Values + Data + Labels → Final Model
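
The four workflow steps can be sketched with scikit-learn as follows. This is only an illustration under assumptions: the Iris data, the k-nearest neighbors classifier, the 70/30 split, and `n_neighbors=5` are choices made here, not necessarily those used in the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split data and labels into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 2: learning algorithm + hyperparameter values + training data -> model.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Step 3: predict on the test data and compare with the test labels.
test_accuracy = model.score(X_test, y_test)

# Step 4: refit on all data with the same hyperparameters -> final model.
final_model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(f"test accuracy: {test_accuracy:.3f}")
```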
  11. Topics

    Next: 3. Introduction to Classification
  12. Scikit-learn API

        class SupervisedEstimator(...):

            def __init__(self, hyperparam, ...):
                ...

            def fit(self, X, y):
                ...
                return self

            def predict(self, X):
                ...
                return y_pred

            def score(self, X, y):
                ...
                return score

            ...
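
To make the estimator interface above concrete, here is a toy classifier that follows the same `fit` / `predict` / `score` pattern. `MajorityClassClassifier` is a hypothetical name invented for this sketch; it is not part of scikit-learn and it has no hyperparameters.

```python
import numpy as np


class MajorityClassClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        # Learned state is stored with a trailing underscore, as scikit-learn does.
        self.majority_ = labels[np.argmax(counts)]
        return self  # returning self allows chaining: Estimator().fit(X, y).predict(X)

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # Classification accuracy: fraction of correct predictions.
        return float(np.mean(self.predict(X) == np.asarray(y)))


# Hypothetical toy data: three of the four samples belong to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 1, 1])

clf = MajorityClassClassifier().fit(X, y)
y_pred = clf.predict(X)
accuracy = clf.score(X, y)
```

Real scikit-learn estimators usually also inherit from `BaseEstimator` so that they work with tools such as grid search; that is left out here to keep the sketch short.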
  13. Iris Dataset

    X = feature matrix; samples are rows, features are columns:

        sample  sepal length [cm]  sepal width [cm]  petal length [cm]  petal width [cm]
        1       5.1                3.5               1.4                0.2
        2       4.9                3.0               1.4                0.2
        50      6.4                3.5               4.5                1.2
        ...
        150     5.9                3.0               5.0                1.8

    y = class labels: setosa, setosa, versicolor, ..., virginica

    [Figure: iris flower with sepal and petal labeled.]
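
A short sketch of how the Iris data can be loaded as a feature matrix X and a label vector y with scikit-learn's bundled copy of the dataset. Note that `load_iris` encodes the class labels as integers 0, 1, 2 rather than as strings.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)              # (number of samples, number of features)
print(iris.feature_names)   # column names of X
print(iris.target_names)    # class names for the integer labels in y
print(X[0], y[0])           # first sample and its label
```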
  14. Note about Non-Stratified Splits

    • training set: 38 x Setosa, 28 x Versicolor, 34 x Virginica
    • test set: 12 x Setosa, 22 x Versicolor, 16 x Virginica
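
The imbalance above is what a purely random split can produce. A minimal sketch of the stratified alternative, assuming scikit-learn's `train_test_split` with its `stratify` argument; the 70/30 split and `random_state` are choices made for this example, so the counts differ from those on the slide.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print("training set:", train_counts)
print("test set:    ", test_counts)
```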
  15. Linear Regression Recap

    [Figure: single-unit diagram. The input values 1 (bias unit), x1, x2, ..., xm are multiplied by the weight coefficients w0, w1, w2, ..., wm and summed (Σ) by the net input function (z), followed by the activation function (a) and the predicted output ŷ. A unit step function also appears among the diagram labels.]
  16. Logistic Regression, a Generalized Linear Model

    [Figure: single-unit diagram. The input values 1 (bias unit), x1, x2, ..., xm are multiplied by the weight coefficients w0, w1, w2, ..., wm and summed (Σ) by the net input function (z). The activation function (a) gives the predicted probability, and the unit step function gives the predicted class label ŷ.]
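
The chain net input → activation → unit step can be reproduced by hand from a fitted scikit-learn model. This is a sketch under assumptions: a binary problem built from two Iris classes, with the logistic sigmoid as the activation function and 0.5 as the step threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Keep versicolor and virginica only, relabeled as 0 and 1, for a binary problem.
mask = y != 0
X, y = X[mask], (y[mask] == 2).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

z = X @ clf.coef_.ravel() + clf.intercept_[0]  # net input: w0 + w1*x1 + ... + wm*xm
proba = 1.0 / (1.0 + np.exp(-z))               # activation: logistic sigmoid -> probability
labels = (proba > 0.5).astype(int)             # unit step: threshold -> class label
```

For this binary setup the hand-computed values should agree with `clf.predict_proba(X)[:, 1]` and `clf.predict(X)`.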
  17. Topics

    Next: 4. Feature Preprocessing & scikit-learn Pipelines
  18. Categorical Variables

        color  size  price   class label
        red    M     $10.49  0
        blue   XL    $15.00  1
        green  L     $12.99  1
  19. Encoding Categorical Variables

        color  size  price   class label
        red    M     $10.49  0
        blue   XL    $15.00  1
        green  L     $12.99  1

    Ordinal feature (size) mapped to integers:

        size
        0
        2
        1

    Nominal feature (color) one-hot encoded:

        red  blue  green
        1    0     0
        0    1     0
        0    0     1
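
The two encodings can be reproduced with scikit-learn's `OrdinalEncoder` and `OneHotEncoder`. This is a sketch, not necessarily the approach taken in the tutorial notebooks; the category orders are passed explicitly (an assumption made here) so that the output matches M=0, L=1, XL=2 and the column order red, blue, green.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# The color and size columns from the table, one row per sample.
color = np.array([["red"], ["blue"], ["green"]], dtype=object)
size = np.array([["M"], ["XL"], ["L"]], dtype=object)

# Ordinal feature: sizes have a natural order, so map them to integers.
size_encoder = OrdinalEncoder(categories=[["M", "L", "XL"]])
size_encoded = size_encoder.fit_transform(size)

# Nominal feature: colors have no order, so use one indicator column per color.
color_encoder = OneHotEncoder(categories=[["red", "blue", "green"]])
color_encoded = color_encoder.fit_transform(color).toarray()

print(size_encoded.ravel())
print(color_encoded)
```

Without the explicit `categories` argument both encoders would sort the categories alphabetically, which would give a different integer mapping and column order.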
  20. Feature Normalization

        feature  minmax  z-score
        1.0      0.0     -1.46385
        2.0      0.2     -0.87831
        3.0      0.4     -0.29277
        4.0      0.6      0.29277
        5.0      0.8      0.87831
        6.0      1.0      1.46385

    minmax: min-max scaling; z-score: z-score standardization
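
Both columns of the table can be recomputed in a few lines of NumPy. The z-scores in the table correspond to the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Min-max scaling: rescale the feature to the range [0, 1].
minmax = (feature - feature.min()) / (feature.max() - feature.min())

# Z-score standardization: subtract the mean, divide by the standard deviation.
zscore = (feature - feature.mean()) / feature.std()

for f, m, z in zip(feature, minmax, zscore):
    print(f"{f:.1f}  {m:.1f}  {z: .5f}")
```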
  21. Scikit-learn API

        class UnsupervisedEstimator(...):

            def __init__(self, ...):
                ...

            def fit(self, X):
                ...
                return self

            def transform(self, X):
                ...
                return X_transf

            def predict(self, X):
                ...
                return pred
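
`StandardScaler` is one estimator that follows the `fit` / `transform` part of this interface. The sketch below, with hypothetical toy data, shows the usual pattern: the parameters are estimated on the training data only and then reused to transform new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # hypothetical training feature
X_test = np.array([[3.5], [7.0]])                                # hypothetical unseen samples

scaler = StandardScaler()
scaler.fit(X_train)                      # learns mean_ and scale_ from the training data
X_train_std = scaler.transform(X_train)  # applies the learned parameters
X_test_std = scaler.transform(X_test)    # reuses the training mean and standard deviation

print(scaler.mean_, scaler.scale_)
print(X_test_std.ravel())
```

Calling `fit` on the test data as well would leak information about the test set into the preprocessing step.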
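  22. Scikit-learn Pipelines

    [Figure: a Pipeline chaining Scaling → Dimensionality Reduction → Learning Algorithm.]
    fit: Training data + Class labels enter the pipeline; Scaling and Dimensionality Reduction each run fit & transform, the Learning Algorithm runs fit and yields the Model.
    predict: Test data enters the pipeline; Scaling and Dimensionality Reduction each run transform, the Model runs predict and yields Class labels.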
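
A pipeline of this shape can be assembled with `make_pipeline`. The concrete steps below (standardization, PCA with two components, logistic regression) and the Iris data are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scaling -> dimensionality reduction -> learning algorithm.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000))

# fit: the first two steps run fit & transform, the last step runs fit.
pipe.fit(X_train, y_train)

# predict: the first two steps run transform only, the last step runs predict.
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the scaler and PCA are fitted inside the pipeline, they only ever see the training data during `fit`.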
  23. Topics

    Next: 5. Dimensionality Reduction: Feature Selection & Extraction
  24. Dimensionality Reduction – why?

    • predictive performance
    • storage & speed
    • visualization & interpretability
  25. Recursive Feature Elimination

    available features: [ f1 f2 f3 f4 ]

        [ w1 w2 w3 w4 ]   fit model, remove lowest weight, repeat
        [ w1 w2 w4 ]      fit model, remove lowest weight, repeat
        [ w1 w4 ]         fit model, remove lowest weight, repeat
        [ w4 ]
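
scikit-learn provides this procedure as `RFE`. In the sketch below the estimator (logistic regression) and the dataset (Iris) are assumptions; with `step=1` and `n_features_to_select=1`, one feature is dropped per round until a single feature remains.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Fit the model, drop the feature with the smallest weight magnitude, and repeat.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=1, step=1)
rfe.fit(X, y)

# ranking_ is 1 for the surviving feature; larger ranks were eliminated earlier.
print("ranking:", rfe.ranking_)
print("selected mask:", rfe.support_)
```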
  26. Sequential Feature Selection

    available features: [ f1 f2 f3 f4 ]

        [ f1 ]  [ f2 ]  [ f3 ]  [ f4 ]           fit model, pick best, repeat
        [ f3 f1 ]  [ f3 f2 ]  [ f3 f4 ]          fit model, pick best, repeat
        [ f3 f2 f1 ]  [ f3 f2 f4 ]
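
The greedy forward search can be written as a short loop. This is a hand-rolled sketch, not a library implementation; the classifier, the 5-fold cross-validated accuracy used to "pick best", and the stopping point of three features are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

selected = []                           # feature indices chosen so far
remaining = list(range(X.shape[1]))     # feature indices still available
history = []                            # (subset, score) after each round

while len(selected) < 3:
    # Fit and score every candidate subset that adds one remaining feature.
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    best = max(scores, key=scores.get)  # pick the best candidate
    selected.append(best)
    remaining.remove(best)
    history.append((list(selected), scores[best]))

for subset, score in history:
    print(subset, f"{score:.3f}")
```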
  27. Topics

    Next: 6. Model Evaluation & Hyperparameter Tuning
  28. "Basic" Supervised Learning Workflow

    [Figure: the same four-step workflow diagram shown earlier in the deck: (1) split into training and test data, (2) fit a model on the training data, (3) evaluate predictions against the test labels, (4) fit the final model on all data.]
  29. Holdout and Hyperparameter Tuning I

    [Figure: steps 1 to 3]
    1. Data + Labels are split into Training Data + Training Labels, Validation Data + Validation Labels, and Test Data + Test Labels.
    2. Learning Algorithm + Training Data + Training Labels, run once per set of hyperparameter values → one Model per set.
    3. Each Model + Validation Data → Prediction; Prediction + Validation Labels → Performance. The best performance determines the Best Model and the Best Hyperparameter values.
  30. Holdout and Hyperparameter Tuning II

    [Figure: steps 4 to 6]
    4. Learning Algorithm + Best Hyperparameter Values + Training Data + Training Labels + Validation Data + Validation Labels → Model
    5. Model + Test Data → Prediction; Prediction + Test Labels → Performance
    6. Learning Algorithm + Best Hyperparameter Values + Data + Labels → Final Model
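
The holdout procedure with a separate validation set can be sketched as below. The 60/20/20 split, the k-nearest neighbors classifier, and the candidate values for `n_neighbors` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split into training (60%), validation (20%), and test (20%) sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp
)

# Fit one model per hyperparameter value and score each on the validation set.
val_scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)
best_k = max(val_scores, key=val_scores.get)

# Refit with the best value on training + validation data, then evaluate on the test set.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tmp, y_tmp)
test_accuracy = model.score(X_test, y_test)

# Fit the final model on all available data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)

print(f"best n_neighbors: {best_k}, test accuracy: {test_accuracy:.3f}")
```

The test set is touched exactly once, after the hyperparameter has been chosen, so the reported test accuracy is not biased by the selection step.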
  31. K-fold Cross-Validation

    [Figure: K iterations (K-folds), shown for the 1st to 5th iteration. In each iteration a different fold is the validation fold and the remaining folds are training folds, giving Performance 1 to Performance 5.]
    Within each iteration: Learning Algorithm + Hyperparameter Values + Training Fold Data + Training Fold Labels → Model; Model + Validation Fold Data → Prediction; Prediction + Validation Fold Labels → Performance.
    $\text{Performance} = \frac{1}{10} \sum_{i=1}^{10} \text{Performance}_i$

    This work by Sebastian Raschka is licensed under a Creative Commons Attribution 4.0 International License.
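
The per-fold loop and the final averaging can be written out explicitly. This sketch assumes 5 stratified folds, the Iris data, and a k-nearest neighbors classifier; each fold serves as the validation fold exactly once.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X, y):
    # Fit on the training folds, evaluate on the held-out validation fold.
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The cross-validation estimate is the mean of the per-fold performances.
cv_performance = np.mean(fold_scores)
print(np.round(fold_scores, 3), f"mean = {cv_performance:.3f}")
```

`cross_val_score(model, X, y, cv=kfold)` wraps the same loop in a single call.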
  32. K-fold Cross-Validation Workflow I

    [Figure: steps 1 to 3]
    1. Data + Labels are split into Training Data + Training Labels and Test Data + Test Labels.
    2. Learning Algorithm + Training Data + Training Labels, run once per set of hyperparameter values → one Model per set.
    3. Learning Algorithm + Best Hyperparameter Values + Training Data + Training Labels → Model
  33. K-fold Cross-Validation Workflow II

    [Figure: steps 3 to 5]
    3. Learning Algorithm + Best Hyperparameter Values + Training Data + Training Labels → Model
    4. Model + Test Data → Prediction; Prediction + Test Labels → Performance
    5. Learning Algorithm + Best Hyperparameter Values + Data + Labels → Final Model
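
`GridSearchCV` covers the hyperparameter search and the refit on the training data in one object. The sketch below is one possible way to run the workflow; the classifier, the parameter grid, and the 70/30 split are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set before any tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Evaluate each hyperparameter value with 5-fold cross-validation on the training data.
# refit=True (the default) then refits the best setting on the whole training set.
search = GridSearchCV(
    KNeighborsClassifier(), param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5
)
search.fit(X_train, y_train)
best_params = search.best_params_

# Evaluate the refitted model on the test set.
test_accuracy = search.score(X_test, y_test)

# Fit the final model on all data with the best hyperparameters.
final_model = KNeighborsClassifier(**best_params).fit(X, y)

print(best_params, f"test accuracy: {test_accuracy:.3f}")
```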
  34. Further Resources

    Great "math-free," practical guide to machine learning with scikit-learn, by Andreas Mueller (scikit-learn core developer) and Sarah Guido.
    http://shop.oreilly.com/product/0636920030515.do
    Estimated release: October 20, 2016
  35. Further Resources

    My favorite machine learning "math & theory" books:
    • http://statweb.stanford.edu/~tibs/ElemStatLearn/ (free PDF)
    • http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html
  36. Further Resources

    My own book. Some math, "from-scratch" code, and practical scikit-learn examples.
    GitHub repository: https://github.com/rasbt/python-machine-learning-book
    Amazon link: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/
  37. Thanks for attending!

    Contact:
    o E-mail: [email protected]
    o Website: http://sebastianraschka.com
    o Twitter: @rasbt
    o GitHub: rasbt
    Link to the material: https://github.com/rasbt/pydata-chicago2016-ml-tutorial