Machine Learning With Scikit-Learn ODSC SF 2015

Introduction to machine learning with scikit-learn. Material at https://github.com/amueller/odscon-sf-2015

Andreas Mueller

November 15, 2015

Transcript

  1. Machine Learning with Scikit-Learn Andreas Mueller (NYU Center for Data

    Science, scikit-learn) Material: http://bit.ly/sklsf
  2. 2 Me

  3. 3 Classification, Regression, Clustering, Semi-Supervised Learning, Feature Selection, Feature Extraction,

    Manifold Learning, Dimensionality Reduction, Kernel Approximation, Hyperparameter Optimization, Evaluation Metrics, Out-of-core learning, …
  4. 4

  5. 5 Get the notebooks! http://bit.ly/sklsf

  6. 6 http://scikit-learn.org/

  7. 7 Hi Andy, I just received an email from the

    first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it. I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-) -Leah
    Hi Andreas, I am very interested in your Machine Learning background. I work for X Recruiting, who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people. Can we find a time for a call to chat for a few minutes about this? Thanks
  8. 8 Hi Andy, I just received an email from the

    first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it. I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-) -Leah
    Hi Andreas, I am very interested in your Machine Learning background. I work for X Recruiting, who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people. Can we find a time for a call to chat for a few minutes about this? Thanks
  9. 9 Doing Machine Learning With Scikit-Learn

  10. 10 Representing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
  11. 11 Representing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    one sample = one row of X
  12. 12 Representing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    one sample = one row of X; one feature = one column of X
  13. 13 Representing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7
    one sample = one row of X; one feature = one column of X; y holds the outputs / labels, one per sample
  14. 14 Training and Testing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7
  15. 15 Training and Testing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7
    the samples are divided into a training set and a test set
  16. 16 Training and Testing Data

    X = 1.1 2.2 3.4 5.6 1.0
        6.7 0.5 0.4 2.6 1.6
        2.4 9.3 7.3 6.4 2.8
        1.5 0.0 4.3 8.3 3.4
        0.5 3.5 8.1 3.6 4.6
        5.1 9.7 3.5 7.9 5.1
        3.7 7.8 2.6 3.2 6.3
    y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7
    the samples are divided into a training set and a test set:
    from sklearn.cross_validation import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)
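    A minimal runnable sketch of this split (the iris dataset here is a stand-in for illustration; the import path matches the 2015-era sklearn.cross_validation module used above, which later moved to sklearn.model_selection):

        # Split a labeled dataset into a training part and a test part.
        from sklearn.datasets import load_iris
        from sklearn.cross_validation import train_test_split

        iris = load_iris()
        X, y = iris.data, iris.target

        # By default, 25% of the samples are held out for testing.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)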
  17. 17 Supervised Machine Learning Training Data Training Labels Model

  18. 18 Supervised Machine Learning Training Data Test Data Training Labels

    Model Prediction
  19. 19 Supervised Machine Learning Training Data Test Data Training Labels

    Model Prediction Test Labels Evaluation
  20. 20 Supervised Machine Learning Training Data Test Data Training Labels

    Model Prediction Test Labels Evaluation Training Generalization
  21. 21 Training Data + Training Labels → Model

    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
  22. 22 Training Data + Training Labels → Model; Test Data → Prediction

    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
  23. 23 Training Data + Training Labels → Model; Test Data → Prediction; Test Labels → Evaluation

    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    clf.score(X_test, y_test)
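    Put together, the whole supervised workflow from the three slides above fits in a few lines; a sketch assuming the iris data as a stand-in:

        from sklearn.datasets import load_iris
        from sklearn.cross_validation import train_test_split
        from sklearn.ensemble import RandomForestClassifier

        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(
            iris.data, iris.target, random_state=0)

        clf = RandomForestClassifier()     # build the model
        clf.fit(X_train, y_train)          # learn from training data and labels
        y_pred = clf.predict(X_test)       # predict labels for the test data
        print(clf.score(X_test, y_test))   # fraction of correct predictions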
  24. 24 IPython Notebook: Part 1 - Introduction to Scikit-learn

  25. 25 Unsupervised Machine Learning Training Data Model

  26. 26 Unsupervised Machine Learning Training Data Test Data Model New

    View
  27. 27 Unsupervised Transformations

    Training Data → Model; Test Data → Transformation
    pca = PCA()
    pca.fit(X_train)
    X_new = pca.transform(X_test)
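    The same pattern for an unsupervised transformer, again sketched on the iris data (the dataset and n_components=2 are illustrative assumptions):

        from sklearn.datasets import load_iris
        from sklearn.cross_validation import train_test_split
        from sklearn.decomposition import PCA

        iris = load_iris()
        X_train, X_test = train_test_split(iris.data, random_state=0)

        pca = PCA(n_components=2)
        pca.fit(X_train)                # learn the projection from training data only
        X_new = pca.transform(X_test)   # apply the same projection to the test data
        print(X_new.shape)              # (38, 2)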
  28. 28 IPython Notebook: Part 2 – Unsupervised Transformers

  29. 29 Basic API

    estimator.fit(X, [y])
    estimator.predict → Classification, Regression, Clustering
    estimator.transform → Preprocessing, Dimensionality reduction, Feature selection, Feature extraction
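    A hedged illustration of the two method families (the estimators chosen here are arbitrary examples, not from the deck):

        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        iris = load_iris()
        X, y = iris.data, iris.target

        # predict-style estimators: classification, regression, clustering
        clf = LogisticRegression().fit(X, y)
        print(clf.predict(X[:3]))

        # transform-style estimators: preprocessing, dimensionality reduction,
        # feature selection, feature extraction
        scaler = StandardScaler().fit(X)
        print(scaler.transform(X[:3]))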
  30. 30 All Data Training data Test data

  31. 31 All Data Training data Test data Fold 1 Fold

    2 Fold 3 Fold 4 Fold 5
  32. 32 All Data, Training data, Test data. [Diagram: the data is divided into Folds 1-5; in Split 1, one fold is used as test data and the remaining four folds as training data]
  33. 33 All Data, Training data, Test data. [Diagram: Splits 1 and 2; each split uses a different one of Folds 1-5 as test data and trains on the other four]
  34. 34 All Data, Training data, Test data. [Diagram: Splits 1-5 of 5-fold cross-validation; each split uses a different one of Folds 1-5 as test data and trains on the other four]
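    In scikit-learn this whole procedure is one function call; a minimal sketch (iris data assumed, and in the 2015 release cross_val_score lived in sklearn.cross_validation):

        from sklearn.cross_validation import cross_val_score
        from sklearn.datasets import load_iris
        from sklearn.svm import SVC

        iris = load_iris()
        # Five splits: each fold serves as the test fold exactly once.
        scores = cross_val_score(SVC(), iris.data, iris.target, cv=5)
        print(scores, scores.mean())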
  35. 35 IPython Notebook: Part 3 - Cross-validation

  36. 36

  37. 37

  38. 38 All Data Training data Test data

  39. 39 All Data, Training data, Test data. [Diagram: All Data is first divided into training data and held-out test data; Splits 1-5 of 5-fold cross-validation (Folds 1-5) run on the training data only]
  40. 40 All Data, Training data, Test data. [Diagram: the cross-validation splits (Splits 1-5 over Folds 1-5) are used for Finding Parameters; the held-out test data is used for the Final evaluation]
  41. 41 SVC(C=0.001, gamma=0.001)

  42. 42 SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)
  43. 43 SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)

    SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)
  44. 44 SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)

    SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)
    SVC(C=0.001, gamma=0.1) SVC(C=0.01, gamma=0.1) SVC(C=0.1, gamma=0.1) SVC(C=1, gamma=0.1) SVC(C=10, gamma=0.1)
  45. 45 SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)

    SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)
    SVC(C=0.001, gamma=0.1) SVC(C=0.01, gamma=0.1) SVC(C=0.1, gamma=0.1) SVC(C=1, gamma=0.1) SVC(C=10, gamma=0.1)
    SVC(C=0.001, gamma=1) SVC(C=0.01, gamma=1) SVC(C=0.1, gamma=1) SVC(C=1, gamma=1) SVC(C=10, gamma=1)
    SVC(C=0.001, gamma=10) SVC(C=0.01, gamma=10) SVC(C=0.1, gamma=10) SVC(C=1, gamma=10) SVC(C=10, gamma=10)
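    GridSearchCV runs cross-validation for every point of this 5 x 5 grid and refits the best candidate; a sketch using the 2015-era sklearn.grid_search module (iris data assumed):

        from sklearn.cross_validation import train_test_split
        from sklearn.datasets import load_iris
        from sklearn.grid_search import GridSearchCV
        from sklearn.svm import SVC

        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(
            iris.data, iris.target, random_state=0)

        param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
                      'gamma': [0.001, 0.01, 0.1, 1, 10]}
        grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
        grid.fit(X_train, y_train)          # cross-validated search: finding parameters
        print(grid.best_params_)
        print(grid.score(X_test, y_test))   # final evaluation on the held-out test data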
  46. 46 IPython Notebook: Part 4 – Grid Searches

  47. 47 Training Data Training Labels Model

  48. 48 Training Data Training Labels Model

  49. 49 Training Data Training Labels Model Feature Extraction

  50. 50 Training Data Training Labels Model Feature Extraction Scaling

  51. 51 Training Data Training Labels Model Feature Extraction Scaling Feature

    Selection
  52. 52 Training Data Training Labels Model Feature Extraction Scaling Feature

    Selection Cross Validation
  53. 53 Training Data Training Labels Model Feature Extraction Scaling Feature

    Selection Cross Validation
  54. 54 Pipelines

  55. 55 Pipelines
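    A minimal pipeline sketch chaining a scaler and a classifier (the choice of steps is an illustrative assumption):

        from sklearn.cross_validation import train_test_split
        from sklearn.datasets import load_iris
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(
            iris.data, iris.target, random_state=0)

        # fit() runs fit_transform on every step, then fits the final estimator;
        # score()/predict() run transform on every step, then the final estimator.
        pipe = make_pipeline(StandardScaler(), SVC())
        pipe.fit(X_train, y_train)
        print(pipe.score(X_test, y_test))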

  56. 56 IPython Notebook: Part 5 - Preprocessing and Pipelines

  57. 57 Do cross-validation over all steps jointly. Keep a separate

    test set until the very end.
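    Combining Pipeline and GridSearchCV does exactly that: the preprocessing steps are re-fit inside every cross-validation split, and the test set is touched only once at the end. A sketch under the same illustrative assumptions as above:

        from sklearn.cross_validation import train_test_split
        from sklearn.datasets import load_iris
        from sklearn.grid_search import GridSearchCV
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(
            iris.data, iris.target, random_state=0)

        pipe = make_pipeline(StandardScaler(), SVC())
        # Step parameters are addressed as <stepname>__<parameter>.
        param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
        grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
        grid.fit(X_train, y_train)          # scaling is re-fit inside each split
        print(grid.score(X_test, y_test))   # test set used only at the very end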
  58. 58 Sample application: Sentiment Analysis

  59. 59 IMDB Movie Reviews Data

    Review: "One of the worst movies I've ever rented. Sorry it had one of my favorite actors on it (Travolta) in a nonsense role. In fact, anything made sense in this movie. Who can say there was true love between Eddy and Maureen? Don't you remember the beginning of the movie? Is she so lovely? Ask her daughters. I don't think so."
    Label: negative
    Training data: 12500 positive, 12500 negative
  60. 60 Bag Of Word Representations CountVectorizer / TfidfVectorizer

  61. 61 Bag Of Word Representations “This is how you get

    ants.” CountVectorizer / TfidfVectorizer
  62. 62 Bag Of Word Representations “This is how you get

    ants.” ['this', 'is', 'how', 'you', 'get', 'ants'] CountVectorizer / TfidfVectorizer tokenizer
  63. 63 Bag Of Word Representations “This is how you get

    ants.” ['this', 'is', 'how', 'you', 'get', 'ants'] CountVectorizer / TfidfVectorizer tokenizer Build a vocabulary over all documents ['aardvak', 'amsterdam', 'ants', ... 'you', 'your', 'zyxst']
  64. 64 Bag Of Word Representations (CountVectorizer / TfidfVectorizer)

    “This is how you get ants.” → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants'] → build a vocabulary over all documents: ['aardvak', 'amsterdam', 'ants', ... 'you', 'your', 'zyxst'] → sparse matrix encoding: [0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0] with the ones at the vocabulary positions of 'ants', 'get', and 'you'
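    A small runnable sketch of this encoding (the second document is an invented example):

        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["This is how you get ants.",
                "You get ants, then more ants."]
        vect = CountVectorizer()
        X = vect.fit_transform(docs)      # build the vocabulary, then encode
        print(vect.get_feature_names())   # ['ants', 'get', 'how', 'is', 'more', 'then', 'this', 'you']
        print(X.toarray())                # one row per document, one column per word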
  65. 65 IPython Notebook: Part 6 - Working With Text Data

  66. 66 Feature Union Training Data Training Labels Model Feature Extraction

    I Feature Extraction II
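    FeatureUnion concatenates the outputs of several transformers side by side; a sketch with two illustrative extraction steps (the choice of PCA and SelectKBest is an assumption):

        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA
        from sklearn.feature_selection import SelectKBest
        from sklearn.pipeline import FeatureUnion

        iris = load_iris()
        # Feature Extraction I and II, run on the same input and stacked.
        union = FeatureUnion([("pca", PCA(n_components=2)),
                              ("select", SelectKBest(k=3))])
        X_combined = union.fit_transform(iris.data, iris.target)
        print(X_combined.shape)   # (150, 5): 2 PCA components + 3 selected features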
  67. 67 IPython Notebook: Part 7 – FeatureUnion

  68. 68 Overfitting and Underfitting Model complexity Accuracy Training

  69. 69 Overfitting and Underfitting Model complexity Accuracy Training Generalization

  70. 70 Overfitting and Underfitting Model complexity Accuracy Training Generalization Underfitting

    Overfitting Sweet spot
  71. 71 Linear SVM

  72. 72 Linear SVM

  73. 73 (RBF) Kernel SVM

  74. 74 (RBF) Kernel SVM

  75. 75 (RBF) Kernel SVM

  76. 76 (RBF) Kernel SVM

  77. 77 Decision Trees

  78. 78 Decision Trees

  79. 79 Decision Trees

  80. 80 Decision Trees

  81. 81 Decision Trees

  82. 82 Decision Trees

  83. 83 Random Forests

  84. 84 Random Forests

  85. 85 Random Forests

  86. 86 Validation Curves train_scores, test_scores = validation_curve(SVC(), X, y, param_name="gamma",

    param_range=param_range)
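    To make the snippet above runnable, param_range has to be defined; a sketch assuming a log-spaced range of gamma values (in the 2015 release validation_curve lived in sklearn.learning_curve):

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.learning_curve import validation_curve
        from sklearn.svm import SVC

        iris = load_iris()
        X, y = iris.data, iris.target

        # One row of scores per parameter value, one column per CV split.
        param_range = np.logspace(-6, 2, 9)
        train_scores, test_scores = validation_curve(
            SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
        print(train_scores.mean(axis=1))   # training accuracy per gamma
        print(test_scores.mean(axis=1))    # validation accuracy per gamma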
  87. 87 Scaling Up

  88. 88 Three regimes of data: • Fits in RAM • Fits on a Hard Drive • Doesn't fit on a single PC
  89. 89 Three regimes of data: • Fits in RAM (up to 256 GB?) • Fits on a Hard Drive (up to 6 TB?) • Doesn't fit on a single PC
  90. 90

  91. 91

  92. 92 "256GB ought to be enough for anybody." - me
  93. 93 "256GB ought to be enough for anybody." - me (for machine learning)
  94. 94 Subsample!

  95. 95 The scikit-learn way

  96. 96 HDD / Network → your for-loop / polling → estimator.partial_fit(X_batch, y_batch) → trained scikit-learn estimator
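    A self-contained sketch of that loop; the mini-batches here are synthetic stand-ins for whatever your for-loop would read from disk or the network:

        import numpy as np
        from sklearn.linear_model import SGDClassifier

        rng = np.random.RandomState(0)
        clf = SGDClassifier()

        classes = np.array([0, 1])   # partial_fit must see all classes up front
        for _ in range(10):
            # Stand-in batch; in practice, read the next chunk from HDD / network.
            X_batch = rng.uniform(size=(100, 5))
            y_batch = (X_batch.sum(axis=1) > 2.5).astype(int)
            clf.partial_fit(X_batch, y_batch, classes=classes)

        print(clf.score(X_batch, y_batch))   # accuracy on the last batch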
  97. 97 Supported Algorithms • All SGDClassifier derivatives • Naive Bayes

    • MiniBatchKMeans • Birch • IncrementalPCA • MiniBatchDictionaryLearning
  98. 98 IPython Notebook: Part 8 – Out Of Core Learning

  99. 99 Stateless Transformers • Normalizer • HashingVectorizer • RBFSampler (and

    other kernel approximations)
  100. 100 Bag Of Word Representations (CountVectorizer / TfidfVectorizer)

    “This is how you get ants.” → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants'] → build a vocabulary over all documents: ['aardvak', 'amsterdam', 'ants', ... 'you', 'your', 'zyxst'] → sparse matrix encoding: [0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0] with the ones at the vocabulary positions of 'ants', 'get', and 'you'
  101. 101 Hashing Trick (HashingVectorizer)

    “This is how you get ants.” → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants'] → hashing → [hash('this'), hash('is'), hash('how'), hash('you'), hash('get'), hash('ants')] = [832412, 223788, 366226, 81185, 835749, 173092] → sparse matrix encoding: [0, …, 0, 1, 0, …, 0, 1, 0, …, 0, 1, 0, …, 0] with the ones at the hashed index positions
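    A minimal sketch; because the column index comes from hashing rather than from a learned vocabulary, transform works without any fit:

        from sklearn.feature_extraction.text import HashingVectorizer

        # Stateless: no vocabulary is stored, so no preliminary pass over the
        # data is needed, which is what makes it usable out of core.
        vect = HashingVectorizer(n_features=2 ** 20)
        X = vect.transform(["This is how you get ants."])
        print(X.shape)   # (1, 1048576), a sparse row with a few nonzero entries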
  102. 102 IPython Notebook: Part 9 – Out Of Core Learning

    for Text
  103. 103 Video Series Advanced Machine Learning with scikit-learn 50% Off

    Coupon Code: AUTHD
  104. 104 Video Series Advanced Machine Learning with scikit-learn 50% Off

    Coupon Code: AUTHD
  105. 105 Thank you for your attention. @t3kcit @amueller importamueller@gmail.com http://amueller.github.io