Slide 1

Machine Learning with Scikit-Learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)
Material: http://bit.ly/sklsf

Slide 2

Me

Slide 3

● Classification
● Regression
● Clustering
● Semi-Supervised Learning
● Feature Selection
● Feature Extraction
● Manifold Learning
● Dimensionality Reduction
● Kernel Approximation
● Hyperparameter Optimization
● Evaluation Metrics
● Out-of-core learning
● …

Slide 4


Slide 5

Get the notebooks! http://bit.ly/sklsf

Slide 6

http://scikit-learn.org/

Slide 7

Hi Andy, I just received an email from the first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it. I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-) -Leah

Hi Andreas, I am very interested in your Machine Learning background. I work for X Recruiting, who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people. Can we find a time for a call to chat for a few minutes about this? Thanks

Slide 8


Slide 9

Doing Machine Learning With Scikit-Learn

Slide 10

Representing Data

X =
1.1 2.2 3.4 5.6 1.0
6.7 0.5 0.4 2.6 1.6
2.4 9.3 7.3 6.4 2.8
1.5 0.0 4.3 8.3 3.4
0.5 3.5 8.1 3.6 4.6
5.1 9.7 3.5 7.9 5.1
3.7 7.8 2.6 3.2 6.3

Slide 11

Representing Data

X =
1.1 2.2 3.4 5.6 1.0
6.7 0.5 0.4 2.6 1.6
2.4 9.3 7.3 6.4 2.8
1.5 0.0 4.3 8.3 3.4
0.5 3.5 8.1 3.6 4.6
5.1 9.7 3.5 7.9 5.1
3.7 7.8 2.6 3.2 6.3

one sample = one row of X

Slide 12

Representing Data

X =
1.1 2.2 3.4 5.6 1.0
6.7 0.5 0.4 2.6 1.6
2.4 9.3 7.3 6.4 2.8
1.5 0.0 4.3 8.3 3.4
0.5 3.5 8.1 3.6 4.6
5.1 9.7 3.5 7.9 5.1
3.7 7.8 2.6 3.2 6.3

one sample = one row of X
one feature = one column of X

Slide 13

Representing Data

X =
1.1 2.2 3.4 5.6 1.0
6.7 0.5 0.4 2.6 1.6
2.4 9.3 7.3 6.4 2.8
1.5 0.0 4.3 8.3 3.4
0.5 3.5 8.1 3.6 4.6
5.1 9.7 3.5 7.9 5.1
3.7 7.8 2.6 3.2 6.3

y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7

one sample = one row of X
one feature = one column of X
y holds the outputs / labels, one per sample

Slide 14

Training and Testing Data

X =
1.1 2.2 3.4 5.6 1.0
6.7 0.5 0.4 2.6 1.6
2.4 9.3 7.3 6.4 2.8
1.5 0.0 4.3 8.3 3.4
0.5 3.5 8.1 3.6 4.6
5.1 9.7 3.5 7.9 5.1
3.7 7.8 2.6 3.2 6.3

y = 1.6 2.7 4.4 0.5 0.2 5.6 6.7

Slide 15

Training and Testing Data

The rows of X and y are partitioned into a training set and a held-out test set.

Slide 16

Training and Testing Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

Slide 17

Supervised Machine Learning

Training Data + Training Labels → Model

Slide 18

Supervised Machine Learning

Training Data + Training Labels → Model
Test Data → Model → Prediction

Slide 19

Supervised Machine Learning

Training Data + Training Labels → Model
Test Data → Model → Prediction
Prediction vs. Test Labels → Evaluation

Slide 20

Supervised Machine Learning

Training: Training Data + Training Labels → Model
Generalization: Test Data → Model → Prediction; Prediction vs. Test Labels → Evaluation

Slide 21

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

(Training Data + Training Labels → Model)

Slide 22

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

(Test Data → Model → Prediction)

Slide 23

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.score(X_test, y_test)

(Prediction vs. Test Labels → Evaluation)
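Put together, the calls above are the whole supervised workflow: split, fit, predict, score. A minimal runnable sketch; the iris dataset and the `random_state` values are illustrative choices, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load a small built-in dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit on the training set only
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# predict and evaluate on the held-out test set
y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```

`score` reports mean accuracy on the test set, i.e. the generalization estimate from slide 20.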

Slide 24

IPython Notebook: Part 1 – Introduction to Scikit-learn

Slide 25

Unsupervised Machine Learning

Training Data → Model

Slide 26

Unsupervised Machine Learning

Training Data → Model
Test Data → Model → New View

Slide 27

Unsupervised Transformations

pca = PCA()
pca.fit(X_train)
X_new = pca.transform(X_test)

(Training Data → Model; Test Data → Model → Transformation)
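The same fit/transform pattern as a runnable sketch; the random toy data and `n_components=2` are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# toy data: 100 samples with 5 features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
X_train, X_test = train_test_split(X, random_state=0)

# fit learns the components on the training data only
pca = PCA(n_components=2)
pca.fit(X_train)

# transform re-expresses new data in the learned components
X_new = pca.transform(X_test)
```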

Slide 28

IPython Notebook: Part 2 – Unsupervised Transformers

Slide 29

Basic API

estimator.fit(X, [y])

estimator.predict             estimator.transform
  Classification                Preprocessing
  Regression                    Dimensionality reduction
  Clustering                    Feature selection
                                Feature extraction

Slide 30

All Data = Training data + Test data

Slide 31

The training data is divided into five folds: Fold 1, Fold 2, Fold 3, Fold 4, Fold 5.

Slide 32

Split 1: one fold is held out for validation; the model is trained on the remaining four folds.

Slide 33

Splits 1 and 2: each split holds out a different fold for validation and trains on the other four.

Slide 34

Splits 1–5: each of the five folds is held out exactly once, yielding five train/validation splits.
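The five splits above can be run in one call. A minimal sketch; the iris dataset and logistic regression estimator are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# one score per split: each fold serves as the validation fold exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```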

Slide 35

IPython Notebook: Part 3 – Cross-validation

Slide 36

Slide 37

Slide 38

All Data = Training data + Test data

Slide 39

The training data is divided into five folds and cross-validated over five splits; the test data is kept aside.

Slide 40

Finding parameters: cross-validation over the five splits of the training data.
Final evaluation: the held-out test data, used only once.

Slide 41

SVC(C=0.001, gamma=0.001)

Slide 42

SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)

Slide 43

SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)
SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)

Slide 44

SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)
SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)
SVC(C=0.001, gamma=0.1) SVC(C=0.01, gamma=0.1) SVC(C=0.1, gamma=0.1) SVC(C=1, gamma=0.1) SVC(C=10, gamma=0.1)

Slide 45

SVC(C=0.001, gamma=0.001) SVC(C=0.01, gamma=0.001) SVC(C=0.1, gamma=0.001) SVC(C=1, gamma=0.001) SVC(C=10, gamma=0.001)
SVC(C=0.001, gamma=0.01) SVC(C=0.01, gamma=0.01) SVC(C=0.1, gamma=0.01) SVC(C=1, gamma=0.01) SVC(C=10, gamma=0.01)
SVC(C=0.001, gamma=0.1) SVC(C=0.01, gamma=0.1) SVC(C=0.1, gamma=0.1) SVC(C=1, gamma=0.1) SVC(C=10, gamma=0.1)
SVC(C=0.001, gamma=1) SVC(C=0.01, gamma=1) SVC(C=0.1, gamma=1) SVC(C=1, gamma=1) SVC(C=10, gamma=1)
SVC(C=0.001, gamma=10) SVC(C=0.01, gamma=10) SVC(C=0.1, gamma=10) SVC(C=1, gamma=10) SVC(C=10, gamma=10)
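GridSearchCV enumerates exactly this grid and cross-validates each candidate. A minimal sketch; the iris dataset and `random_state` are illustrative assumptions, while the C and gamma values are the ones from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# every (C, gamma) combination is cross-validated on the training set
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_
# the held-out test set is touched only once, for the final evaluation
test_score = grid.score(X_test, y_test)
```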

Slide 46

IPython Notebook: Part 4 – Grid Searches

Slide 47

Training Data + Training Labels → Model

Slide 48

Training Data + Training Labels → Model

Slide 49

Training Data → Feature Extraction → Model (with Training Labels)

Slide 50

Training Data → Feature Extraction → Scaling → Model (with Training Labels)

Slide 51

Training Data → Feature Extraction → Scaling → Feature Selection → Model (with Training Labels)

Slide 52

Training Data → Feature Extraction → Scaling → Feature Selection → Model (with Training Labels)
Cross Validation

Slide 53

Training Data → Feature Extraction → Scaling → Feature Selection → Model (with Training Labels)
Cross Validation

Slide 54

Pipelines

Slide 55

Pipelines

Slide 56

IPython Notebook: Part 5 – Preprocessing and Pipelines

Slide 57

Do cross-validation over all steps jointly. Keep a separate test set until the very end.
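A pipeline inside a grid search does exactly this: every preprocessing step is refit within each cross-validation split, so nothing leaks from the validation fold, and the test set is scored once at the end. A minimal sketch; the breast cancer dataset, the scaler-plus-SVC pipeline, and the C range are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scaling is part of the pipeline, so it is refit inside every CV split
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# the separate test set is used only here, at the very end
final_score = grid.score(X_test, y_test)
```

Pipeline parameters are addressed as `stepname__parameter`, here `svc__C`.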

Slide 58

Sample application: Sentiment Analysis

Slide 59

IMDB Movie Reviews Data

Review: "One of the worst movies I've ever rented. Sorry it had one of my favorite actors on it (Travolta) in a nonsense role. In fact, anything made sense in this movie. Who can say there was true love between Eddy and Maureen? Don't you remember the beginning of the movie? Is she so lovely? Ask her daughters. I don't think so."
Label: negative

Training data: 12,500 positive and 12,500 negative reviews

Slide 60

Bag Of Word Representations

CountVectorizer / TfidfVectorizer

Slide 61

Bag Of Word Representations

CountVectorizer / TfidfVectorizer
“This is how you get ants.”

Slide 62

Bag Of Word Representations

CountVectorizer / TfidfVectorizer
“This is how you get ants.”
→ tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']

Slide 63

Bag Of Word Representations

CountVectorizer / TfidfVectorizer
“This is how you get ants.”
→ tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
→ build a vocabulary over all documents → ['aardvark', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']

Slide 64

Bag Of Word Representations

CountVectorizer / TfidfVectorizer
“This is how you get ants.”
→ tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
→ build a vocabulary over all documents → ['aardvark', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']
→ sparse matrix encoding → [0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0]
  (ones at the vocabulary positions of 'ants', 'get', and 'you'; zeros everywhere from 'aardvark' to 'zyxst')
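The steps above in one call; the sentence is the one from the slide, the second document is an added illustrative example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is how you get ants.",
        "You get ants, then more ants."]

# fit_transform tokenizes, builds the vocabulary over all documents,
# and returns the sparse count matrix: one row per document
vect = CountVectorizer()
X = vect.fit_transform(docs)

# column indices are assigned in alphabetical vocabulary order
vocab = sorted(vect.vocabulary_)
```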

Slide 65

IPython Notebook: Part 6 – Working With Text Data

Slide 66

Feature Union

Training Data → Feature Extraction I + Feature Extraction II → Model (with Training Labels)
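FeatureUnion runs several extraction steps in parallel and stacks their outputs side by side. A minimal sketch; the iris dataset and the particular PCA/SelectKBest combination are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# two extraction steps applied to the same input, outputs concatenated
union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("kbest", SelectKBest(k=1))])
X_combined = union.fit_transform(X, y)  # 2 PCA components + 1 selected feature
```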

Slide 67

IPython Notebook: Part 7 – FeatureUnion

Slide 68

Overfitting and Underfitting

Accuracy vs. model complexity: the training curve.

Slide 69

Overfitting and Underfitting

Accuracy vs. model complexity: training and generalization curves.

Slide 70

Overfitting and Underfitting

Accuracy vs. model complexity: training accuracy keeps increasing with complexity, while generalization accuracy peaks at a sweet spot between underfitting (too simple) and overfitting (too complex).

Slide 71

Linear SVM

Slide 72

Linear SVM

Slide 73

(RBF) Kernel SVM

Slide 74

(RBF) Kernel SVM

Slide 75

(RBF) Kernel SVM

Slide 76

(RBF) Kernel SVM

Slide 77

Decision Trees

Slide 78

Decision Trees

Slide 79

Decision Trees

Slide 80

Decision Trees

Slide 81

Decision Trees

Slide 82

Decision Trees

Slide 83

Random Forests

Slide 84

Random Forests

Slide 85

Random Forests

Slide 86

Validation Curves

from sklearn.model_selection import validation_curve
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range)
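Fully spelled out, with a concrete parameter range; the iris dataset and the logarithmic gamma grid are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# one row of scores per candidate gamma, one column per CV fold
param_range = np.logspace(-3, 2, 6)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
```

Plotting the mean of each row against `param_range` reproduces the training and generalization curves from the overfitting slides.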

Slide 87

Andreas Mueller

Scaling Up

Slide 88

Three regimes of data

● Fits in RAM
● Fits on a Hard Drive
● Doesn't fit on a single PC

Slide 89

Three regimes of data

● Fits in RAM (up to 256 GB?)
● Fits on a Hard Drive (up to 6 TB?)
● Doesn't fit on a single PC

Slide 90

Slide 91

Slide 92

"256 GB ought to be enough for anybody." - me

Slide 93

"256 GB ought to be enough for anybody." - me (for machine learning)

Slide 94

Subsample!

Slide 95

The scikit-learn way

Slide 96

HDD / Network → your for-loop / polling → estimator.partial_fit(X_batch, y_batch) → trained scikit-learn estimator
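The for-loop around `partial_fit` looks like this in practice. A minimal sketch with synthetic batches standing in for data streamed from disk or the network; the batch sizes, the toy target, and `random_state` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])
clf = SGDClassifier(random_state=0)

# stream the data in batches; the full set of classes must be
# declared on the first partial_fit call
for batch in range(10):
    X_batch = rng.normal(size=(20, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)  # toy target
    clf.partial_fit(X_batch, y_batch, classes=classes)

# the model is usable after any number of batches
X_check = rng.normal(size=(50, 5))
acc = clf.score(X_check, (X_check[:, 0] > 0).astype(int))
```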

Slide 97

Supported Algorithms

● All SGDClassifier derivatives
● Naive Bayes
● MiniBatchKMeans
● Birch
● IncrementalPCA
● MiniBatchDictionaryLearning

Slide 98

IPython Notebook: Part 8 – Out Of Core Learning

Slide 99

Stateless Transformers

● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approximations)

Slide 100

Bag Of Word Representations

CountVectorizer / TfidfVectorizer
“This is how you get ants.”
→ tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
→ build a vocabulary over all documents → ['aardvark', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']
→ sparse matrix encoding → [0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0]
  (ones at the vocabulary positions of 'ants', 'get', and 'you')

Slide 101

Hashing Trick

HashingVectorizer
“This is how you get ants.”
→ tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']
→ hashing → [hash('this'), hash('is'), hash('how'), hash('you'), hash('get'), hash('ants')] = [832412, 223788, 366226, 81185, 835749, 173092]
→ sparse matrix encoding → [0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0]
  (ones at the hashed index of each token; no vocabulary is built)
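Because the column index comes from a hash function rather than a learned vocabulary, the vectorizer is stateless and never needs `fit`, which is what makes it usable out of core. A minimal sketch; the documents and explicit `n_features` are illustrative:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["This is how you get ants.",
        "You get ants."]

# stateless: token -> column via hashing, so transform works directly,
# batch by batch, without ever seeing the full corpus
vect = HashingVectorizer(n_features=2 ** 20)
X = vect.transform(docs)
```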

Slide 102

IPython Notebook: Part 9 – Out Of Core Learning for Text

Slide 103

Video Series: Advanced Machine Learning with scikit-learn
50% Off Coupon Code: AUTHD

Slide 104

Slide 105

Thank you for your attention.

@t3kcit
@amueller
[email protected]
http://amueller.github.io