Slide 1

Machine Learning with scikit-learn
Andreas Mueller (NYU Center for Data Science, co-release manager of scikit-learn)

Slide 2

Me

Slide 3

What is scikit-learn?

Slide 4

Classification, regression, clustering, semi-supervised learning, feature selection, feature extraction, manifold learning, dimensionality reduction, kernel approximation, hyperparameter optimization, evaluation metrics, out-of-core learning, …

Slide 5

http://scikit-learn.org/

Slide 6

What is machine learning?

Slide 7

"Hi Andy, I just received an email from the first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it. I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-) -Leah"

"Hi Andreas, I am very interested in your Machine Learning background. I work for X Recruiting who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people. Can we find a time for a call to chat for a few minutes about this? Thanks"


Slide 9

Doing Machine Learning with scikit-learn

Slides 10–13

Representing Data

X = [[1.1, 2.2, 3.4, 5.6, 1.0],
     [6.7, 0.5, 0.4, 2.6, 1.6],
     [2.4, 9.3, 7.3, 6.4, 2.8],
     [1.5, 0.0, 4.3, 8.3, 3.4],
     [0.5, 3.5, 8.1, 3.6, 4.6],
     [5.1, 9.7, 3.5, 7.9, 5.1],
     [3.7, 7.8, 2.6, 3.2, 6.3]]

y = [1.6, 2.7, 4.4, 0.5, 0.2, 5.6, 6.7]

Each row of X is one sample and each column is one feature; y holds the outputs / labels, one per sample.

Slides 14–16

Training and Testing Data

The rows of X and the matching entries of y are partitioned into a training set and a test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
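A minimal runnable version of the split, for concreteness; the iris dataset is a stand-in of my choosing, not part of the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# By default 25% of the rows go into the test set; random_state makes
# the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)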

Slides 17–20

Supervised Machine Learning

[Diagram, built up over four slides] Training data and training labels go into building the model (training). The model is then applied to test data to produce a prediction, and the prediction is compared against the test labels for evaluation (generalization).

Slides 21–23

The same diagram as code:

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.score(X_test, y_test)
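Putting the three calls together on real data, as a sketch; again the iris dataset stands in for the slides' X and y:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)           # build the model from training data and labels
y_pred = clf.predict(X_test)        # predictions for the test data
print(clf.score(X_test, y_test))    # mean accuracy against the test labels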

Slides 24–25

Unsupervised Machine Learning

[Diagram] The model is built from training data alone, with no labels; applied to test data, it produces a new view of that data.

Slide 26

Unsupervised Transformations

pca = PCA()
pca.fit(X_train)
X_new = pca.transform(X_test)
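A runnable sketch of the unsupervised pattern; the two-component PCA on iris is my choice for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

pca = PCA(n_components=2)     # keep the two directions of largest variance
pca.fit(X_train)              # fit uses the training data only, no labels
X_new = pca.transform(X_test)
print(X_new.shape)            # (38, 2): same samples, fewer features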

Slide 27

Basic API

estimator.fit(X, [y])

estimator.predict: classification, regression, clustering
estimator.transform: preprocessing, dimensionality reduction, feature selection, feature extraction
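The point of the uniform interface is that every estimator looks the same; a small sketch contrasting a predictor with a transformer (the estimator choices here are mine):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000).fit(X, y)   # supervised: fit(X, y), then predict
print(clf.predict(X[:3]))

scaler = StandardScaler().fit(X)                    # unsupervised: fit(X), then transform
print(scaler.transform(X[:3]))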

Slide 28

Sample application: Sentiment Analysis

Slide 29

IMDB Movie Reviews Data

Review: "One of the worst movies I've ever rented. Sorry it had one of my favorite actors on it (Travolta) in a nonsense role. In fact, anything made sense in this movie. Who can say there was true love between Eddy and Maureen? Don't you remember the beginning of the movie? Is she so lovely? Ask her daughters. I don't think so."

Label: negative

Training data: 12500 positive and 12500 negative reviews.

Slides 30–34

Bag-of-Words Representations (CountVectorizer / TfidfVectorizer)

"This is how you get ants."

Tokenizer: ['this', 'is', 'how', 'you', 'get', 'ants']

Build a vocabulary over all documents: ['aardvark', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']

Sparse matrix encoding: the document becomes a vector over the whole vocabulary, nonzero only at the positions of 'ants', 'get', 'you', and the other words it contains:
[0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0]
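A sketch of these steps in code, on the slide's sentence plus one invented document so the shared vocabulary is visible (get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is how you get ants.",
        "You get ants, then more ants."]

vect = CountVectorizer()
X = vect.fit_transform(docs)           # tokenize, build vocabulary, encode

print(vect.get_feature_names_out())    # the vocabulary over all documents
print(X.toarray())                     # dense view of the sparse count matrix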

Slides 35–36

Implementation and Results

text_pipe = make_pipeline(CountVectorizer(), LinearSVC())
text_pipe.fit(X_train, y_train)
text_pipe.score(X_test, y_test)
> 0.85
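The same pipeline on a tiny invented corpus, since the IMDB data has to be downloaded separately; texts and labels here are purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

text_train = ["a wonderful, moving film",
              "one of the worst movies I've ever rented",
              "great cast, great story",
              "a boring waste of two hours"]
y_train = ["positive", "negative", "positive", "negative"]

text_pipe = make_pipeline(CountVectorizer(), LinearSVC())
text_pipe.fit(text_train, y_train)
print(text_pipe.predict(["what a great movie", "utterly boring"]))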

Slide 37

Model Complexity

Slides 38–40

Overfitting and Underfitting

[Plot: accuracy vs. model complexity] Training accuracy keeps rising as model complexity grows, while generalization accuracy rises, peaks, and then falls. Left of the peak the model is underfitting, right of it overfitting; the peak is the sweet spot.
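The curves can be reproduced in a few lines; a sketch using tree depth as the complexity axis (my choice of example, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # training accuracy keeps climbing with depth; test accuracy peaks, then drops
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))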

Slide 41

Model Complexity Examples

Slides 42–43

Linear SVM

Slides 44–47

(RBF) Kernel SVM

Slides 48–53

Decision Trees

Slides 54–56

Random Forests

Slide 57

Model Evaluation and Model Selection

Slides 58–62

[Diagram, built up over five slides] All data is split into training data and test data. For cross-validation, the training data is divided into five folds. Each of the five splits holds out a different fold for validation and trains on the other four, so every fold serves as the validation set exactly once.

Slide 63

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
>> [ 0.92  1.  1.  1.  1. ]
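A runnable version, with iris standing in for the slide's X and y (the near-perfect scores on the slide suggest a similarly easy dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(), X, y, cv=5)   # one score per fold
print(scores)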

Slides 64–68

[Grid built up over five slides] Every combination of C in {0.001, 0.01, 0.1, 1, 10} and gamma in {0.001, 0.01, 0.1, 1, 10} gets its own model, from SVC(C=0.001, gamma=0.001) up to SVC(C=10, gamma=10): 25 candidates in all.

Slides 69–71

[Diagram] All data is split into training data and test data. Cross-validation over the training data (the five splits above) is used for finding parameters; the test data is kept aside for the final evaluation only.

Slide 72

Cross-Validated Grid Search

from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
param_grid = {'C': 10. ** np.arange(-3, 3),
              'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)
grid.score(X_test, y_test)
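A complete sketch, with iris again standing in for X and y:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': 10. ** np.arange(-3, 3),
              'gamma': 10. ** np.arange(-3, 3)}
grid = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)          # inner cross-validation picks C and gamma
print(grid.best_params_)
print(grid.score(X_test, y_test))   # final evaluation on the held-out set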

Slide 73

Pipelines

Slides 74–78

[Diagram, built up over five slides] The model is rarely fed raw training data: feature extraction, scaling, and feature selection sit between the data and the model. Cross-validation and parameter selection have to wrap this whole chain, not just the final model.

Slides 79–80

Pipelines

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
pipe.predict(X_test)

Slide 81

Combining Pipelines and Grid Search

Proper cross-validation: the scaler is re-fit inside every fold. Pipeline parameters are addressed as stepname__parameter, so 'svc__C' reaches the C of the SVC step.

param_grid = {'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}
scaler_pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(scaler_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

Slide 82

Combining Pipelines and Grid Search II

Searching over parameters of the preprocessing step as well:

param_grid = {'selectkbest__k': [1, 2, 3, 4],
              'svc__C': 10. ** np.arange(-3, 3),
              'svc__gamma': 10. ** np.arange(-3, 3)}
select_pipe = make_pipeline(SelectKBest(), SVC())
grid = GridSearchCV(select_pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

Slide 83

Do cross-validation over all steps jointly. Keep a separate test set until the very end.

Slide 84

Scoring Functions

Slide 85

Defaults in GridSearchCV and cross_val_score: accuracy for classification, R² for regression.

Slides 86–89

Scoring with imbalanced data

cross_val_score(SVC(), X_train, y_train)
>>> array([ 0.9,  0.9,  0.9])

cross_val_score(DummyClassifier(strategy="most_frequent"), X_train, y_train)
>>> array([ 0.9,  0.9,  0.9])

With 90% of the samples in one class, always predicting the majority class scores the same accuracy as the SVM. A metric such as ROC AUC exposes the difference:

cross_val_score(SVC(), X_train, y_train, scoring="roc_auc")
>>> array([ 1.,  1.,  1.])
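The comparison can be reconstructed on synthetic 90/10 data; this construction is mine, not the slides' dataset:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# weights=[0.9, 0.1] makes roughly 90% of the samples class 0
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

print(cross_val_score(SVC(), X, y, cv=3))                                      # accuracy
print(cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=3))  # accuracy
print(cross_val_score(SVC(), X, y, cv=3, scoring="roc_auc"))                   # ROC AUC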

Slides 90–91

Video Series: Advanced Machine Learning with scikit-learn
50% off with coupon code AUTHD

Slide 92

CDS is hiring Research Engineers. Work on your favorite data science open source project full time!

Slide 93

Thank you for your attention.

@t3kcit
@amueller
[email protected]
http://amueller.github.io