Slide 1

Machine Learning with Scikit-Learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)
http://bit.ly/sklodsc

Slide 2

Me

Slide 3

Classification, Regression, Clustering, Semi-Supervised Learning, Feature Selection, Feature Extraction, Manifold Learning, Dimensionality Reduction, Kernel Approximation, Hyperparameter Optimization, Evaluation Metrics, Out-of-core learning, …

Slide 4

Slide 5

Get the notebooks! http://bit.ly/sklosdc

Slide 6

http://scikit-learn.org/

Slide 7

"Hi Andy, I just received an email from the first tutorial speaker, presenting right before you, saying he's ill and won't be able to make it. I know you have already committed yourself to two presentations, but is there any way you could increase your tutorial time slot, maybe just offer time to try out what you've taught? Otherwise I have to do some kind of modern dance interpretation of Python in data :-) -Leah"

"Hi Andreas, I am very interested in your Machine Learning background. I work for X Recruiting, who have been engaged by Z, a worldwide leading supplier of Y. We are expanding the core engineering team and we are looking for really passionate engineers who want to create their own story and help millions of people. Can we find a time for a call to chat for a few minutes about this? Thanks"

Slide 8

(The same two emails as on Slide 7.)

Slide 9

Supervised Machine Learning: Training Data + Training Labels → Model

Slide 10

Supervised Machine Learning: Training Data + Training Labels → Model; Test Data → Model → Prediction

Slide 11

Supervised Machine Learning: Training Data + Training Labels → Model; Test Data → Model → Prediction; Prediction + Test Labels → Evaluation

Slide 12

Supervised Machine Learning: the same diagram, annotated with "Training" (fitting on the training data) and "Generalization" (evaluation on the test data)

Slide 13

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

(Training Data + Training Labels → Model)

Slide 14

y_pred = clf.predict(X_test)

(Test Data → Model → Prediction)

Slide 15

clf.score(X_test, y_test)

(Prediction + Test Labels → Evaluation)
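The three calls on Slides 13-15 fit together as one complete script; a minimal runnable sketch, with the iris data, `random_state` values, and the modern `sklearn.model_selection` import path as my choices (not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)             # learn from training data + labels
y_pred = clf.predict(X_test)          # predict labels for unseen data
accuracy = clf.score(X_test, y_test)  # fraction of correct predictions
print(accuracy)
```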

Slide 16

IPython Notebook: Part 1 - Introduction to Scikit-learn

Slide 17

Unsupervised Machine Learning: Training Data → Model

Slide 18

Unsupervised Machine Learning: Training Data → Model; Test Data → Model → New View

Slide 19

Unsupervised Transformations

pca = PCA()
pca.fit(X_train)
X_new = pca.transform(X_test)

(Training Data → Model; Test Data → Model → Transformation)
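The same fit/transform pattern as a self-contained sketch; the random data and `n_components=2` are my choices for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
pca.fit(X_train)               # learn the projection from training data only
X_new = pca.transform(X_test)  # apply the same projection to test data
print(X_new.shape)
```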

Slide 20

IPython Notebook: Part 2 – Unsupervised Transformers

Slide 21

Basic API

estimator.fit(X, [y])
estimator.predict — Classification, Regression, Clustering
estimator.transform — Preprocessing, Dimensionality reduction, Feature selection, Feature extraction

Slide 22

All Data = Training data + Test data

Slide 23

All Data is divided into Fold 1 through Fold 5

Slide 24

Split 1: one fold is used as test data, the remaining four as training data

Slide 25

Split 2: a different fold becomes the test data

Slide 26

Splits 1-5: every fold is used as test data exactly once

Slide 27

IPython Notebook: Part 3 - Cross-validation
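The five-fold scheme above is one call in scikit-learn; a minimal sketch, with the iris data and the SVC estimator as my choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# One score per split: each of the 5 folds serves as test data once.
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)
```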

Slide 28

Slide 29

Slide 30

All Data = Training data + Test data

Slide 31

The training data is divided into five folds for Splits 1-5; the test data is held out

Slide 32

Cross-validation over the training data for finding parameters; the held-out test data for the final evaluation

Slide 33

SVC(C=0.001, gamma=0.001)

Slide 34

SVC(C=0.001, gamma=0.001)
SVC(C=0.01, gamma=0.001)
SVC(C=0.1, gamma=0.001)
SVC(C=1, gamma=0.001)
SVC(C=10, gamma=0.001)

Slide 35

The same five C values, now for gamma=0.001 and gamma=0.01

Slide 36

The same five C values, for gamma in {0.001, 0.01, 0.1}

Slide 37

The full grid: every combination of C in {0.001, 0.01, 0.1, 1, 10} and gamma in {0.001, 0.01, 0.1, 1, 10}
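GridSearchCV builds and cross-validates exactly this grid; a minimal sketch, with the iris data and the train/test split as my choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
# Fits one model per (C, gamma) combination on each CV split.
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # final evaluation on held-out data
```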

Slide 38

IPython Notebook: Part 4 – Grid Searches

Slide 39

Training Data + Training Labels → Model

Slide 40

Training Data + Training Labels → Model

Slide 41

Training Data → Feature Extraction → Model (with Training Labels)

Slide 42

Training Data → Feature Extraction → Scaling → Model (with Training Labels)

Slide 43

Training Data → Feature Extraction → Scaling → Feature Selection → Model (with Training Labels)

Slide 44

The same chain, with Cross Validation around the model only

Slide 45

The same chain, with Cross Validation around all steps

Slide 46

IPython Notebook: Part 5 - Preprocessing and Pipelines

Slide 47

Do cross-validation over all steps jointly. Keep a separate test set until the very end.
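Pipeline is what makes "cross-validation over all steps jointly" a one-liner; a minimal sketch, where the scaler/SVM combination, the iris data, and the C range are my choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The scaler is re-fit on the training part of every CV split, so no
# information from the validation fold leaks into preprocessing.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1, 10]}  # address steps as <stepname>__<param>
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```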

Slide 48

Bag-Of-Word Representations: CountVectorizer / TfidfVectorizer

Slide 49

Input document: "This is how you get ants."

Slide 50

tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants']

Slide 51

Build a vocabulary over all documents: ['aardvark', 'amsterdam', 'ants', ..., 'you', 'your', 'zyxst']

Slide 52

Sparse matrix encoding: one column per vocabulary word, from 'aardvark' to 'zyxst'; the columns for 'ants', 'get', and 'you' are 1, all others 0
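The tokenize/build-vocabulary/encode steps above are what CountVectorizer does in one `fit_transform`; a minimal sketch, with the second document added by me so the vocabulary spans more than one text:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is how you get ants.",
        "Ants, ants everywhere!"]

vect = CountVectorizer()
X = vect.fit_transform(docs)     # build vocabulary, then encode
print(sorted(vect.vocabulary_))  # one column per word
print(X.toarray())               # per-document word counts
```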

Slide 53

Application: Insult detection

Slide 54

Not an insult: "i really don't understand your point. It seems that you are mixing apples and oranges."

Slide 55

Insult: "Clearly you're a fucktard."
Not an insult: "i really don't understand your point. It seems that you are mixing apples and oranges."

Slide 56

IPython Notebook: Part 6 - Working With Text Data

Slide 57

Feature Union: Training Data → Feature Extraction I and Feature Extraction II in parallel; the concatenated features + Training Labels → Model
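A minimal FeatureUnion sketch matching the diagram; the two branches (PCA and univariate selection) and the iris data are my choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Two feature-extraction branches applied in parallel; their outputs
# are concatenated column-wise.
union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("select", SelectKBest(k=1))])
X_combined = union.fit_transform(X, y)
print(X_combined.shape)  # 2 PCA components + 1 selected feature = 3 columns
```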

Slide 58

IPython Notebook: Part 7 – FeatureUnion

Slide 59

Overfitting and Underfitting: accuracy vs. model complexity (training curve)

Slide 60

Overfitting and Underfitting: accuracy vs. model complexity (training and generalization curves)

Slide 61

Overfitting and Underfitting: the underfitting region, the overfitting region, and the sweet spot in between

Slide 62

Linear SVM

Slide 63

Linear SVM

Slide 64

(RBF) Kernel SVM

Slide 65

(RBF) Kernel SVM

Slide 66

(RBF) Kernel SVM

Slide 67

(RBF) Kernel SVM

Slide 68

Decision Trees

Slide 69

Decision Trees

Slide 70

Decision Trees

Slide 71

Decision Trees

Slide 72

Decision Trees

Slide 73

Decision Trees

Slide 74

Random Forests

Slide 75

Random Forests

Slide 76

Random Forests

Slide 77

Validation Curves

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range)
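The snippet above needs data and a parameter range to run; a self-contained sketch, where the iris data and the logarithmic gamma range are my choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_range = np.logspace(-3, 2, 6)  # gamma values to scan

# One row of CV scores per parameter value: shape (n_values, n_folds).
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)
print(train_scores.shape, test_scores.shape)
```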

Slide 78

Learning Curves

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, train_sizes=train_sizes)

Slide 79

Randomized Parameter Search

Slide 80

Randomized Parameter Search (Source: Bergstra and Bengio)

Slide 81

Randomized Parameter Search
● Step-size free for continuous parameters
● Decouples runtime from search-space size
● Robust against irrelevant parameters

Slide 82

Randomized Parameter Search
● Always use distributions for continuous variables.
● Don't use for low-dimensional spaces.
● Future: Bayesian-optimization-based search.
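"Use distributions for continuous variables" looks like this with RandomizedSearchCV; the exponential distributions, their scales, `n_iter=20`, and the iris data are my choices for illustration:

```python
from scipy.stats import expon
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous distributions instead of a fixed grid; n_iter controls
# runtime independently of how many parameters are searched.
param_dist = {'C': expon(scale=10), 'gamma': expon(scale=0.1)}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5,
                            random_state=0)
search.fit(X, y)
print(search.best_params_)
```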

Slide 83

IPython Notebook: Part 8 – Randomized Parameter Search

Slide 84

Generalized Cross-Validation and Path Algorithms

Slide 85

IPython Notebook: Part 9 – Generalized Cross-Validation

Slide 86

Linear Models: LogisticRegressionCV [new], RidgeCV, RidgeClassifierCV, LarsCV, ElasticNetCV, ...
Feature Selection: RFECV
Tree-Based models [possible]: [DecisionTreeCV], [RandomForestClassifierCV], [GradientBoostingClassifierCV]
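These `...CV` estimators fold the cross-validation over the regularization path into `fit`; a minimal RidgeCV sketch on synthetic data (the data and the alpha grid are my choices):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Efficient built-in cross-validation over the regularization path.
reg = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
reg.fit(X, y)
print(reg.alpha_)  # the alpha chosen by cross-validation
```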

Slide 87

Scoring Functions

Slide 88

Default scoring: Accuracy (classification), R² (regression)
Used by: GridSearchCV, RandomizedSearchCV, cross_val_score, ...CV
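The default can be overridden with the `scoring` parameter that all of these accept; a minimal sketch, where the `f1_macro` metric, the logistic-regression estimator, and the iris data are my choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Replace the default accuracy with any named scorer.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1_macro", cv=5)
print(scores.mean())
```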

Slide 89

IPython Notebook: Part 10 – Scoring Metrics

Slide 90

Out of Core Learning

Slide 91

Or: save ourselves the effort

Slide 92

Think twice!
● Old laptop: 4 GB RAM = 1,073,741,824 float32 values, or 1 million data points with 1,000 features
● EC2: 256 GB RAM = 68,719,476,736 float32 values, or 68 million data points with 1,000 features

Slide 93

Supported Algorithms
● All SGDClassifier derivatives
● Naive Bayes
● MiniBatchKMeans
● Birch
● IncrementalPCA
● MiniBatchDictionaryLearning
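These estimators learn out of core via `partial_fit`, which updates the model one batch at a time; a minimal sketch, with the iris data, the shuffling, and the batch size of 30 as my choices (a real out-of-core setup would stream batches from disk instead):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]          # shuffle so batches mix all classes

classes = np.unique(y)           # all classes must be declared up front
clf = SGDClassifier(random_state=0)
# Feed the data in small batches; each call updates the model in place.
for start in range(0, len(X), 30):
    batch = slice(start, start + 30)
    clf.partial_fit(X[batch], y[batch], classes=classes)
acc = clf.score(X, y)
print(acc)
```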

Slide 94

IPython Notebook: Part 11 – Out Of Core Learning

Slide 95

Stateless Transformers
● Normalizer
● HashingVectorizer
● RBFSampler (and other kernel approximations)

Slide 96

Bag-Of-Word Representations (recap of Slide 52): "This is how you get ants." → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants'] → vocabulary built over all documents → sparse matrix encoding (CountVectorizer / TfidfVectorizer)

Slide 97

Hashing Trick (HashingVectorizer): "This is how you get ants." → tokenizer → ['this', 'is', 'how', 'you', 'get', 'ants'] → hashing:
[hash('this'), hash('is'), hash('how'), hash('you'), hash('get'), hash('ants')]
= [832412, 223788, 366226, 81185, 835749, 173092]
→ sparse matrix encoding
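Because the column index comes from a hash of the token, no vocabulary has to be built or stored, which is what makes the transformer stateless and usable out of core; a minimal sketch (the two example documents and the default-sized feature space are my choices):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["This is how you get ants.",
        "Ants, ants everywhere!"]

# No fit needed and no vocabulary stored: the column index of each token
# comes from a hash function, so the transformer works on streaming data.
vect = HashingVectorizer(n_features=2 ** 20)
X = vect.transform(docs)
print(X.shape)
```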

Slide 98

IPython Notebook: Part 12 – Out Of Core Learning for Text

Slide 99

http://scikit-learn.org/

Slide 100

CDS is hiring Research Engineers

Slide 101

Thank you for your attention.
@t3kcit
@amueller
[email protected]
http://amueller.github.io