Slide 1

Slide 1 text

Engineering scikit-learn
Andreas Mueller (NYU Center for Data Science, scikit-learn)

Slide 2

Slide 2 text

● Goals
● Achievements
● Methods
● Challenges
● Open Questions
● Closed Questions (wontfix)
● Outlook

Slide 3

Slide 3 text

Goals

Slide 4

Slide 4 text

High quality, easy to use machine learning library.

Slide 5

Slide 5 text

High quality, easy to use machine learning library. Keep it usable, keep it maintainable.

Slide 6

Slide 6 text

Simple things should be simple, complex things should be possible.
- Alan Kay

Slide 7

Slide 7 text

Non-Goals
● Non-programmatic interfaces
● Algorithm development
● Cutting edge algorithms
● Structured, active, or reinforcement learning

Slide 8

Slide 8 text

Achievements

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

We’ve been using it quite a lot for music recommendations at Spotify and I think it’s the most well-designed ML package I’ve seen so far.
- Spotify

scikit-learn in one word: Awesome.
- Machinalis

I’m constantly recommending that more developers and scientists try scikit-learn.
- Lovely

The documentation is really thorough, as well, which makes the library quite easy to use.
- OkCupid

scikit-learn makes doing advanced analysis in Python accessible to anyone.
- yhat

Slide 11

Slide 11 text

Methods

Slide 12

Slide 12 text

Scoping

Slide 13

Slide 13 text

Scoping
● Matrix in, matrix out
● Widely useful
● Well established
sklearn

Slide 14

Slide 14 text

Simplicity
    est = Est()
    est.fit(X_train, y_train)
    est.score(X_test, y_test)
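The three calls on this slide can be illustrated with a dependency-free sketch. `MajorityClassifier` and the tiny data set are hypothetical and exist only for illustration; this is not a scikit-learn class, just a toy that follows the same construct / fit / score convention.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    # No __init__ arguments are required, so MajorityClassifier() just works.

    def fit(self, X, y):
        # Learn the single most common label; X is ignored by this toy model.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # One prediction per input row.
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Mean accuracy of the predictions against the true labels.
        predictions = self.predict(X)
        correct = sum(p == t for p, t in zip(predictions, y))
        return correct / len(y)


# Made-up data: rows are samples, each with one feature.
X_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
X_test, y_test = [[4], [5]], [1, 0]

est = MajorityClassifier()
est.fit(X_train, y_train)
accuracy = est.score(X_test, y_test)
print(accuracy)
```

Any object that offers the same three steps can be swapped in without changing the calling code, which is the point of the slide.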

Slide 15

Slide 15 text

Consistency
    grid = GridSearchCV(svm, param_grid)
    grid.fit(X_train, y_train)
    grid.score(X_test, y_test)
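The consistency shown here (a parameter search is itself used through `fit` and `score`) can be sketched without any dependencies. `ThresholdClassifier` and `SimpleGridSearch` are hypothetical names, and the search below scores candidates on the training data instead of cross-validating, so it is a simplification for illustration and not how `GridSearchCV` works internally.

```python
from itertools import product


class ThresholdClassifier:
    """Hypothetical toy 1-D classifier: predicts 1 when the feature exceeds `threshold`."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is learned; the threshold is a hyperparameter.
        return self

    def predict(self, X):
        return [int(row[0] > self.threshold) for row in X]

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


class SimpleGridSearch:
    """Hypothetical exhaustive search exposing the same fit/score interface.

    Assumption: candidates are ranked by training accuracy (no cross-validation).
    """

    def __init__(self, estimator_class, param_grid):
        self.estimator_class = estimator_class
        self.param_grid = param_grid

    def fit(self, X, y):
        names = sorted(self.param_grid)
        self.best_score_ = -1.0
        # Try every combination of the listed parameter values.
        for values in product(*(self.param_grid[n] for n in names)):
            params = dict(zip(names, values))
            candidate = self.estimator_class(**params).fit(X, y)
            candidate_score = candidate.score(X, y)
            if candidate_score > self.best_score_:
                self.best_score_ = candidate_score
                self.best_params_ = params
                self.best_estimator_ = candidate
        return self

    def score(self, X, y):
        # Delegate to the best estimator found during fit.
        return self.best_estimator_.score(X, y)


# Made-up data for illustration.
X_train, y_train = [[0.1], [0.4], [0.6], [0.9]], [0, 0, 1, 1]
X_test, y_test = [[0.2], [0.8]], [0, 1]

grid = SimpleGridSearch(ThresholdClassifier, {"threshold": [0.0, 0.5, 1.0]})
grid.fit(X_train, y_train)
test_score = grid.score(X_test, y_test)
print(grid.best_params_, test_score)
```

Because the search object behaves like any other estimator, the caller's code is identical to the plain estimator case.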

Slide 16

Slide 16 text

Sensible Defaults
Everything is default constructible!
    for clf in [KNeighborsClassifier(), SVC(), DecisionTreeClassifier(),
                RandomForestClassifier(), AdaBoostClassifier(), GaussianNB(),
                LDA(), QDA()]:
        clf.fit(X_train, y_train)
        print(clf.score(X_test, y_test))

Slide 17

Slide 17 text

Common Tests
    classifiers = all_estimators(type_filter='classifier')
    for name, Classifier in classifiers:
        # test classifiers can handle non-array data
        yield check_classifier_data_not_an_array, name, Classifier
        # test classifiers trained on a single label
        # always return this label
        yield check_classifiers_one_label, name, Classifier
        yield check_classifiers_classes, name, Classifier
        yield check_classifiers_pickle, name, Classifier
        yield check_estimators_partial_fit_n_features, name, Classifier
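The idea of running one shared battery of checks over every estimator can be sketched in plain Python. The two toy classifiers and the three check functions below are hypothetical stand-ins, not scikit-learn's own test helpers; they only mirror the pattern of looping over (name, class) pairs and applying each check.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy classifier: predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class FirstLabelClassifier:
    """Hypothetical toy classifier: predicts the first training label it saw."""

    def fit(self, X, y):
        self.label_ = y[0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def check_default_constructible(name, Classifier):
    # Every estimator must be creatable without arguments.
    Classifier()


def check_fit_returns_self(name, Classifier):
    clf = Classifier()
    assert clf.fit([[0], [1]], [0, 1]) is clf, name


def check_one_label(name, Classifier):
    # Trained on a single label, a classifier must always return that label.
    clf = Classifier().fit([[0], [1], [2]], [7, 7, 7])
    assert clf.predict([[3], [4]]) == [7, 7], name


CLASSIFIERS = [
    ("MajorityClassifier", MajorityClassifier),
    ("FirstLabelClassifier", FirstLabelClassifier),
]
CHECKS = [check_default_constructible, check_fit_returns_self, check_one_label]


def run_common_checks():
    """Apply every check to every classifier and record what was run."""
    results = []
    for name, Classifier in CLASSIFIERS:
        for check in CHECKS:
            check(name, Classifier)  # raises AssertionError on failure
            results.append((name, check.__name__))
    return results


results = run_common_checks()
print(len(results))
```

A new estimator added to the list is covered by all existing checks automatically, which keeps the interface guarantees uniform as the collection grows.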

Slide 18

Slide 18 text

Flat Class Hierarchy, Few Types
● Numpy arrays / sparse matrices
● Estimators
● [Cross-validation objects]
● [Scorers]

Slide 19

Slide 19 text

avoid code; avoid code rot!

Slide 20

Slide 20 text

Three-Way Documentation

Slide 21

Slide 21 text

Challenges

Slide 22

Slide 22 text

Feature Creep

Slide 23

Slide 23 text

Multi-Platform Support
● Linux / Mac / Windows / Solaris (no kidding)
● 32bit / 64bit
● Python2.6 / Python2.7 / Python3.4 / Python3.5
● GCC, Clang, MSVC
● OpenBLAS, ATLAS, Accelerate
● And we want “one click” install

Slide 24

Slide 24 text

Two Language Problem

Slide 25

Slide 25 text

Project Size

Slide 26

Slide 26 text

Backward compatibility

Slide 27

Slide 27 text

Open Questions

Slide 28

Slide 28 text

Backward compatible serialization

Slide 29

Slide 29 text

Data Structures Categorical Variables

Slide 30

Slide 30 text

Scaling and Parallelization

Slide 31

Slide 31 text

scikit-learn CRAN PyPI

Slide 32

Slide 32 text

Better Defaults Benchmarking ?

Slide 33

Slide 33 text

Plotting

Slide 34

Slide 34 text

Streaming Pipelines

Slide 35

Slide 35 text

Correctness Testing

Slide 36

Slide 36 text

Closed Questions

Slide 37

Slide 37 text

Deep Learning

Slide 38

Slide 38 text

Outlook

Slide 39

Slide 39 text

Default Grids

Slide 40

Slide 40 text

GP-based parameter optimization
From Eric Brochu, Vlad M. Cora and Nando de Freitas

Slide 41

Slide 41 text

Better OneHotEncoder

Slide 42

Slide 42 text

Better Feature Name Support

Slide 43

Slide 43 text

Efficient GridSearchCV

Slide 44

Slide 44 text

Video Series Advanced Machine Learning with scikit-learn

Slide 45

Slide 45 text

Video Series Advanced Machine Learning with scikit-learn

Slide 46

Slide 46 text

@amuellerml @amueller [email protected]