Engineering Scikit-Learn V2

Principles, challenges and lessons learned from building a machine learning library.

Andreas Mueller

April 14, 2016

Transcript

  1. Andreas Mueller
    (NYU Center for Data Science, scikit-learn)
    Engineering scikit-learn

  2. Goals
    Achievements
    Methods
    Challenges
    Open Questions
    Closed Questions (wontfix)
    Outlook

  3. Goals

  4. High quality, easy to use machine learning library.

  5. High quality, easy to use machine learning library.
    Keep it usable, keep it maintainable.

  6. Simple things should be simple,
    complex things should be possible.
    Alan Kay

  7. Non-Goals
    Non-programmatic interfaces
    Algorithm development
    Cutting edge algorithms
    Structured, active, or reinforcement learning.

  8. Achievements

  10. We’ve been using it quite a lot for music recommendations at Spotify and I think it’s the
    most well-designed ML package I’ve seen so far.
    - Spotify
    scikit-learn in one word: Awesome.
    - Machinalis
    I’m constantly recommending that more developers and scientists try scikit-learn.
    - Lovely
    The documentation is really thorough, as well, which makes the library quite easy to use.
    - OkCupid
    scikit-learn makes doing advanced analysis in Python accessible to anyone.
    - yhat

  11. Methods

  12. Scoping

  13. Scoping
    Matrix in, matrix out
    Widely useful
    Well established
    sklearn

  14. Simplicity
    est = Est()
    est.fit(X_train, y_train)
    est.score(X_test, y_test)
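
The three calls on this slide can be made concrete with any estimator. A minimal sketch, assuming a recent scikit-learn release (the `sklearn.model_selection` module postdates this talk); the choice of `LogisticRegression` and the iris data is purely illustrative and not taken from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data: any (n_samples, n_features) matrix with labels works.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Construct, fit, score: the same three steps as on the slide.
est = LogisticRegression(max_iter=1000)
est.fit(X_train, y_train)
accuracy = est.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

For classifiers, `score` returns the mean accuracy on the given test data.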

  15. Consistency
    grid = GridSearchCV(svm, param_grid)
    grid.fit(X_train, y_train)
    grid.score(X_test, y_test)
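
To show that a grid search is used exactly like a plain estimator, here is a small runnable sketch. It assumes a recent scikit-learn release; the `param_grid` values and the iris data are illustrative choices, not taken from the talk:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC()
# Illustrative grid; sensible values depend on the data set.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# The meta-estimator exposes the same fit / score interface as SVC itself.
grid = GridSearchCV(svm, param_grid, cv=5)
grid.fit(X_train, y_train)
test_score = grid.score(X_test, y_test)
print(grid.best_params_, test_score)
```

Because `GridSearchCV` is itself an estimator, it can be dropped in wherever a single model was used before.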

  16. Sensible Defaults
    Everything is default constructible!
    for clf in [KNeighborsClassifier(),
                SVC(),
                DecisionTreeClassifier(),
                RandomForestClassifier(),
                AdaBoostClassifier(),
                GaussianNB(),
                LDA(),
                QDA()]:
        clf.fit(X_train, y_train)
        print(clf.score(X_test, y_test))
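
A self-contained version of the loop above, sketched for a recent scikit-learn release. It assumes the current class names `LinearDiscriminantAnalysis` and `QuadraticDiscriminantAnalysis` in place of the older `LDA` and `QDA` shown on the slide; the iris data is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every classifier is constructed without arguments and used identically.
scores = {}
for clf in [KNeighborsClassifier(),
            SVC(),
            DecisionTreeClassifier(),
            RandomForestClassifier(),
            AdaBoostClassifier(),
            GaussianNB(),
            LinearDiscriminantAnalysis(),
            QuadraticDiscriminantAnalysis()]:
    clf.fit(X_train, y_train)
    scores[type(clf).__name__] = clf.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```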

  17. Common Tests
    classifiers = all_estimators(type_filter='classifier')
    for name, Classifier in classifiers:
        # test classifiers can handle non-array data
        yield check_classifier_data_not_an_array, name, Classifier
        # test classifiers trained on a single label
        # always return this label
        yield check_classifiers_one_label, name, Classifier
        yield check_classifiers_classes, name, Classifier
        yield check_classifiers_pickle, name, Classifier
        yield check_estimators_partial_fit_n_features, name, Classifier
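
The idea behind common tests (one check function, applied to every estimator) can be sketched without the library's internal test helpers. The two checks below, `check_pickle_roundtrip` and `check_predict_shape`, are hypothetical stand-ins written for this example and are not the library's own checks; the list of classifiers is likewise an arbitrary selection:

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def check_pickle_roundtrip(Classifier):
    """Hypothetical check: a fitted model predicts the same after pickling."""
    clf = Classifier().fit(X, y)
    restored = pickle.loads(pickle.dumps(clf))
    assert np.array_equal(clf.predict(X), restored.predict(X))


def check_predict_shape(Classifier):
    """Hypothetical check: one known label is predicted per sample."""
    clf = Classifier().fit(X, y)
    pred = clf.predict(X)
    assert pred.shape == (X.shape[0],)
    assert set(pred) <= set(clf.classes_)


CHECKS = [check_pickle_roundtrip, check_predict_shape]
CLASSIFIERS = [DecisionTreeClassifier, GaussianNB, KNeighborsClassifier]

# Run every check against every classifier and record what passed.
results = []
for Classifier in CLASSIFIERS:
    for check in CHECKS:
        check(Classifier)
        results.append((Classifier.__name__, check.__name__))

print(f"{len(results)} checks passed")
```

This only works because every estimator is default constructible and shares the same interface. The library's own suite is available to third-party estimators through `sklearn.utils.estimator_checks.check_estimator`.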

  18. Flat Class Hierarchy, Few Types

    Numpy arrays / sparse matrices

    Estimators

    [Cross-validation objects]

    [Scorers]
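
A short sketch of how these few types fit together, assuming a recent scikit-learn release. The estimator, splitter, and metric are illustrative choices; the point is that the same estimator accepts a NumPy array or a SciPy sparse matrix, and that cross-validation objects and scorers are plain arguments:

```python
from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)   # NumPy arrays
X_sparse = sparse.csr_matrix(X)     # the same data as a sparse matrix

est = LogisticRegression(max_iter=1000)                  # estimator
cv = KFold(n_splits=5, shuffle=True, random_state=0)     # cross-validation object
scorer = make_scorer(accuracy_score)                     # scorer

dense_scores = cross_val_score(est, X, y, cv=cv, scoring=scorer)
sparse_scores = cross_val_score(est, X_sparse, y, cv=cv, scoring=scorer)
print(dense_scores.mean(), sparse_scores.mean())
```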

  19. avoid code;
    avoid code rot!

  20. Three-Way Documentation

  21. Challenges

  22. Feature Creep

  23. Multi-Platform Support

    Linux / Mac / Windows / Solaris (no kidding)

    32-bit / 64-bit

    Python 2.6 / Python 2.7 / Python 3.4 / Python 3.5

    GCC, Clang, MSVC

    OpenBLAS, ATLAS, Accelerate

    And we want “one click” install

  24. Two Language Problem

  25. Project Size

  26. Backward compatibility

  27. Open Questions

  28. Backward compatible serialization

  29. Data Structures
    Categorical Variables

  30. Scaling and Parallelization

  31. scikit-learn
    CRAN
    PyPI

  32. Better Defaults
    Benchmarking?

  33. Plotting

  34. Streaming Pipelines
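
The slide gives no detail on what a streaming pipeline would look like. As context, one existing building block is incremental fitting through `partial_fit`, which some estimators support. The sketch below assumes `SGDClassifier` and a hypothetical `iter_batches` helper that stands in for a real data stream; it does not show a full pipeline:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
# Shuffle so that each mini-batch contains a mix of classes.
order = np.random.RandomState(0).permutation(len(y))
X, y = X[order], y[order]
classes = np.unique(y)  # all labels must be declared up front


def iter_batches(X, y, batch_size):
    """Hypothetical helper: yield consecutive mini-batches of the data."""
    for start in range(0, len(y), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


clf = SGDClassifier(random_state=0)
n_batches = 0
for X_batch, y_batch in iter_batches(X, y, batch_size=30):
    # Each call updates the model with one batch; earlier data is not revisited.
    clf.partial_fit(X_batch, y_batch, classes=classes)
    n_batches += 1

print(n_batches, clf.score(X, y))
```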

  35. Correctness Testing

  36. Closed Questions

  37. Deep Learning

  38. Outlook

  39. Default Grids
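
One way to picture default grids is a registry that maps each estimator class to a reasonable search space. Everything in the sketch below is an assumption made for illustration: `DEFAULT_GRIDS`, `default_grid_search`, and the parameter values are hypothetical and are not part of scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical registry: illustrative guesses, not values shipped with
# or endorsed by scikit-learn.
DEFAULT_GRIDS = {
    SVC: {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    KNeighborsClassifier: {"n_neighbors": [1, 3, 5, 11]},
}


def default_grid_search(estimator, cv=5):
    """Hypothetical helper: grid-search an estimator over its registered grid."""
    return GridSearchCV(estimator, DEFAULT_GRIDS[type(estimator)], cv=cv)


X, y = load_iris(return_X_y=True)
search = default_grid_search(KNeighborsClassifier()).fit(X, y)
print(search.best_params_)
```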

  40. GP based parameter optimization
    From Eric Brochu, Vlad M. Cora and Nando de Freitas
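
To give a feel for the technique, here is a minimal sketch of Gaussian-process-based optimization of a single parameter using expected improvement, one commonly used acquisition function. It assumes `GaussianProcessRegressor` from a recent scikit-learn release; `objective` is a toy stand-in for a validation score, and the loop illustrates the general idea rather than any planned scikit-learn API:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Toy stand-in for a validation score as a function of one parameter."""
    return -(x - 0.3) ** 2


def expected_improvement(candidates, gp, best_y, xi=0.01):
    """Expected improvement over best_y under the GP posterior (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    improvement = mu - best_y - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.RandomState(0)
candidates = np.linspace(0, 1, 101).reshape(-1, 1)

# Start from a few random evaluations of the objective.
X_obs = rng.uniform(0, 1, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Fit the surrogate model to everything observed so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6)
    gp.fit(X_obs, y_obs)
    # Evaluate the objective where expected improvement is largest.
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = X_obs[np.argmax(y_obs), 0]
print(f"best parameter found: {best_x:.2f}")
```

Compared with an exhaustive grid, the surrogate model decides where to evaluate next, which matters when each evaluation means training a model.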

  41. Better OneHotEncoder
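
For context, releases of scikit-learn after this talk let `OneHotEncoder` work directly on string categories. A minimal sketch, assuming such a release; the colour data is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with string values (illustrative data).
X = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(X).toarray()

print(enc.categories_[0])  # learned categories, in sorted order
print(encoded)

unseen = enc.transform(np.array([["purple"]], dtype=object)).toarray()
print(unseen)
```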

  42. Better Feature Name Support

  43. Efficient GridSearchCV

  44. Video Series
    Advanced Machine Learning with scikit-learn

  46. @amuellerml
    @amueller
    [email protected]
