Engineering Scikit-Learn V2

Principles, challenges and lessons learned from building a machine learning library.

Andreas Mueller

April 14, 2016

  1. Andreas Mueller (NYU Center for Data Science, scikit-learn) Engineering scikit-learn

  2. Goals, Achievements, Methods, Challenges, Open Questions, Closed Questions (wontfix), Outlook

  3. Goals

  4. High quality, easy to use machine learning library.

  5. High quality, easy to use machine learning library. Keep it usable, keep it maintainable.
  6. Simple things should be simple, complex things should be possible. - Alan Kay
  7. Non-Goals
    • Non-programmatic interfaces
    • Algorithm development
    • Cutting-edge algorithms
    • Structured, active, or reinforcement learning
  8. Achievements

  9. (no text on this slide)
  10. • We’ve been using it quite a lot for music recommendations at Spotify and I think it’s the most well-designed ML package I’ve seen so far. - Spotify
    • scikit-learn in one word: Awesome. - Machinalis
    • I’m constantly recommending that more developers and scientists try scikit-learn. - Lovely
    • The documentation is really thorough, as well, which makes the library quite easy to use. - OkCupid
    • scikit-learn makes doing advanced analysis in Python accessible to anyone. - yhat
  11. Methods

  12. Scoping

  13. Scoping
    • Matrix in, matrix out
    • Widely useful
    • Well established
    [diagram label: sklearn]

  14. Simplicity
        est = Est()
        est.fit(X_train, y_train)
        est.score(X_test, y_test)
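The fit/score convention on this slide can be illustrated without scikit-learn itself. The following is a minimal sketch, assuming a hypothetical toy estimator named `MajorityClassifier` and made-up data; it is not part of the library, it only mimics the calling pattern shown on the slide.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common training label."""

    def fit(self, X, y):
        # Learn the single most frequent label; X is accepted but unused here.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows call chaining

    def predict(self, X):
        # One prediction per input row.
        return [self.majority_ for _ in X]

    def score(self, X, y):
        # Fraction of predictions that match the true labels.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Made-up data, purely for illustration.
X_train, y_train = [[0], [1], [2], [3]], ["a", "a", "a", "b"]
X_test, y_test = [[4], [5]], ["a", "b"]

est = MajorityClassifier()
est.fit(X_train, y_train)
accuracy = est.score(X_test, y_test)
print(accuracy)
```

Any object that exposes the same three steps (construct, `fit`, `score`) can be swapped in without changing the calling code.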

  15. Consistency
        grid = GridSearchCV(svm, param_grid)
        grid.fit(X_train, y_train)
        grid.score(X_test, y_test)
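The point of this slide is that a parameter search is used exactly like the estimator it wraps. The sketch below shows that idea in plain Python, under loud assumptions: `ThresholdClassifier` and `ToyGridSearch` are hypothetical names, the search takes a class rather than an estimator instance, and candidates are scored on the training data with no cross-validation, so this is not how `GridSearchCV` is implemented.

```python
from itertools import product


class ThresholdClassifier:
    """Hypothetical toy estimator: predicts 1 when the feature exceeds `threshold`."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn; the threshold is a hyperparameter

    def predict(self, X):
        return [int(row[0] > self.threshold) for row in X]

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


class ToyGridSearch:
    """Hypothetical meta-estimator: tries every parameter combination, keeps the best."""

    def __init__(self, estimator_class, param_grid):
        self.estimator_class = estimator_class
        self.param_grid = param_grid

    def fit(self, X, y):
        names = sorted(self.param_grid)
        best_score = None
        for values in product(*(self.param_grid[n] for n in names)):
            params = dict(zip(names, values))
            candidate = self.estimator_class(**params).fit(X, y)
            # Simplification: scored on the training data, no cross-validation.
            candidate_score = candidate.score(X, y)
            if best_score is None or candidate_score > best_score:
                best_score = candidate_score
                self.best_params_ = params
                self.best_estimator_ = candidate
        return self

    def score(self, X, y):
        # Delegate to the best candidate, so the wrapper scores like an estimator.
        return self.best_estimator_.score(X, y)


# Made-up data, purely for illustration.
X_train, y_train = [[0.1], [0.4], [0.6], [0.9]], [0, 0, 1, 1]
X_test, y_test = [[0.2], [0.8]], [0, 1]

grid = ToyGridSearch(ThresholdClassifier, {"threshold": [0.0, 0.5, 1.0]})
grid.fit(X_train, y_train)
test_score = grid.score(X_test, y_test)
print(grid.best_params_, test_score)
```

Because the wrapper exposes the same `fit` and `score` methods, code written against a plain estimator works unchanged with the search object.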

  16. Sensible Defaults
    Everything is default constructible!
        for clf in [KNeighborsClassifier(), SVC(), DecisionTreeClassifier(),
                    RandomForestClassifier(), AdaBoostClassifier(), GaussianNB(),
                    LDA(), QDA()]:
            clf.fit(X_train, y_train)
            print(clf.score(X_test, y_test))
  17. Common Tests
        classifiers = all_estimators(type_filter='classifier')
        for name, Classifier in classifiers:
            # test classifiers can handle non-array data
            yield check_classifier_data_not_an_array, name, Classifier
            # test classifiers trained on a single label
            # always return this label
            yield check_classifiers_one_label, name, Classifier
            yield check_classifiers_classes, name, Classifier
            yield check_classifiers_pickle, name, Classifier
            yield check_estimators_partial_fit_n_features, name, Classifier
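The common-test pattern above (one set of checks, run against every estimator) can be sketched in a self-contained form. Everything in this example is an assumption made for illustration: the two toy classifiers, the registry `ALL_CLASSIFIERS`, and the three checks are hypothetical stand-ins, not scikit-learn's own test helpers.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy classifier: predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class FirstLabelClassifier:
    """Hypothetical toy classifier: predicts the first training label it saw."""

    def fit(self, X, y):
        self.label_ = y[0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def check_default_constructible(name, Classifier):
    # Every estimator must be usable without constructor arguments.
    Classifier()


def check_fit_returns_self(name, Classifier):
    clf = Classifier()
    assert clf.fit([[0], [1]], ["a", "b"]) is clf, name


def check_one_label(name, Classifier):
    # A classifier trained on a single label must always return that label.
    clf = Classifier().fit([[0], [1], [2]], ["a", "a", "a"])
    assert clf.predict([[5], [6]]) == ["a", "a"], name


# Hypothetical registry standing in for a discovery helper.
ALL_CLASSIFIERS = [
    ("MajorityClassifier", MajorityClassifier),
    ("FirstLabelClassifier", FirstLabelClassifier),
]
CHECKS = [check_default_constructible, check_fit_returns_self, check_one_label]


def run_common_checks():
    """Run every check against every registered classifier; return what passed."""
    passed = []
    for name, Classifier in ALL_CLASSIFIERS:
        for check in CHECKS:
            check(name, Classifier)  # raises AssertionError on failure
            passed.append((name, check.__name__))
    return passed


passed = run_common_checks()
print(len(passed), "checks passed")
```

A new estimator added to the registry is covered by all existing checks automatically, which is the appeal of the approach.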
  18. Flat Class Hierarchy, Few Types
    • NumPy arrays / sparse matrices
    • Estimators
    • [Cross-validation objects]
    • [Scorers]
  19. avoid code; avoid code rot!

  20. Three-Way Documentation

  21. Challenges

  22. Feature Creep

  23. Multi-Platform Support
    • Linux / Mac / Windows / Solaris (no kidding)
    • 32bit / 64bit
    • Python2.6 / Python2.7 / Python3.4 / Python3.5
    • GCC, Clang, MSVC
    • OpenBLAS, ATLAS, Accelerate
    • And we want “one click” install
  24. Two Language Problem

  25. Project Size

  26. Backward compatibility

  27. Open Questions

  28. Backward compatible serialization

  29. Data Structures Categorical Variables

  30. Scaling and Parallelization

  31. scikit-learn, CRAN, PyPI

  32. Better Defaults Benchmarking ?

  33. Plotting

  34. Streaming Pipelines

  35. Correctness Testing

  36. Closed Questions

  37. Deep Learning

  38. Outlook

  39. Default Grids

  40. GP-based parameter optimization (from Eric Brochu, Vlad M. Cora and Nando de Freitas)
  41. Better OneHotEncoder

  42. Better Feature Name Support
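The two preceding slides give only titles, so the following is one possible reading rather than the planned design: an encoder that accepts string categories directly and can report a name for each output column. `ToyOneHotEncoder`, its `feature_names` method, and the data are all hypothetical and are not the scikit-learn `OneHotEncoder` API.

```python
class ToyOneHotEncoder:
    """Hypothetical sketch of a one-hot encoder for string-valued columns."""

    def fit(self, X):
        n_columns = len(X[0])
        # Sorted unique values per column give a stable output column order.
        self.categories_ = [
            sorted({row[j] for row in X}) for j in range(n_columns)
        ]
        return self

    def transform(self, X):
        encoded = []
        for row in X:
            out = []
            for value, categories in zip(row, self.categories_):
                # A value unseen during fit encodes as all zeros for its column.
                out.extend(1 if value == c else 0 for c in categories)
            encoded.append(out)
        return encoded

    def feature_names(self, input_names):
        # One "column=category" name per output column, in output order.
        return [
            f"{name}={category}"
            for name, categories in zip(input_names, self.categories_)
            for category in categories
        ]


# Made-up data, purely for illustration.
X = [["red", "S"], ["blue", "M"], ["red", "M"]]

enc = ToyOneHotEncoder().fit(X)
encoded = enc.transform([["red", "S"], ["green", "M"]])
names = enc.feature_names(["color", "size"])
print(names)
print(encoded)
```

Keeping the names next to the encoded columns makes it possible to trace a model coefficient back to the original category.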

  43. Efficient GridSearchCV

  44. Video Series: Advanced Machine Learning with scikit-learn

  45. Video Series: Advanced Machine Learning with scikit-learn

  46. @amuellerml @amueller amueller@nyu.edu