Intro to scikit-learn and what's new in 0.17

Talk given at a meetup at Indeed, Tokyo in November 2015.

Olivier Grisel

November 16, 2015

Transcript

  1. Intro to scikit-learn and what’s new in 0.17
    Indeed - Tokyo, November 2015
  2. Outline
    • Machine Learning refresher
    • scikit-learn
    • Demo: interactive predictive modeling on Census Data with IPython notebook / pandas / scikit-learn
    • What’s new in scikit-learn 0.17 and what’s next
  3. Predictive modeling ~= machine learning
    • Make predictions of outcomes on new data
    • Extract the structure of historical data
    • Statistical tools to summarize the training data into an executable predictive model
    • Alternative to hard-coded rules written by experts
  4. type (category)    # rooms (int)    surface (float m2)    public trans (boolean)
     Apartment          3                50                    TRUE
     House              5                254                   FALSE
     Duplex             4                68                    TRUE
     Apartment          2                32                    TRUE
  5. Same table, with the outcome column added:
     type (category)    # rooms (int)    surface (float m2)    public trans (boolean)    sold (float k€)
     Apartment          3                50                    TRUE                      450
     House              5                254                   FALSE                     430
     Duplex             4                68                    TRUE                      712
     Apartment          2                32                    TRUE                      234
  6. Same table, annotated: the first four columns are the features, "sold" is
     the target, and the four rows are the samples (train).
  7. Two more rows with unknown targets are added as samples (test):
     Apartment          2                33                    TRUE                      ?
     House              4                210                   TRUE                      ?
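    The same toy table, sketched as a pandas DataFrame (pandas appears in the
    demo later; the variable names here are my own, not from the slides):

    import pandas as pd

    # Training samples: four feature columns and one target column.
    train = pd.DataFrame({
        "type": ["Apartment", "House", "Duplex", "Apartment"],
        "rooms": [3, 5, 4, 2],
        "surface": [50.0, 254.0, 68.0, 32.0],
        "public_trans": [True, False, True, True],
        "sold": [450.0, 430.0, 712.0, 234.0],  # target, in k€
    })

    X_train = train.drop(columns="sold")  # features
    y_train = train["sold"]               # target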
  8. Predictive Modeling Data Flow (training): training data (text docs, images,
    sounds, transactions) is turned into feature vectors which, together with
    the labels, are fed to a Machine Learning Algorithm that produces a Model.
  9. Predictive Modeling Data Flow (prediction): a new item (text doc, image,
    sound, transaction) is turned into a feature vector and passed to the
    trained Model, which outputs the expected label.
  10. Predictive modeling in the wild: inventory forecasting & trend detection,
    personalized radios, fraud detection, virality and reader engagement,
    predictive maintenance, personality matching
  11. • Library of Machine Learning algorithms
    • Focus on established methods (e.g. ESL-II)
    • Open Source (BSD)
    • Simple fit / predict / transform API
    • Python / NumPy / SciPy / Cython
    • Model Assessment, Selection & Ensembles
  12. Data flow: train data + train labels → fit → fitted model; test data →
    predict → predicted labels, compared against the test labels for evaluation.
    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)
  13. Same diagram, one step further:
    y_pred = model.predict(X_test)
  14. Same diagram, final step:
    accuracy_score(y_test, y_pred)
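    Slides 12-14 combined into a self-contained sketch; the digits dataset and
    the train/test split are illustrative stand-ins for the X / y arrays
    assumed on the slides:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation back in 0.17
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Small built-in dataset standing in for the slides' X / y.
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)            # slide 12: learn from the training set
    y_pred = model.predict(X_test)         # slide 13: predict on unseen data
    print(accuracy_score(y_test, y_pred))  # slide 14: compare with held-out labels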
  15. Support Vector Machine
    from sklearn.svm import SVC
    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
  16. Linear Classifier
    from sklearn.linear_model import SGDClassifier
    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
  17. Random Forests
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
  20. Demo time!
    http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb
    https://github.com/ogrisel/notebooks

  21. What’s new in 0.17 http://scikit-learn.org/stable/whats_new.html#version-0-17

  22. Speed Improvements credit: https://flic.kr/p/9cUpHh

  23. SAG solver
    • LogisticRegression, LogisticRegressionCV and Ridge regression with solver='sag'
    • Fastest convergence rate for large numbers of samples:
      • As good an optimizer as L-BFGS, but often faster
      • As efficient as SGD after a few epochs
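    A minimal sketch of opting into the new solver (the synthetic dataset and
    hyper-parameter values are illustrative, not from the talk):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, Ridge
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
    X = StandardScaler().fit_transform(X)  # SAG converges best on scaled features

    # Same estimator as on the earlier slides, just a different solver.
    clf = LogisticRegression(solver="sag").fit(X, y)

    # The SAG solver is also available for ridge regression.
    reg = Ridge(solver="sag")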
  24. Convergence on RCV1

  25. Test accuracy on RCV1

  26. Faster regression trees
    • Do not compute the constant part of the criterion when searching for the best split value
    • Speedup between 1.1x and 2x depending on data and CPU architecture
    • Still not as fast as XGBoost, but getting closer
  27. Barnes-Hut t-SNE
    • t-SNE: dimensionality reduction that preserves small distances
    • Very useful for visualizing high-dimensional data when PCA does not work
    • BH-t-SNE: approximate method to skip useless pairwise distance computations
    • Still work to do to cut memory usage
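    A sketch of the digits embedding benchmarked on the next slide
    (hyper-parameter values are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()  # 1797 samples x 64 features, as on the next slide

    # method="barnes_hut" uses the O(n log n) approximate gradient;
    # method="exact" is the original O(n^2) version.
    tsne = TSNE(n_components=2, method="barnes_hut", random_state=0)
    X_2d = tsne.fit_transform(digits.data)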
  28. Embedding of the 1797x64 digits dataset:
    master: 61.1s, gradient O(n²)
    new: 18.6s, gradient O(n log n)
  30. t-SNE on MNIST

  31. Preprocessing tools credit: https://flic.kr/p/aw9o87

  32. Incremental Scalers
    • Incremental fitting with the partial_fit method
    • Pre-process data that does not fit in RAM
    • MaxAbsScaler, MinMaxScaler, StandardScaler
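    A minimal out-of-core sketch, assuming the data arrives in chunks (the
    random data stands in for a stream too large to fit in RAM):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    scaler = StandardScaler()

    # First pass: accumulate mean / variance statistics one chunk at a time.
    chunks = np.array_split(rng.rand(10000, 5), 10)
    for chunk in chunks:
        scaler.partial_fit(chunk)

    # Second pass: transform each chunk with the accumulated statistics.
    scaled = [scaler.transform(chunk) for chunk in chunks]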
  33. RobustScaler
    • Centers on the median, scales on the IQR (interquartile range)
    • Ignores outliers
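    A small illustration with made-up numbers:

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

    # Centering on the median and scaling by the IQR keeps the inliers
    # on a sensible scale despite the outlier.
    X_scaled = RobustScaler().fit_transform(X)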
  34. Topic Modeling credit: https://flic.kr/p/9L4DC

  35. Latent Dirichlet Allocation
    • Probabilistic model of word counts in documents, based on topics
    • Online solver: incremental fitting on data that does not fit in RAM
    • Based on an implementation by Matt Hoffman, adapted to the scikit-learn common API
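    A minimal sketch on 20 newsgroups, matching the next slide; note the
    number-of-topics parameter was named n_topics in 0.17 and is n_components
    in current releases:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

    # LDA models raw word counts, so use CountVectorizer rather than TF-IDF.
    counts = CountVectorizer(max_features=10000, stop_words="english").fit_transform(docs)

    # learning_method="online" is the incremental solver mentioned above.
    lda = LatentDirichletAllocation(n_components=20, learning_method="online",
                                    random_state=0)
    doc_topics = lda.fit_transform(counts)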
  36. LDA topics on 20 newsgroups
    Topic #9: key chip encryption keys clipper use security public technology bit
    Topic #11: memory use video bus monitor board ground pc ram need
    Topic #14: game team games play season hockey league players bike win
    Topic #19: drive scsi disk mac problem hard card apple drives controller
  37. New solver for NMF
    • Coordinate Descent (CD) solver to replace the Projected Gradient (PG) solver
    • CD is less sensitive to the initialization scheme than PG
    • Hyper-parameters changed to make them more consistent with other scikit-learn models
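    A matching NMF sketch (hyper-parameter values are illustrative; the old
    'pg' solver was removed in later releases):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

    # NMF works well on TF-IDF features (unlike LDA, which wants raw counts).
    tfidf = TfidfVectorizer(max_features=10000, stop_words="english").fit_transform(docs)

    # solver="cd" selects the Coordinate Descent solver introduced in 0.17.
    nmf = NMF(n_components=20, solver="cd", random_state=0)
    doc_topics = nmf.fit_transform(tfidf)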
  38. NMF topics on 20 newsgroups
    Topic #15: card video monitor vga bus cards drivers color driver ram
    Topic #16: team games players year season hockey play teams nhl league
    Topic #18: jesus christ christian bible christians faith law sin church christianity
    Topic #19: encryption chip clipper government privacy law escrow algorithm enforcement secure
  39. What’s next http://scikit-learn.org/dev/whats_new.html

  40. Recently merged in master
    • Multi-layer perceptrons with Adam / SGD / L-BFGS solvers (pure NumPy, no GPU)
    • Big refactoring of Gaussian Processes
    • Anomaly detection with Isolation Forests
    • More flexible API for cross-validation splitters
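    These features shipped in the 0.18 release; a minimal sketch of two of
    them (hyper-parameter values are illustrative):

    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import IsolationForest

    # Multi-layer perceptron with the Adam solver (pure NumPy, CPU only).
    mlp = MLPClassifier(hidden_layer_sizes=(100,), solver="adam")

    # Isolation Forest: predict() returns +1 for inliers, -1 for anomalies.
    iso = IsolationForest(n_estimators=100, random_state=0)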
  41. Thank you! • http://scikit-learn.org • https://github.com/scikit-learn/scikit-learn @ogrisel

  42. Combining Models
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)
    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)
  43. Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    pipeline = make_pipeline(
        StandardScaler(),
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
  44. Scoring manually stacked models
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)
    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)
    X_test_scaled = scaler.transform(X_test)
    X_test_pca = pca.transform(X_test_scaled)
    y_pred = svm.predict(X_test_pca)
    accuracy_score(y_test, y_pred)
  45. Scoring a pipeline
    pipeline = make_pipeline(
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy_score(y_test, y_pred)
  46. Parameter search
    import numpy as np
    from sklearn.grid_search import RandomizedSearchCV
    params = {
        'randomizedpca__n_components': [5, 10, 20],
        'svc__C': np.logspace(-3, 3, 7),
        'svc__gamma': np.logspace(-6, 0, 7),
    }
    search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5)
    search.fit(X_train, y_train)
    # search.best_params_, search.grid_scores_