Intro to scikit-learn and what's new in 0.17

Talk given at a meetup at Indeed, Tokyo in November 2015.

Olivier Grisel

November 16, 2015

Transcript

  1. Outline
     • Machine Learning refresher
     • scikit-learn
     • Demo: interactive predictive modeling on Census Data with IPython notebook / pandas / scikit-learn
     • What’s new in scikit-learn 0.17 and what’s next
  2. Predictive modeling ~= machine learning
     • Make predictions of outcome on new data
     • Extract the structure of historical data
     • Statistical tools to summarize the training data into an executable predictive model
     • Alternative to hard-coded rules written by experts
  3. | type (category) | # rooms (int) | surface (float m2) | public trans (boolean) |
     | Apartment       | 3             | 50                 | TRUE                   |
     | House           | 5             | 254                | FALSE                  |
     | Duplex          | 4             | 68                 | TRUE                   |
     | Apartment       | 2             | 32                 | TRUE                   |
  4. The same table with the sale price appended:
     | type (category) | # rooms (int) | surface (float m2) | public trans (boolean) | sold (float k€) |
     | Apartment       | 3             | 50                 | TRUE                   | 450             |
     | House           | 5             | 254                | FALSE                  | 430             |
     | Duplex          | 4             | 68                 | TRUE                   | 712             |
     | Apartment       | 2             | 32                 | TRUE                   | 234             |
  5. Same table, annotated: the four left-hand columns are the features,
     "sold (float k€)" is the target, and each row is a sample (train).
  6. Same table, with two new rows appended as samples (test) whose targets
     are still unknown (a pandas sketch of this layout follows):
     | Apartment | 2 | 33  | TRUE | ? |
     | House     | 4 | 210 | TRUE | ? |
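The demo later uses pandas and scikit-learn on exactly this kind of table; below is a minimal sketch of how it maps to the X / y convention (the column names and the RandomForestRegressor choice are my own illustration, not from the slides):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Training samples: four feature columns plus the known target (sold, in k€)
    train = pd.DataFrame({
        "type": ["Apartment", "House", "Duplex", "Apartment"],
        "n_rooms": [3, 5, 4, 2],
        "surface": [50.0, 254.0, 68.0, 32.0],
        "public_trans": [True, False, True, True],
        "sold": [450.0, 430.0, 712.0, 234.0],
    })

    # One-hot encode the categorical column so every feature is numeric
    X_train = pd.get_dummies(train.drop("sold", axis=1))
    y_train = train["sold"]

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Test samples: the two rows whose target is "?"
    test = pd.DataFrame({
        "type": ["Apartment", "House"],
        "n_rooms": [2, 4],
        "surface": [33.0, 210.0],
        "public_trans": [True, True],
    })
    X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)
    print(model.predict(X_test))  # predicted sale prices for the "?" rows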
  7. Predictive Modeling Data Flow (training)
     (Diagram: training data (text docs, images, sounds, transactions) is converted
     into feature vectors; the feature vectors and their labels feed a machine
     learning algorithm, which outputs a model.)
  8. Predictive Modeling Data Flow (prediction)
     (Diagram: a new sample (text doc, image, sound, transaction) is converted into
     a feature vector and passed through the model to get the expected label; the
     training path from the previous slide is shown alongside.)
  9. Predictive modeling in the wild
     • Inventory forecasting & trend detection
     • Personalized radios
     • Fraud detection
     • Virality and reader engagement
     • Predictive maintenance
     • Personality matching
  10. • Library of Machine Learning algorithms
      • Focus on established methods (e.g. ESL-II)
      • Open Source (BSD)
      • Simple fit / predict / transform API
      • Python / NumPy / SciPy / Cython
      • Model Assessment, Selection & Ensembles
  11. (Diagram: train data and train labels feed the model; fitting produces a
      fitted model that turns test data into predicted labels, which are compared
      with the test labels for evaluation.)
      model = LogisticRegression(C=1)
      model.fit(X_train, y_train)
  12. Same diagram and code, plus the prediction step:
      y_pred = model.predict(X_test)
  13. Same diagram and code, plus the evaluation step:
      accuracy_score(y_test, y_pred)
  14. Support Vector Machine
      from sklearn.svm import SVC
      model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
      model.fit(X_train, y_train)
      y_predicted = model.predict(X_test)

      from sklearn.metrics import f1_score
      f1_score(y_test, y_predicted)
  15. Linear Classifier
      from sklearn.linear_model import SGDClassifier
      model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
      model.fit(X_train, y_train)
      y_predicted = model.predict(X_test)

      from sklearn.metrics import f1_score
      f1_score(y_test, y_predicted)
  16. Random Forests
      from sklearn.ensemble import RandomForestClassifier
      model = RandomForestClassifier(n_estimators=200)
      model.fit(X_train, y_train)
      y_predicted = model.predict(X_test)

      from sklearn.metrics import f1_score
      f1_score(y_test, y_predicted)
  17. SAG solver
      • LogisticRegression, LogisticRegressionCV and Ridge regression with solver='sag' (sketch below)
      • Fastest convergence rate for a large number of samples
      • As good an optimizer as L-BFGS, but often faster
      • As efficient as SGD after a few epochs
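A minimal sketch of opting in (X_train / y_train as in the earlier slides; the hyper-parameter values are placeholders):

    from sklearn.linear_model import LogisticRegression, Ridge

    # SAG (Stochastic Average Gradient) shines when n_samples is large;
    # like SGD, it converges fastest when features are on a similar scale
    clf = LogisticRegression(solver='sag', C=1.0)
    clf.fit(X_train, y_train)

    reg = Ridge(solver='sag', alpha=1.0)
    reg.fit(X_train, y_train)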
  18. Faster regression trees
      • Do not compute the constant part of the criterion when searching for the best split value (sketch below)
      • Speed-up between 1.1x and 2x depending on data and CPU architecture
      • Still not as fast as XGBoost, but getting closer
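The trick can be sketched in a few lines of NumPy (my own illustration, not the actual Cython code): for the MSE criterion, sum((y - mean)^2) = sum(y^2) - (sum(y))^2 / n, and the sum(y^2) term is identical for every candidate split of a node, so ranking splits only needs the cheaper (sum(y))^2 / n parts:

    import numpy as np

    def best_split_proxy(y_sorted):
        """Rank candidate splits of one node without the constant sum(y**2) term.

        Minimizing left/right MSE is equivalent to maximizing
        (sum_left)**2 / n_left + (sum_right)**2 / n_right,
        because sum(y**2) over the node does not depend on the split point.
        """
        total = y_sorted.sum()
        n = len(y_sorted)
        left_sum = np.cumsum(y_sorted)[:-1]   # left-partition sums, sizes 1..n-1
        n_left = np.arange(1, n)
        right_sum = total - left_sum
        n_right = n - n_left
        proxy = left_sum**2 / n_left + right_sum**2 / n_right
        return int(np.argmax(proxy))          # index of the best split position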
  19. Barnes-Hut t-SNE
      • t-SNE: dimensionality reduction that preserves small distances
      • Very useful for visualizing high-dimensional data when PCA does not work
      • BH-t-SNE: approximate method that skips useless pairwise distance computations (sketch below)
      • Still work to do to cut memory usage
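A minimal sketch of switching the approximation on (the digits dataset is just a stand-in for high-dimensional data):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()

    # method='barnes_hut' enables the O(n log n) approximation;
    # angle trades accuracy (lower = closer to exact) against speed
    tsne = TSNE(n_components=2, method='barnes_hut', angle=0.5)
    X_2d = tsne.fit_transform(digits.data)
    # X_2d has shape (n_samples, 2): scatter-plot it, colored by digits.target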
  20. Incremental Scalers
      • Incremental fitting with the partial_fit method (sketch below)
      • Pre-process data that does not fit in RAM
      • MaxAbsScaler, MinMaxScaler, StandardScaler
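A minimal sketch of the two-pass pattern (iter_chunks is a hypothetical generator yielding NumPy arrays from disk):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()

    # Pass 1: accumulate running mean / variance one chunk at a time
    for X_chunk in iter_chunks():      # hypothetical out-of-core chunk reader
        scaler.partial_fit(X_chunk)

    # Pass 2: transform each chunk with statistics from the whole stream
    for X_chunk in iter_chunks():
        X_scaled = scaler.transform(X_chunk)
        # ...feed X_scaled to an estimator that also supports partial_fit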
  21. Latent Dirichlet Allocation
      • Probabilistic model of word counts in documents, based on topics
      • Online solver: incremental fitting on data that does not fit in RAM (sketch below)
      • Based on an implementation by Matt Hoffman, adapted for the scikit-learn common API
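A minimal sketch of the online solver on bag-of-words counts (raw_documents is an assumed list of strings; in 0.17 the number of topics was spelled n_topics, later renamed n_components):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    vectorizer = CountVectorizer(max_features=10000, stop_words='english')
    X_counts = vectorizer.fit_transform(raw_documents)  # assumed corpus

    lda = LatentDirichletAllocation(n_topics=20, learning_method='online')
    lda.fit(X_counts)  # or lda.partial_fit(chunk) to stream mini-batches

    # Top words per topic, as printed on the next slide
    words = vectorizer.get_feature_names()
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[:-11:-1]
        print("Topic #%d: %s" % (k, " ".join(words[i] for i in top)))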
  22. LDA topics on 20 newsgroups
      Topic #9: key chip encryption keys clipper use security public technology bit
      Topic #11: memory use video bus monitor board ground pc ram need
      Topic #14: game team games play season hockey league players bike win
      Topic #19: drive scsi disk mac problem hard card apple drives controller
  23. New solver for NMF
      • Coordinate Descent solver to replace the Projected Gradient solver (sketch below)
      • CD is less sensitive to the initialization scheme than PG
      • Hyper-parameters changed to make them more consistent with other scikit-learn models
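A minimal sketch with the new solver (reusing the X_counts matrix from the LDA sketch above; solver='cd' selects Coordinate Descent):

    from sklearn.decomposition import NMF

    nmf = NMF(n_components=20, solver='cd', random_state=0)
    W = nmf.fit_transform(X_counts)  # document-topic activations
    H = nmf.components_              # topic-word weights, printable as above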
  24. NMF topics on 20 newsgroups
      Topic #15: card video monitor vga bus cards drivers color driver ram
      Topic #16: team games players year season hockey play teams nhl league
      Topic #18: jesus christ christian bible christians faith law sin church christianity
      Topic #19: encryption chip clipper government privacy law escrow algorithm enforcement secure
  25. Recently merged in master
      • Multi-layer perceptrons with Adam / SGD / L-BFGS solvers (pure NumPy, no GPU; sketch below)
      • Big refactoring of Gaussian Processes
      • Anomaly detection with Isolation Forests
      • More flexible API for cross-validation splitters
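These landed in the development branch and shipped in 0.18; a minimal sketch of the first and third items as released (hyper-parameter values are placeholders):

    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import IsolationForest

    # Multi-layer perceptron: pure NumPy, CPU-only
    mlp = MLPClassifier(hidden_layer_sizes=(100,), solver='adam')
    mlp.fit(X_train, y_train)

    # Isolation Forest: unsupervised anomaly detection;
    # predict() returns +1 for inliers, -1 for outliers
    iso = IsolationForest(n_estimators=100, random_state=0)
    iso.fit(X_train)
    flags = iso.predict(X_test)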
  26. Combining Models
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import RandomizedPCA
      from sklearn.svm import SVC

      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)

      pca = RandomizedPCA(n_components=10)
      X_train_pca = pca.fit_transform(X_train_scaled)

      svm = SVC(C=0.1, gamma=1e-3)
      svm.fit(X_train_pca, y_train)
  27. Pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import RandomizedPCA
      from sklearn.svm import SVC
      from sklearn.pipeline import make_pipeline

      pipeline = make_pipeline(
          StandardScaler(),
          RandomizedPCA(n_components=10),
          SVC(C=0.1, gamma=1e-3),
      )
      pipeline.fit(X_train, y_train)
  28. Scoring manually stacked models
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      pca = RandomizedPCA(n_components=10)
      X_train_pca = pca.fit_transform(X_train_scaled)
      svm = SVC(C=0.1, gamma=1e-3)
      svm.fit(X_train_pca, y_train)

      X_test_scaled = scaler.transform(X_test)
      X_test_pca = pca.transform(X_test_scaled)
      y_pred = svm.predict(X_test_pca)
      accuracy_score(y_test, y_pred)
  29. Scoring a pipeline
      pipeline = make_pipeline(
          RandomizedPCA(n_components=10),
          SVC(C=0.1, gamma=1e-3),
      )
      pipeline.fit(X_train, y_train)
      y_pred = pipeline.predict(X_test)
      accuracy_score(y_test, y_pred)
  30. Parameter search
      import numpy as np
      from sklearn.grid_search import RandomizedSearchCV

      params = {
          'randomizedpca__n_components': [5, 10, 20],
          'svc__C': np.logspace(-3, 3, 7),
          'svc__gamma': np.logspace(-6, 0, 7),
      }
      search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5)
      search.fit(X_train, y_train)
      # search.best_params_, search.grid_scores_
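A natural follow-up once the search has run (a sketch; note that grid_scores_ is the 0.17-era attribute, replaced by cv_results_ when model_selection superseded grid_search in 0.18):

    # Best hyper-parameters and their cross-validated score
    print(search.best_params_)
    print(search.best_score_)

    # With refit=True (the default) the best pipeline is refitted on the
    # full training set and can score held-out data directly
    print(search.score(X_test, y_test))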