
Intro to scikit-learn and what's new in 0.17


Talk given at a meetup at Indeed, Tokyo in November 2015.

Olivier Grisel

November 16, 2015


Transcript

  1. Intro to scikit-learn
    and what’s new in 0.17
    Indeed - Tokyo, November 2015


  2. Outline
    • Machine Learning refresher
    • scikit-learn
    • Demo: interactive predictive modeling on Census Data with IPython notebook / pandas / scikit-learn
    • What’s new in scikit-learn 0.17 and what’s next


  3. Predictive modeling ~= machine learning
    • Make predictions of an outcome on new data
    • Extract the structure from historical data
    • Statistical tools that summarize the training data into an executable predictive model
    • An alternative to hard-coded rules written by experts


  4. type (category) | # rooms (int) | surface (float m2) | public trans (boolean)
    Apartment | 3 | 50  | TRUE
    House     | 5 | 254 | FALSE
    Duplex    | 4 | 68  | TRUE
    Apartment | 2 | 32  | TRUE


  5. type (category) | # rooms (int) | surface (float m2) | public trans (boolean) | sold (float k€)
    Apartment | 3 | 50  | TRUE  | 450
    House     | 5 | 254 | FALSE | 430
    Duplex    | 4 | 68  | TRUE  | 712
    Apartment | 2 | 32  | TRUE  | 234


  6. type (category) | # rooms (int) | surface (float m2) | public trans (boolean) || sold (float k€)
    Apartment | 3 | 50  | TRUE  || 450
    House     | 5 | 254 | FALSE || 430
    Duplex    | 4 | 68  | TRUE  || 712
    Apartment | 2 | 32  | TRUE  || 234
    (left of ||: features; right of ||: target; rows: samples (train))


  7. type (category) | # rooms (int) | surface (float m2) | public trans (boolean) || sold (float k€)
    samples (train):
    Apartment | 3 | 50  | TRUE  || 450
    House     | 5 | 254 | FALSE || 430
    Duplex    | 4 | 68  | TRUE  || 712
    Apartment | 2 | 32  | TRUE  || 234
    samples (test):
    Apartment | 2 | 33  | TRUE  || ?
    House     | 4 | 210 | TRUE  || ?
    (left of ||: features; right of ||: target)
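    A minimal sketch of how such a table maps onto scikit-learn's (X, y) convention, assuming the rows live in a pandas DataFrame; all names below are illustrative, not part of the original deck:

    import pandas as pd

    # illustrative reconstruction of the table above
    df = pd.DataFrame({
        "type": ["Apartment", "House", "Duplex", "Apartment"],
        "n_rooms": [3, 5, 4, 2],
        "surface": [50.0, 254.0, 68.0, 32.0],
        "public_trans": [True, False, True, True],
        "sold": [450.0, 430.0, 712.0, 234.0],
    })

    # one-hot encode the categorical column so every feature is numeric
    X = pd.get_dummies(df.drop("sold", axis=1))
    y = df["sold"]  # target: sale price in k€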


  8. Predictive Modeling Data Flow (training):
    Training data (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model


  9. Predictive Modeling Data Flow (training + prediction):
    Training data (text docs, images, sounds, transactions) → Feature vectors
    Feature vectors + Labels → Machine Learning Algorithm → Model
    New data (text doc, image, sound, transaction) → Feature vector
    Feature vector → Model → Expected Label


  10. Predictive modeling in the wild:
    • Inventory forecasting & trend detection
    • Personalized radios
    • Fraud detection
    • Virality and reader engagement
    • Predictive maintenance
    • Personality matching


  11. • Library of Machine Learning algorithms
    • Focus on established methods (e.g. those covered in ESL-II, “The Elements of Statistical Learning”)
    • Open Source (BSD license)
    • Simple fit / predict / transform API
    • Python / NumPy / SciPy / Cython
    • Model assessment, selection & ensembles


  12. Train data + Train labels → fit → Fitted model
    Fitted model + Test data → Predicted labels
    Predicted labels vs. Test labels → Evaluation

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)


  13. (same data flow) The fitted model now predicts labels for the test data:

    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)


  14. (same data flow) Finally, the predicted labels are compared with the held-out test labels:

    from sklearn.metrics import accuracy_score

    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy_score(y_test, y_pred)
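    A minimal sketch of where X_train / X_test come from, assuming a feature matrix X and a target vector y have already been built (in 0.17 the helper lives in sklearn.cross_validation; later releases moved it to sklearn.model_selection):

    from sklearn.cross_validation import train_test_split

    # hold out 25% of the samples for evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)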


  15. Support Vector Machine

    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)


  16. Linear Classifier

    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import f1_score

    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)


  17. Random Forests

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)
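    The three snippets above all score a single train/test split; a common alternative is k-fold cross-validation, sketched here with the 0.17-era sklearn.cross_validation module (assuming X and y as before):

    from sklearn.cross_validation import cross_val_score

    # fit and score the model on 5 different train/test partitions
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(scores.mean(), scores.std())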


  18. (image slide)

  19. (image slide)

  20. Demo time!
    http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/sklearn_demos/Income%20classification.ipynb
    https://github.com/ogrisel/notebooks


  21. What’s new in 0.17
    http://scikit-learn.org/stable/whats_new.html#version-0-17


  22. Speed Improvements
    credit: https://flic.kr/p/9cUpHh


  23. SAG solver
    • LogisticRegression, LogisticRegressionCV and Ridge regression with solver='sag'
    • Fastest convergence rate for a large number of samples:
    • As good an optimizer as L-BFGS, but often faster
    • As efficient as SGD after a few epochs (usage sketch below)
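    A minimal usage sketch (assuming X_train / y_train as in the earlier slides):

    from sklearn.linear_model import LogisticRegression

    # solver='sag' is new in 0.17; it pays off mainly when n_samples is large
    model = LogisticRegression(solver='sag', C=1.0, max_iter=100)
    model.fit(X_train, y_train)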


  24. Convergence on RCV1


  25. Test accuracy on RCV1


  26. Faster regression trees
    • Do not recompute the constant part of the criterion when searching for the best split value
    • Speed-up between 1.1x and 2x depending on the data and CPU architecture
    • Still not as fast as XGBoost, but getting closer


  27. Barnes-Hut t-SNE
    • t-SNE: dimensionality reduction that preserves small pairwise distances
    • Very useful for visualizing high-dimensional data when PCA does not work
    • BH t-SNE: an approximate method that skips unnecessary pairwise distance computations
    • Still work to do to cut memory usage (usage sketch below)
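    A minimal usage sketch (X is assumed to be a dense high-dimensional array, e.g. the digits dataset of the next slide):

    from sklearn.manifold import TSNE

    # method='barnes_hut' uses the O(n log n) approximate gradient;
    # method='exact' keeps the original O(n²) computation
    tsne = TSNE(n_components=2, method='barnes_hut')
    X_embedded = tsne.fit_transform(X)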


  28. Embedding of the 1797x64 digits dataset:
    master: 61.1s (gradient O(n²))
    new: 18.6s (gradient O(n log n))


  29. (image slide)

  30. T-SNE on MNIST


  31. Preprocessing tools
    credit: https://flic.kr/p/aw9o87


  32. Incremental Scalers
    • Incremental fitting with the partial_fit method
    • Pre-process data that does not fit in RAM
    • MaxAbsScaler, MinMaxScaler, StandardScaler (sketch below)
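    A minimal sketch of out-of-core scaling; `chunks` and `X_new` are illustrative names for batches streamed from disk and fresh data, not scikit-learn API:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    for chunk in chunks:           # e.g. NumPy arrays read batch by batch
        scaler.partial_fit(chunk)  # update the running mean / variance
    X_scaled = scaler.transform(X_new)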


  33. RobustScaler
    • Center on the median, scale on the IQR (interquartile range)
    • Robust to outliers (example below)
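    A one-line usage sketch (X assumed as before):

    from sklearn.preprocessing import RobustScaler

    # centers each feature on its median and scales by its IQR,
    # so a few extreme values barely affect the result
    X_scaled = RobustScaler().fit_transform(X)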


  34. Topic Modeling
    credit: https://flic.kr/p/9L4DC


  35. Latent Dirichlet Allocation
    • Probabilistic model of word counts in documents, based on latent topics
    • Online solver: incremental fitting on data that does not fit in RAM
    • Based on an implementation by Matt Hoffman, adapted to the scikit-learn common API (sketch below)
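    A minimal sketch of the 0.17 API, assuming `docs` is a list of raw text documents (`n_topics` was the parameter name at the time; it was renamed in later releases):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # LDA models raw word counts, hence CountVectorizer rather than tf-idf
    X_counts = CountVectorizer(max_features=10000).fit_transform(docs)
    lda = LatentDirichletAllocation(n_topics=20, learning_method='online')
    doc_topics = lda.fit_transform(X_counts)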


  36. LDA topics on 20 newsgroups
    Topic #9: key chip encryption keys clipper use security public technology bit
    Topic #11: memory use video bus monitor board ground pc ram need
    Topic #14: game team games play season hockey league players bike win
    Topic #19: drive scsi disk mac problem hard card apple drives controller


  37. New solver for NMF
    • Coordinate Descent (CD) solver to replace the Projected Gradient (PG) solver
    • CD is less sensitive to the initialization scheme than PG
    • Hyper-parameters changed to make them more consistent with other scikit-learn models (sketch below)
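    A minimal sketch, again assuming `docs` is a list of raw text documents (NMF is usually fit on tf-idf features rather than raw counts):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    X_tfidf = TfidfVectorizer(max_features=10000).fit_transform(docs)
    nmf = NMF(n_components=20)       # coordinate descent solver in 0.17
    W = nmf.fit_transform(X_tfidf)   # document-topic weights
    H = nmf.components_              # topic-word weights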


  38. NMF topics on 20 newsgroups
    Topic #15: card video monitor vga bus cards drivers color driver ram
    Topic #16: team games players year season hockey play teams nhl league
    Topic #18: jesus christ christian bible christians faith law sin church christianity
    Topic #19: encryption chip clipper government privacy law escrow algorithm enforcement secure


  39. What’s next
    http://scikit-learn.org/dev/whats_new.html


  40. Recently merged in master
    • Multi-layer perceptrons with Adam / SGD / L-BFGS solvers (pure NumPy, no GPU)
    • Big refactoring of Gaussian Processes
    • Anomaly detection with Isolation Forests
    • More flexible API for cross-validation splitters (sketch below)
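    A sketch of the new splitter style: the data is passed to split() rather than to the constructor. Names follow the 0.18 release and may differ slightly from the development branch at the time:

    from sklearn.model_selection import StratifiedKFold

    cv = StratifiedKFold(n_splits=5)
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        print(model.score(X[test_idx], y[test_idx]))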


  41. Thank you!
    • http://scikit-learn.org
    • https://github.com/scikit-learn/scikit-learn
    @ogrisel


  42. Combining Models

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC

    # standardize, reduce dimensionality, then classify
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)
    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)


  43. Pipeline

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import RandomizedPCA
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(
        StandardScaler(),
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)


  44. Scoring manually stacked models

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    pca = RandomizedPCA(n_components=10)
    X_train_pca = pca.fit_transform(X_train_scaled)
    svm = SVC(C=0.1, gamma=1e-3)
    svm.fit(X_train_pca, y_train)

    # apply the transformations fitted on the train set to the test set
    X_test_scaled = scaler.transform(X_test)
    X_test_pca = pca.transform(X_test_scaled)
    y_pred = svm.predict(X_test_pca)
    accuracy_score(y_test, y_pred)


  45. Scoring a pipeline

    pipeline = make_pipeline(
        RandomizedPCA(n_components=10),
        SVC(C=0.1, gamma=1e-3),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy_score(y_test, y_pred)


  46. Parameter search

    import numpy as np
    from sklearn.grid_search import RandomizedSearchCV

    params = {
        'randomizedpca__n_components': [5, 10, 20],
        'svc__C': np.logspace(-3, 3, 7),
        'svc__gamma': np.logspace(-6, 0, 7),
    }
    search = RandomizedSearchCV(pipeline, params, n_iter=30, cv=5)
    search.fit(X_train, y_train)
    # search.best_params_, search.grid_scores_
