
2021-11-27 Online Learning Python Library: river

Naka Masato
November 27, 2021

Introduction of online learning Python library river.

- https://riverml.xyz/
- https://github.com/online-ml/river

Transcript

  1. Online Learning Python Library: river
    2021/11/27 Naka Masato


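  2. Self-introduction
    Name: Naka Masato (那珂将人)
    Background
    ● Developed recommendation engines as an algorithm engineer
    ● Built and maintained infrastructure platforms
    GitHub: https://github.com/nakamasato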
    Twitter: https://twitter.com/gymnstcs


  3. Content: river

    Python library for online/streaming learning https://riverml.xyz/

    2.9k stars https://github.com/online-ml/river


  4. Batch Learning
    Common steps:
    1. Loading (and preprocessing) the data
    2. Fitting a model to the data
    3. Computing the performance of the model on unseen data
    Drawbacks:
    1. Requires a lot of memory if the dataset is huge
    2. Can't elegantly learn from new data
    3. Not easy to respond to changes in the available features
    Some solutions: learn from the data in chunks or mini-batches (e.g. Dask and Spark's MLlib).
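
The chunked approach can be pictured as follows. This is a minimal sketch in plain Python, not Dask or MLlib code; `iter_chunks`, `chunked_mean`, and the toy stream are hypothetical names and data chosen for illustration. Only one chunk is held in memory at a time, and only running aggregates are kept between chunks.

```python
from itertools import islice


def iter_chunks(iterable, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, size):
    """Compute a mean by visiting one chunk at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(values, size):
        total += sum(chunk)   # aggregate the chunk, then let it go
        count += len(chunk)
    return total / count if count else 0.0


# Toy stream: a generator, so the full dataset never sits in memory.
stream = (float(i) for i in range(1, 101))
mean = chunked_mean(stream, size=10)
```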


  5. Incremental learning (online/streaming learning)
    Characteristics:
    ● The model learns from one observation at a time.
    ● Features are scaled with running statistics.
    ● Weights are updated with SGD (Stochastic Gradient Descent).
    Pros:
    ● The model can be updated seamlessly.
    ● Concept drift can be detected.
    ● Quick response to recent actions.
    Cons:
    ● Performance might not be as good as batch learning.
    ● Systems become more complex.
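
The two mechanisms named above, running statistics for scaling and one SGD step per observation, can be sketched in plain Python. This is an illustration under assumptions, not river's implementation: `RunningScaler` and `SGDLinearRegressor` are hypothetical classes, `transform_one` is a name chosen for this sketch, and the stream values are made up. The running variance uses Welford's update.

```python
class RunningScaler:
    """Hypothetical helper: standardises one feature with running statistics."""

    def __init__(self):
        self.n = 0        # observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        std = (self.m2 / self.n) ** 0.5 if self.n else 0.0
        return (x - self.mean) / std if std > 0 else 0.0


class SGDLinearRegressor:
    """Hypothetical one-feature linear model trained by SGD on squared error."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = 0.0
        self.b = 0.0

    def predict_one(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        error = self.predict_one(x) - y   # derivative of 0.5 * error**2 w.r.t. the prediction
        self.w -= self.lr * error * x
        self.b -= self.lr * error


scaler = RunningScaler()
model = SGDLinearRegressor(lr=0.1)

# Toy stream of (feature, target) pairs; the values are made up.
stream = [(10.0, 21.0), (20.0, 41.0), (30.0, 61.0), (40.0, 81.0)]
for x, y in stream:
    scaler.learn_one(x)                  # update the running statistics first
    x_scaled = scaler.transform_one(x)   # scale with what is known so far
    model.learn_one(x_scaled, y)         # one SGD step per observation
```

No pass over the full dataset is needed: both the scaler and the model keep a constant amount of state regardless of how many observations arrive.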


  6. Batch learning vs. Online Learning in Implementation
    The pipeline needs to be different.
    Batch Learning: Preprocess → Train → Predict
    Online Learning: Preprocess → Train → Preprocess → Train → Preprocess → Train → Preprocess → Train
    (preprocess and train for each observation)


  7. Main components of river
    1. Estimator
    2. Pipeline
      a. Transformer
      b. Classifier/Regressor
        i. learn_one
        ii. predict_one
    3. datasets


  8. Pipeline in river
    Pipelines allow you to chain different steps into a sequence.
    ● One or more transformers
    ● Final step: a classifier or regressor for supervised learning
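
To make the chaining idea concrete, here is a minimal sketch of a pipeline in plain Python. It is not river's `Pipeline`: `SimplePipeline`, `MaxAbsScaler`, `AddBias`, and `LinearModel` are hypothetical classes written for this example, `transform_one` is a name assumed here, and the data is made up. Transformers are applied in order, and the final step is a regressor.

```python
class MaxAbsScaler:
    """Hypothetical stateful transformer: divides by the largest absolute value seen."""

    def __init__(self):
        self.max_abs = {}

    def learn_one(self, x):
        for key, value in x.items():
            self.max_abs[key] = max(self.max_abs.get(key, 0.0), abs(value))

    def transform_one(self, x):
        return {k: (v / self.max_abs[k] if self.max_abs.get(k) else 0.0)
                for k, v in x.items()}


class AddBias:
    """Hypothetical stateless transformer: adds a constant feature."""

    def learn_one(self, x):
        pass  # nothing to learn

    def transform_one(self, x):
        return {**x, "bias": 1.0}


class LinearModel:
    """Hypothetical regressor over dict features, updated by SGD."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}

    def predict_one(self, x):
        return sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v


class SimplePipeline:
    """Chains transformers and ends with a model (the last step)."""

    def __init__(self, *steps):
        self.transformers = list(steps[:-1])
        self.model = steps[-1]

    def _transform(self, x, learn):
        for transformer in self.transformers:
            if learn:
                transformer.learn_one(x)   # update transformer state only when training
            x = transformer.transform_one(x)
        return x

    def predict_one(self, x):
        return self.model.predict_one(self._transform(x, learn=False))

    def learn_one(self, x, y):
        self.model.learn_one(self._transform(x, learn=True), y)


scaler = MaxAbsScaler()
pipeline = SimplePipeline(scaler, AddBias(), LinearModel(lr=0.5))

# Made-up data following y = 0.5 * distance + 1.
data = [({"distance": d}, 0.5 * d + 1.0) for d in (2.0, 4.0, 6.0, 8.0, 10.0)]
for _ in range(500):
    for x, y in data:
        pipeline.learn_one(x, y)

prediction = pipeline.predict_one({"distance": 6.0})
```

A design choice worth noting in this sketch: `predict_one` does not update transformer state, so scoring an unusual input leaves the scaler untouched.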


  9. Train a model
    1. Define the preprocessing steps and a model with Pipeline.
    2. For each observation in the dataset:
      a. Make a prediction.
      b. Update the metrics.
      c. Update the weights.
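
The loop above can be sketched as follows. To keep the example short, the "pipeline" is replaced by a single hypothetical `MeanRegressor` that predicts the mean of the targets seen so far, and `RunningMAE` is a hypothetical metric; neither is river's class, and the dataset is made up.

```python
class MeanRegressor:
    """Hypothetical model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class RunningMAE:
    """Hypothetical metric: mean absolute error, updated one sample at a time."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)

    def get(self):
        return self.total / self.n if self.n else 0.0


# Made-up observations: (features, target).
dataset = [({"hour": 8}, 10.0), ({"hour": 9}, 20.0), ({"hour": 10}, 30.0)]

model = MeanRegressor()
metric = RunningMAE()

for x, y in dataset:
    y_pred = model.predict_one(x)   # a. make a prediction
    metric.update(y, y_pred)        # b. update the metric
    model.learn_one(x, y)           # c. update the model
```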


  10. Progressive Validation
    progressive_val_score:
    1. Get next sample.
    2. Make a prediction.
    3. Update a running average of the error.
    4. Update the model.
    All samples can be used as a validation set.
    In some situations, leakage might happen.
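
Why the order of steps 2 to 4 matters can be shown with a small sketch. It assumes one simple form of leakage: the model sees a label before it is scored on that same sample. `LastValueRegressor`, `progressive_mae`, and `leaky_mae` are hypothetical names written for this example, not river functions, and the stream is made up.

```python
class LastValueRegressor:
    """Hypothetical model: predicts the most recent target it has learned."""

    def __init__(self):
        self.last = 0.0

    def predict_one(self, x):
        return self.last

    def learn_one(self, x, y):
        self.last = y


def progressive_mae(stream, model):
    """Predict first, score, and only then learn from the sample."""
    total, n = 0.0, 0
    for x, y in stream:
        y_pred = model.predict_one(x)   # prediction made without seeing y
        total += abs(y - y_pred)        # update the running error
        n += 1
        model.learn_one(x, y)           # update the model last
    return total / n if n else 0.0


def leaky_mae(stream, model):
    """Wrong order, for contrast: the label is learned before scoring."""
    total, n = 0.0, 0
    for x, y in stream:
        model.learn_one(x, y)           # the label leaks into the model here
        total += abs(y - model.predict_one(x))
        n += 1
    return total / n if n else 0.0


stream = [({}, 1.0), ({}, 3.0), ({}, 6.0)]
honest_score = progressive_mae(stream, LastValueRegressor())
leaky_score = leaky_mae(stream, LastValueRegressor())
```

In this toy case the leaky score is a perfect 0.0 while the honest score is 2.0, which illustrates how a leaked label can make a model look better than it is.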


  11. Example: Prediction of taxi trip duration

    The duration of a trip is only known once the taxi arrives at the desired destination.

    Instead of updating the model immediately after making a prediction, update it once the ground truth is available. This is delayed progressive validation.

    https://www.kaggle.com/c/nyc-taxi-trip-duration

    https://maxhalford.github.io/blog/online-learning-evaluation/


  12. Delayed Progressive Validation
    progressive_val_score
    1. Specify moment to get the timestamp for the observation.
    2. Specify delay as one of str, int, timedelta, or callable.
    Example: impression (moment), then click or non-click (delayed)
    The model is updated only once the delay has passed.
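
The mechanism can be sketched in plain Python. This is an illustration of the idea, not river's `progressive_val_score`: `delayed_progressive_mae` and `LastValueRegressor` are hypothetical, and the sketch simplifies by taking `moment` as a callable and `delay` as a fixed `timedelta` only. Labels wait in a queue and are handed to the model once their delay has passed.

```python
import heapq
from datetime import datetime, timedelta


class LastValueRegressor:
    """Hypothetical model: predicts the most recent target it has learned."""

    def __init__(self):
        self.last = 0.0

    def predict_one(self, x):
        return self.last

    def learn_one(self, x, y):
        self.last = y


def delayed_progressive_mae(stream, model, moment, delay):
    """Score each sample on arrival; learn from it only after `delay` has passed."""
    pending = []   # min-heap of (time the label becomes available, index, x, y)
    total, n = 0.0, 0
    for i, (x, y) in enumerate(stream):
        now = moment(x)
        # Reveal every label whose delay has passed by now.
        while pending and pending[0][0] <= now:
            _, _, old_x, old_y = heapq.heappop(pending)
            model.learn_one(old_x, old_y)
        total += abs(y - model.predict_one(x))
        n += 1
        heapq.heappush(pending, (now + delay, i, x, y))
    # Labels still pending arrive after the stream ends.
    while pending:
        _, _, old_x, old_y = heapq.heappop(pending)
        model.learn_one(old_x, old_y)
    return total / n if n else 0.0


t0 = datetime(2021, 11, 27, 12, 0)
# Made-up events: three arrive within two minutes, the fourth ten minutes in.
stream = [
    ({"time": t0}, 1.0),
    ({"time": t0 + timedelta(minutes=1)}, 3.0),
    ({"time": t0 + timedelta(minutes=2)}, 6.0),
    ({"time": t0 + timedelta(minutes=10)}, 10.0),
]
model = LastValueRegressor()
score = delayed_progressive_mae(
    stream, model, moment=lambda x: x["time"], delay=timedelta(minutes=5)
)
```

With a five-minute delay, the first three predictions are made before any label has arrived, so the model only benefits from them at the fourth event.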


  13. Challenges
    1. Deployment for production. (Integration with ML platforms)
    2. Save and load model for production.
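
For the second challenge, one possible approach is to serialise the model's state and restore it in another process. The sketch below is an assumption about how this could look, not a description of river's persistence: `MeanRegressor`, `to_json`, and `from_json` are hypothetical, and JSON is used only because the toy model's state is two numbers.

```python
import json


class MeanRegressor:
    """Hypothetical online model whose whole state is a count and a mean."""

    def __init__(self, n=0, mean=0.0):
        self.n = n
        self.mean = mean

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

    def to_json(self):
        """Serialise the learned state to a string."""
        return json.dumps({"n": self.n, "mean": self.mean})

    @classmethod
    def from_json(cls, payload):
        """Rebuild a model from a string produced by to_json."""
        state = json.loads(payload)
        return cls(n=state["n"], mean=state["mean"])


model = MeanRegressor()
for y in (10.0, 20.0):
    model.learn_one({}, y)

payload = model.to_json()                    # e.g. written to disk or object storage
restored = MeanRegressor.from_json(payload)  # e.g. loaded in a serving process
restored.learn_one({}, 30.0)                 # learning continues where it left off
```

Because an online model keeps changing, the saved copy is a snapshot: here the restored model moves on while the original still holds the state at save time.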
