2021-11-27 Online Learning Python Library: river

Naka Masato
November 27, 2021

An introduction to river, a Python library for online learning.

- https://riverml.xyz/
- https://github.com/online-ml/river


  1. Online Learning Python Library: river 2021/11/27 Naka Masato

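  2. Self-introduction

    Name: Naka Masato (那珂将人)

    Background:
    • Developed recommendation engines as an algorithm engineer
    • Built and maintained infrastructure platforms

    GitHub: https://github.com/nakamasato
    Twitter: https://twitter.com/gymnstcs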
  3. Content: river

    • Python library for online/streaming learning: https://riverml.xyz/
    • 2.9k stars: https://github.com/online-ml/river
  4. Batch Learning

    Common steps:
    1. Loading (and preprocessing) the data
    2. Fitting a model to the data
    3. Computing the performance of the model on unseen data

    Drawbacks:
    1. Requires a lot of memory if the dataset is huge
    2. Can't elegantly learn from new data
    3. Not easy to respond to changes in the available features

    Some solutions: learn the data in chunks or mini-batches (Dask and Spark's MLlib).
  5. Incremental learning (online/streaming learning)

    Characteristics:
    • The model is updated with each observation.
    • Feature scaling uses running statistics.
    • SGD (Stochastic Gradient Descent) is used to update the weights.

    Pros:
    • The model can be updated seamlessly.
    • Concept drift can be detected.
    • Quick response to recent actions.

    Cons:
    • Performance might not be as good as with batch learning.
    • Systems become more complex.
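
The two characteristics above, scaling with running statistics and updating weights with SGD, can be sketched in plain Python. This is a minimal illustration and not river's implementation: `RunningScaler`, `sgd_step`, the learning rate, and the toy stream are all hypothetical names and values chosen for the example.

```python
import math


class RunningScaler:
    """Hypothetical scaler: standardizes one feature with running statistics."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def learn_one(self, x):
        # Update mean and variance from a single observation, no history kept.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / self.n)
        return (x - self.mean) / std if std > 0 else 0.0


def sgd_step(w, b, x, y, lr=0.1):
    """One SGD update of a 1-D linear model y ~ w * x + b under squared loss."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error


scaler = RunningScaler()
w, b = 0.0, 0.0

# Toy stream (assumption): x cycles through 0..9 and y = 3x + 2.
stream = [(float(i % 10), 3.0 * (i % 10) + 2.0) for i in range(2000)]

for x, y in stream:
    scaler.learn_one(x)                  # update running statistics
    x_scaled = scaler.transform_one(x)   # scale with what is known so far
    w, b = sgd_step(w, b, x_scaled, y)   # update the weights

prediction = w * scaler.transform_one(5.0) + b
```

Each observation is seen once and then discarded, which is why memory use stays constant regardless of the stream length.
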
  6. Batch learning vs. Online Learning in Implementation

    Batch Learning: Preprocess → Train → Predict

    Online Learning: Preprocess → Train, repeated for each observation

    • The pipeline needs to be different.
  7. Main components of river

    1. Estimator
    2. Pipeline
       a. Transformer
       b. Classifier/Regressor
          i. learn_one
          ii. predict_one
    3. datasets
  8. Pipeline in river

    Pipelines allow you to chain different steps into a sequence.
    • One or more transformers
    • Final step: classifier or regressor for supervised learning
  9. Train a model

    1. Define the preprocessing steps and a model with a Pipeline.
    2. For each observation yielded by the dataset:
       a. Make a prediction.
       b. Update the metrics.
       c. Update the weights.
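
The training steps above can be sketched without any dependency. The classes below are hypothetical stand-ins that only borrow the `learn_one` / `predict_one` naming from the slides, plus an assumed `transform_one` for transformers; they are not river's classes, and the synthetic stream is made up for the example.

```python
import math
import random


class RunningStandardScaler:
    """Hypothetical transformer: standardizes dict features with running stats.

    Assumes every feature is present in every observation.
    """

    def __init__(self):
        self.n = 0
        self.mean = {}
        self.m2 = {}

    def learn_one(self, x):
        self.n += 1
        for k, v in x.items():
            delta = v - self.mean.get(k, 0.0)
            self.mean[k] = self.mean.get(k, 0.0) + delta / self.n
            self.m2[k] = self.m2.get(k, 0.0) + delta * (v - self.mean[k])

    def transform_one(self, x):
        out = {}
        for k, v in x.items():
            var = self.m2.get(k, 0.0) / self.n if self.n else 0.0
            out[k] = (v - self.mean.get(k, 0.0)) / math.sqrt(var) if var > 0 else 0.0
        return out


class SGDLogisticRegression:
    """Hypothetical classifier trained with one SGD step per observation."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = {}
        self.b = 0.0

    def predict_proba_one(self, x):
        z = self.b + sum(self.w.get(k, 0.0) * v for k, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predict_one(self, x):
        return self.predict_proba_one(x) >= 0.5

    def learn_one(self, x, y):
        grad = self.predict_proba_one(x) - float(y)  # gradient of the log loss
        for k, v in x.items():
            self.w[k] = self.w.get(k, 0.0) - self.lr * grad * v
        self.b -= self.lr * grad


class Pipeline:
    """Hypothetical pipeline: transformers first, a model as the final step."""

    def __init__(self, *steps):
        self.transformers = list(steps[:-1])
        self.model = steps[-1]

    def predict_one(self, x):
        for t in self.transformers:
            x = t.transform_one(x)
        return self.model.predict_one(x)

    def learn_one(self, x, y):
        for t in self.transformers:
            t.learn_one(x)
            x = t.transform_one(x)
        self.model.learn_one(x, y)


# 1. Define the preprocessing step and the model with a pipeline.
model = Pipeline(RunningStandardScaler(), SGDLogisticRegression())

# Synthetic stream (assumption): the label depends only on feature "a".
rng = random.Random(0)
stream = []
for _ in range(1000):
    a = rng.uniform(0, 100)
    stream.append(({"a": a, "b": rng.uniform(0, 1)}, a > 50))

# 2. For each observation: predict, update the metric, update the weights.
correct = total = 0
for x, y in stream:
    y_pred = model.predict_one(x)   # a. make a prediction
    correct += int(y_pred == y)     # b. update the metric (accuracy)
    total += 1
    model.learn_one(x, y)           # c. update the weights

accuracy = correct / total
```

Predicting before learning means every prediction is made on an observation the model has not yet seen.
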
  10. Progressive Validation

    progressive_val_score:
    1. Get the next sample.
    2. Make a prediction.
    3. Update a running average of the error.
    4. Update the model.

    All samples can be used as a validation set.
    In some situations, leakage might happen (for example, when the true label only becomes available later).
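
The four steps above can be sketched as a small loop. The function `progressive_validation` and the toy `RunningMeanRegressor` are hypothetical names written for this example; they illustrate the idea and are not the code behind `progressive_val_score`.

```python
class RunningMeanRegressor:
    """Toy model (assumption): predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def progressive_validation(stream, model):
    """Return the running mean absolute error over the whole stream."""
    n = 0
    mae = 0.0
    for x, y in stream:                      # 1. get the next sample
        y_pred = model.predict_one(x)        # 2. make a prediction
        n += 1
        mae += (abs(y - y_pred) - mae) / n   # 3. update the running average of the error
        model.learn_one(x, y)                # 4. update the model
    return mae


# Toy stream (assumption): the target is always 10.0.
stream = [({"hour": h}, 10.0) for h in range(5)]
score = progressive_validation(stream, RunningMeanRegressor())
```

Only the first prediction is wrong here (the model starts at 0.0), so the error of 10 is averaged over five samples.
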
  11. Example: Prediction of taxi trip duration

    • The duration of a trip is only known once the taxi arrives at its destination.
    → Instead of updating the model immediately after making a prediction, update it once the ground truth is available: delayed progressive validation.
    • https://www.kaggle.com/c/nyc-taxi-trip-duration
    • https://maxhalford.github.io/blog/online-learning-evaluation/
  12. Delayed Progressive Validation

    progressive_val_score:
    1. Specify moment: how to get the timestamp of each observation.
    2. Specify delay: one of str, int, timedelta, or callable.

    Example: impression (moment) → click or non-click (delayed)
    → The model is updated only once the delay has passed.
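
The effect of `moment` and `delay` can be illustrated with a simplified simulation. This sketch is an assumption-laden stand-in, not the library's implementation: `delayed_progressive_validation` is a hypothetical function, `moment` is taken to be the name of a numeric timestamp feature, and `delay` is restricted to a plain number.

```python
import heapq


class RunningMeanRegressor:
    """Toy model (assumption): predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def delayed_progressive_validation(stream, model, moment, delay):
    """Predict at x[moment]; reveal the label and learn at x[moment] + delay."""
    pending = []  # min-heap of (reveal_time, seq, x, y, y_pred)
    n = 0
    mae = 0.0

    def reveal(until):
        # Release every label whose delay has passed by time `until`.
        nonlocal n, mae
        while pending and pending[0][0] <= until:
            _, _, x, y, y_pred = heapq.heappop(pending)
            n += 1
            mae += (abs(y - y_pred) - mae) / n
            model.learn_one(x, y)

    for seq, (x, y) in enumerate(stream):
        now = x[moment]
        reveal(now)                     # learn from labels that are now known
        y_pred = model.predict_one(x)   # predict without seeing this label
        heapq.heappush(pending, (now + delay, seq, x, y, y_pred))

    reveal(float("inf"))  # flush the labels still pending at the end
    return mae


# Toy stream (assumption): one observation per time unit, target always 10.0.
stream = [({"ts": t}, 10.0) for t in range(5)]
score = delayed_progressive_validation(
    stream, RunningMeanRegressor(), moment="ts", delay=2
)
```

With a delay of 2, the first two predictions are made before any label has arrived, so the error is higher than it would be if the model were updated immediately after each prediction.
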
  13. Challenges

    1. Deployment to production (integration with ML platforms).
    2. Saving and loading models in production.