2. Fitting a model to the data
3. Computing the performance of the model on unseen data

Drawbacks:
1. Requires a lot of memory if the dataset is huge
2. Can't elegantly learn from new data
3. Not easy to respond to changes in the available features

Some solutions: learn the data in chunks or mini-batches (e.g. Dask and Spark's MLlib); see the sketch below.
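A minimal sketch of the chunked workaround, using scikit-learn's partial_fit together with pandas chunked reading; the file path and column name are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = [0, 1]  # partial_fit must see the full set of classes up front

# Stream the file in chunks so the whole dataset never sits in memory.
for chunk in pd.read_csv("big_dataset.csv", chunksize=10_000):
    X = chunk.drop(columns="target")
    y = chunk["target"]
    clf.partial_fit(X, y, classes=classes)
```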
each observation.
• Feature scaling (running statistics).
• SGD (Stochastic Gradient Descent) to update weights (see the sketch below).

Pros
• The model can be updated seamlessly.
• Concept drift can be detected.
• Quick response to recent actions.

Cons
• Performance might not be as good as batch learning.
• Systems become more complex.
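A minimal sketch of these two ingredients on a stream of (features, target) pairs: running mean/variance via Welford's update, weights via a plain SGD step. All names here are illustrative:

```python
import numpy as np

class OnlineLinearRegression:
    """Scale features with running statistics, then update weights via SGD."""

    def __init__(self, n_features, lr=0.01):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)  # sum of squared deviations (Welford)
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def _scale(self, x):
        # Update the running mean/variance, then standardize the observation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = np.sqrt(self.m2 / self.n)
        return (x - self.mean) / np.where(std > 0, std, 1.0)

    def learn_one(self, x, y):
        z = self._scale(np.asarray(x, dtype=float))
        error = (self.w @ z + self.b) - y
        # One SGD step on the squared error.
        self.w -= self.lr * error * z
        self.b -= self.lr * error
```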
Learning:
• The pipeline needs to be different: preprocess and train for each observation.
[Diagram: batch = a single Preprocess → Train → Predict pass; online = Preprocess → Train repeated for every observation]
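For instance, with the River library such a per-observation pipeline can be sketched as follows (dataset and model choices are illustrative):

```python
from river import datasets, linear_model, preprocessing

# Scaler and model are chained; both are updated one observation at a time.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)  # predict with the pipeline's current state
    model.learn_one(x, y)          # then preprocess and train on this observation
```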
prediction.
3. Update a running average of the error.
4. Update the model.

All samples can be used as a validation set (see the sketch below). In some situations, leakage might happen.
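This loop is available, for example, as evaluate.progressive_val_score in River; a minimal sketch (model and metric choices are illustrative):

```python
from river import datasets, evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

# For each observation: predict, update the metric, then learn.
evaluate.progressive_val_score(datasets.Phishing(), model, metric)
print(metric)
```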
of the trip, is only known once the taxi arrives at the desired destination.
→ Instead of updating the model immediately after making a prediction, update it once the ground truth is available.

Delayed progressive validation
• https://www.kaggle.com/c/nyc-taxi-trip-duration
• https://maxhalford.github.io/blog/online-learning-evaluation/
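A hand-rolled sketch of that idea for the taxi example, assuming a River-style model (predict_one/learn_one) and a stream sorted by pickup time; every name below is illustrative:

```python
import heapq
import itertools

def delayed_progressive_val(stream, model, metric):
    """Predict at pickup time; score and learn only once the trip has ended."""
    pending = []                  # min-heap keyed on when the label becomes known
    tiebreak = itertools.count()  # keeps heap comparisons away from the dicts
    for pickup, dropoff, x, duration in stream:
        # Release every trip that finished before this pickup.
        while pending and pending[0][0] <= pickup:
            _, _, y_true, y_pred, x_old = heapq.heappop(pending)
            metric.update(y_true, y_pred)   # score the earlier prediction
            model.learn_one(x_old, y_true)  # the ground truth is now available
        y_pred = model.predict_one(x)       # predict without knowing the duration
        heapq.heappush(pending, (dropoff, next(tiebreak), duration, y_pred, x))
    return metric
```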
timestamp for the observation.
2. Specify delay as one of str, int, timedelta, or callable.

Example: impression (moment) → click or non-click (delayed)
→ The model will be updated only once the delay has passed.
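These two knobs map onto the moment and delay arguments of River's evaluate.progressive_val_score. A sketch for the impression → click example; the synthetic stream, the feature names, and the one-hour delay are all assumptions:

```python
import datetime as dt
from river import compose, evaluate, linear_model, metrics, preprocessing

# Synthetic impression stream: x carries the impression timestamp,
# y is whether the impression eventually led to a click.
t0 = dt.datetime(2024, 1, 1)
stream = [
    ({"ts": t0 + dt.timedelta(minutes=i), "bid": float(i % 3)}, i % 2 == 0)
    for i in range(100)
]

model = (
    compose.Select("bid")              # drop the timestamp before scaling
    | preprocessing.StandardScaler()
    | linear_model.LogisticRegression()
)

metric = evaluate.progressive_val_score(
    dataset=stream,
    model=model,
    metric=metrics.Accuracy(),
    moment="ts",                  # 1. the timestamp for the observation
    delay=dt.timedelta(hours=1),  # 2. one of str, int, timedelta, or callable
)
print(metric)
```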