Slide 1

Slide 1 text

Online Learning Python Library: river 2021/11/27 Naka Masato

Slide 2

Slide 2 text

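Self-introduction
Name: Naka Masato
Background:
● Developed recommendation engines as an algorithm engineer
● Built and maintained infrastructure platforms
GitHub: https://github.com/nakamasato
Twitter: https://twitter.com/gymnstcs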

Slide 3

Slide 3 text

Content: river
● Python library for online/streaming learning: https://riverml.xyz/
● 2.9k stars on GitHub: https://github.com/online-ml/river

Slide 4

Slide 4 text

Batch Learning
Common steps:
1. Load (and preprocess) the data.
2. Fit a model to the data.
3. Compute the performance of the model on unseen data.
Drawbacks:
1. Requires a lot of memory if the dataset is huge.
2. Can't elegantly learn from new data.
3. Can't easily respond to changes in the available features.
Some solutions: learn from the data in chunks or mini-batches (e.g. Dask and Spark's MLlib).
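The chunk/mini-batch workaround can be sketched in plain Python: read the data in fixed-size chunks and fold each chunk into a running aggregate, so the full dataset never has to sit in memory. This is only an illustration of the idea. The helper names `iter_chunks` and `chunked_mean` are hypothetical, a mean stands in for a real model update, and Dask and MLlib have their own mechanisms that are not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` items from `rows` (hypothetical helper)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def chunked_mean(rows: Iterable[float], size: int = 3) -> float:
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(rows, size):
        total += sum(chunk)   # fold the chunk into the running aggregate
        count += len(chunk)
    return total / count if count else 0.0


# A generator: the values are produced lazily, never all at once.
stream = (float(i) for i in range(1, 11))
result = chunked_mean(stream, size=3)
```

Chunking lowers peak memory, but the model is still refit or updated per chunk rather than per observation, which is the gap incremental learning addresses.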

Slide 5

Slide 5 text

Incremental learning (online/streaming learning)
Characteristics:
● Update the model with each observation, one at a time.
● Feature scaling uses running statistics.
● SGD (Stochastic Gradient Descent) updates the weights.
Pros:
● The model can be updated seamlessly.
● Concept drift can be detected.
● Quick response to recent actions.
Cons:
● Performance might not be as good as batch learning.
● Systems become more complex.
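A minimal plain-Python sketch of the two characteristics above: scaling with running statistics (Welford's algorithm for mean and variance) and one SGD step per observation. The class names `RunningScaler` and `SGDRegressor1D`, the learning rate, and the synthetic data are assumptions made for illustration. This is not river's implementation.

```python
import math


class RunningScaler:
    """Standardize one feature with a running mean/variance (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x: float) -> float:
        var = self.m2 / self.n if self.n > 0 else 0.0
        return (x - self.mean) / math.sqrt(var) if var > 0 else 0.0


class SGDRegressor1D:
    """Model y = w * x + b, updated by one SGD step on squared error per observation."""

    def __init__(self, lr: float = 0.1) -> None:
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x: float) -> float:
        return self.w * x + self.b

    def learn_one(self, x: float, y: float) -> None:
        error = self.predict_one(x) - y  # derivative of 0.5 * error**2 w.r.t. the prediction
        self.w -= self.lr * error * x
        self.b -= self.lr * error


scaler = RunningScaler()
model = SGDRegressor1D(lr=0.1)
for i in range(2000):
    x = float(i % 10)        # raw feature cycling through 0..9
    y = 3.0 * x + 1.0        # noiseless synthetic target
    scaler.learn_one(x)      # update the running statistics first
    model.learn_one(scaler.transform_one(x), y)

prediction = model.predict_one(scaler.transform_one(5.0))  # true value is 16.0
```

Neither object ever stores past observations: the scaler keeps three numbers and the model keeps two, whatever the length of the stream.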

Slide 6

Slide 6 text

Batch Learning vs. Online Learning in Implementation
● The pipeline needs to be different.
Batch learning: Preprocess → Train → Predict
Online learning: Preprocess → Train, repeated for each observation

Slide 7

Slide 7 text

Main components of river
1. Estimator
2. Pipeline
   a. Transformer
   b. Classifier/Regressor
      i. learn_one
      ii. predict_one
3. datasets

Slide 8

Slide 8 text

Pipeline in river
Pipelines allow you to chain different steps into a sequence:
● One or more transformers
● Final step: a classifier or regressor for supervised learning
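To make the chaining idea concrete, here is a plain-Python sketch of a pipeline with one transformer and a final regressor. It borrows the `learn_one`/`predict_one` naming from the slides, but the classes `MaxAbsScaler`, `LinearRegressor`, and `Pipeline` below are illustrative stand-ins written for this example, and the choice to update transformers only inside `learn_one` is an assumption. It does not reproduce river's own Pipeline.

```python
from typing import Dict

Features = Dict[str, float]


class MaxAbsScaler:
    """Transformer: divide each feature by the largest absolute value seen so far."""

    def __init__(self) -> None:
        self.max_abs: Dict[str, float] = {}

    def learn_one(self, x: Features) -> None:
        for name, value in x.items():
            self.max_abs[name] = max(self.max_abs.get(name, 0.0), abs(value))

    def transform_one(self, x: Features) -> Features:
        return {k: (v / self.max_abs[k] if self.max_abs.get(k) else 0.0)
                for k, v in x.items()}


class LinearRegressor:
    """Final step: linear model trained with one SGD step per observation."""

    def __init__(self, lr: float = 0.5) -> None:
        self.weights: Dict[str, float] = {}
        self.bias = 0.0
        self.lr = lr

    def predict_one(self, x: Features) -> float:
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x: Features, y: float) -> None:
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


class Pipeline:
    """Chain zero or more transformers and end with a supervised model."""

    def __init__(self, *steps) -> None:
        self.transformers = list(steps[:-1])
        self.model = steps[-1]

    def _transform(self, x: Features, learn: bool) -> Features:
        for transformer in self.transformers:
            if learn:
                transformer.learn_one(x)  # update running statistics
            x = transformer.transform_one(x)
        return x

    def predict_one(self, x: Features) -> float:
        return self.model.predict_one(self._transform(x, learn=False))

    def learn_one(self, x: Features, y: float) -> None:
        self.model.learn_one(self._transform(x, learn=True), y)


pipeline = Pipeline(MaxAbsScaler(), LinearRegressor(lr=0.5))
for i in range(2000):
    x = {"distance_km": float(i % 20)}
    pipeline.learn_one(x, y=2.0 * x["distance_km"])  # synthetic target

estimate = pipeline.predict_one({"distance_km": 10.0})  # true value is 20.0
```

The caller only ever talks to the pipeline, so preprocessing and model stay in step with each other for every observation.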

Slide 9

Slide 9 text

Train a model
1. Define the preprocessing steps and a model with Pipeline.
2. For each observation in the dataset:
   a. Make a prediction.
   b. Update the metrics.
   c. Update the model weights.
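The loop in step 2 can be sketched as follows, using a small hand-written logistic regression and an accuracy counter on a synthetic stream. Preprocessing is omitted to keep the loop visible, and every name here (`LogisticRegression`, `Accuracy`, `synthetic_stream`) is defined in the snippet for illustration rather than imported from river.

```python
import math
import random


class LogisticRegression:
    """Binary classifier trained with one SGD step on log loss per observation."""

    def __init__(self, lr: float = 0.5) -> None:
        self.weights = {}
        self.bias = 0.0
        self.lr = lr

    def predict_proba_one(self, x) -> float:
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predict_one(self, x) -> bool:
        return self.predict_proba_one(x) >= 0.5

    def learn_one(self, x, y: bool) -> None:
        error = self.predict_proba_one(x) - float(y)  # gradient of log loss w.r.t. z
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


class Accuracy:
    """Running fraction of correct predictions."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def update(self, y_true: bool, y_pred: bool) -> None:
        self.correct += int(y_true == y_pred)
        self.total += 1

    def get(self) -> float:
        return self.correct / self.total if self.total else 0.0


def synthetic_stream(n: int, seed: int = 42):
    """Yield (features, label) pairs; the label is True when a + b > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        yield x, (x["a"] + x["b"] > 0)


model = LogisticRegression(lr=0.5)
metric = Accuracy()
for x, y in synthetic_stream(1000):
    y_pred = model.predict_one(x)  # a. make a prediction
    metric.update(y, y_pred)       # b. update the metric
    model.learn_one(x, y)          # c. update the weights

accuracy = metric.get()
```

The order matters: the prediction is made before the model sees the label, so the metric reflects performance on data the model has not yet learned from.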

Slide 10

Slide 10 text

Progressive Validation
progressive_val_score:
1. Get the next sample.
2. Make a prediction.
3. Update a running average of the error.
4. Update the model.
All samples can be used as a validation set. In some situations, leakage might happen (for example, when the true label only becomes known some time after the prediction).
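The four steps above, written out as a small function. This is a sketch of the idea only: `progressive_val_mae` and the toy `RunningMean` model are hypothetical names defined here, the metric is fixed to mean absolute error, and the function does not mirror the signature of river's `progressive_val_score`.

```python
class RunningMean:
    """Toy regressor: predicts the mean of the targets seen so far."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x) -> float:
        return self.mean

    def learn_one(self, x, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n


def progressive_val_mae(dataset, model) -> float:
    """Score each sample before learning from it; return the running MAE."""
    n, mae = 0, 0.0
    for x, y in dataset:                     # 1. get the next sample
        y_pred = model.predict_one(x)        # 2. make a prediction
        n += 1
        mae += (abs(y - y_pred) - mae) / n   # 3. update the running average of the error
        model.learn_one(x, y)                # 4. update the model
    return mae


dataset = [({}, 10.0), ({}, 20.0), ({}, 30.0)]
score = progressive_val_mae(dataset, RunningMean())
# Predictions are 0, 10, 15, so the absolute errors are 10, 10, 15.
```

Every sample serves as validation data exactly once and as training data exactly once, so no separate hold-out set is needed.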

Slide 11

Slide 11 text

Example: Prediction of taxi trip duration
● The duration of a trip is only known once the taxi arrives at its destination.
→ Instead of updating the model immediately after making a prediction, update it once the ground truth is available: delayed progressive validation.
● https://www.kaggle.com/c/nyc-taxi-trip-duration
● https://maxhalford.github.io/blog/online-learning-evaluation/

Slide 12

Slide 12 text

Delayed Progressive Validation
progressive_val_score:
1. Specify moment to get the timestamp of each observation.
2. Specify delay as one of str, int, timedelta, or callable.
Example: impression (moment) → click or non-click (delayed)
→ The model is updated with an observation only once its delay has elapsed.
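One way to picture the mechanism is a queue of pending labels: each prediction is stored together with the time at which its ground truth becomes available, and the model learns from it only once the stream's clock has passed that time. The sketch below assumes `moment` is a key into the feature dict and `delay` is a fixed `timedelta`; it does not handle the str, int, or callable forms, and `delayed_progressive_val_mae` and `RunningMean` are hypothetical names, not river's code.

```python
import heapq
from datetime import datetime, timedelta


class RunningMean:
    """Toy regressor: predicts the mean of the targets seen so far."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x) -> float:
        return self.mean

    def learn_one(self, x, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n


def delayed_progressive_val_mae(dataset, model, moment: str, delay: timedelta) -> float:
    """Progressive validation where each label is revealed `delay` after its moment."""
    pending = []  # min-heap of (reveal_time, index, x, y, y_pred)
    n, mae = 0, 0.0

    def reveal(entry) -> None:
        nonlocal n, mae
        _, _, x, y, y_pred = entry
        n += 1
        mae += (abs(y - y_pred) - mae) / n  # score the earlier prediction
        model.learn_one(x, y)               # only now may the model learn

    for i, (x, y) in enumerate(dataset):
        now = x[moment]
        while pending and pending[0][0] <= now:  # labels whose delay has elapsed
            reveal(heapq.heappop(pending))
        y_pred = model.predict_one(x)
        heapq.heappush(pending, (now + delay, i, x, y, y_pred))

    while pending:  # end of stream: reveal whatever is still waiting
        reveal(heapq.heappop(pending))
    return mae


t0 = datetime(2021, 11, 27, 9, 0)
trips = [
    ({"pickup": t0}, 10.0),
    ({"pickup": t0 + timedelta(minutes=1)}, 20.0),
    ({"pickup": t0 + timedelta(minutes=2)}, 30.0),
    ({"pickup": t0 + timedelta(minutes=10)}, 40.0),
]
score = delayed_progressive_val_mae(trips, RunningMean(), "pickup", timedelta(minutes=5))
```

On this toy stream the first three predictions are all made before any label arrives, so the delayed score (20.0) is worse than what plain progressive validation would report on the same data (13.75). The gap is the optimism that leakage would otherwise introduce.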

Slide 13

Slide 13 text

Challenges
1. Deployment to production (integration with ML platforms).
2. Saving and loading models for production.
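For the second challenge, the essential requirement is that a restored model resumes learning from exactly where the saved one stopped. A minimal sketch, assuming the model's state is a handful of numbers that can be written as JSON: the `to_json`/`from_json` methods and the `RunningMean` model are hypothetical and defined here for illustration. Python's `pickle` module is another common route for serializing objects.

```python
import json


class RunningMean:
    """Toy online model whose whole state is a count and a mean."""

    def __init__(self, n: int = 0, mean: float = 0.0) -> None:
        self.n = n
        self.mean = mean

    def predict_one(self, x) -> float:
        return self.mean

    def learn_one(self, x, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n

    def to_json(self) -> str:
        """Serialize the learned state (hypothetical helper)."""
        return json.dumps({"n": self.n, "mean": self.mean})

    @classmethod
    def from_json(cls, payload: str) -> "RunningMean":
        """Rebuild a model from a saved state (hypothetical helper)."""
        return cls(**json.loads(payload))


model = RunningMean()
for y in (10.0, 20.0, 30.0):
    model.learn_one({}, y)

saved = model.to_json()                  # e.g. write this string to a file or object store
restored = RunningMean.from_json(saved)  # later, e.g. in the serving process
restored.learn_one({}, 40.0)             # learning continues from the saved state
```

Because an online model keeps changing after deployment, saving is not a one-off export step: the state has to be persisted repeatedly, which is part of what makes production integration harder than in the batch case.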