Managing ML pipelines with Airflow

2019 DevDay
Managing ML Pipelines With Airflow
> Khalid Huseynov
> LINE Plus AI Services Lab, Data Scientist/Software Engineer

Agenda
> Introduction & architecture
> ML pipeline & DAG scheduling
> Notifications
> Monitoring training & inference stages

Introduction & architecture

Workflow as code
> Tasks
> DAG (Directed Acyclic Graph) of tasks
> Setting dependencies
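To make the "DAG of tasks" idea concrete, here is a minimal sketch in plain Python (standard library only, not Airflow itself). The task names are hypothetical, borrowed from the script names used later in the deck. In Airflow, the same dependencies are typically declared between operators with `>>` or `set_upstream` / `set_downstream`; the sketch only shows how a dependency mapping determines a valid run order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical tasks: each task maps to the set of tasks it depends on.
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "predict": {"train"},
    "validate": {"predict"},
}


def execution_order(deps):
    """Return one valid run order for the DAG.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because this example is a simple chain, there is only one valid order; with parallel branches, several orders would be valid and a scheduler is free to run independent tasks concurrently.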

Airflow multi-node architecture and HA
[Diagram: Master node (Active) and Master node (Standby), each with Airflow web server and Airflow scheduler; RabbitMQ queue; SQL Metastore; Slave node 1, Slave node 2, …, each with a Celery worker; Ansible server: Deploy DAGs/Apps]
> Deployment of DAGs and apps is done through Ansible
> To provide HA
• Health check of active server is monitored
• If active server is down, then standby server is activated
• Backup metastore and queues are also maintained

ML pipeline & DAG scheduling

Recommendation pipeline in use
[Diagram, System view and DAG view: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output]

Multiple environments in one DAG
[Diagram: Branch operator splitting the DAG by phase]
> prod phase: Common part, Flow for prod environment
> beta phase: Common part, Flow for beta environment
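The branching step could be sketched as follows. Airflow's `BranchPythonOperator` runs a Python callable and follows only the task whose `task_id` the callable returns, skipping the other branches. The function below is such a callable in isolation; the phase names come from the slide, while the `task_id` strings are hypothetical.

```python
# Hypothetical task_ids of the first task in each environment-specific flow.
FLOW_ENTRY_TASKS = {
    "prod": "prod_flow_start",
    "beta": "beta_flow_start",
}


def choose_flow(phase):
    """Return the task_id of the branch to follow for the given phase.

    A branch operator would call this after the common part of the DAG
    and skip every branch whose task_id is not returned.
    """
    if phase not in FLOW_ENTRY_TASKS:
        raise ValueError(f"unknown phase: {phase!r}")
    return FLOW_ENTRY_TASKS[phase]


print(choose_flow("prod"))
print(choose_flow("beta"))
```

Failing loudly on an unknown phase is a deliberate choice in this sketch: silently defaulting to one environment could send beta output to prod.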

Frequent usage of dates
./preprocess.sh 20191024 …
./train.sh 20191024 …
./predict.sh 20191024 …
[Diagram: recommendation pipeline stages, Generating input tables through Stream output]
> Airflow facilitates date/time usage and statelessness of tasks through Jinja templates

Dealing with dates through templates
./train.sh 2019-10-24
./train.sh 20191024
./train.sh 24-10-2019
./train.sh 2019-10-21
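The template expressions themselves did not survive in this transcript. As a hedged guess, the four commands above match what Airflow's built-in `{{ ds }}`, `{{ ds_nodash }}`, `macros.ds_format` and `macros.ds_add` would render for an execution date of 2019-10-24. The sketch below reproduces those four renderings with the standard library only; the helper functions are stand-ins written for illustration, not Airflow's own code.

```python
from datetime import datetime, timedelta


def ds_nodash(ds):
    """'2019-10-24' -> '20191024' (what {{ ds_nodash }} renders)."""
    return ds.replace("-", "")


def ds_format(ds, input_format, output_format):
    """Re-format a date string, in the spirit of macros.ds_format."""
    return datetime.strptime(ds, input_format).strftime(output_format)


def ds_add(ds, days):
    """Shift a YYYY-MM-DD date string by a number of days, like macros.ds_add."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2019-10-24"  # what {{ ds }} renders for this run
commands = [
    f"./train.sh {ds}",
    f"./train.sh {ds_nodash(ds)}",
    f"./train.sh {ds_format(ds, '%Y-%m-%d', '%d-%m-%Y')}",
    f"./train.sh {ds_add(ds, -3)}",
]
for command in commands:
    print(command)
```

Deriving every date from the run's own execution date, rather than from the wall clock, is what keeps tasks stateless: re-running an old DAG run produces the same commands.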

DAG scheduling details
• first run: 2019/11/21 at 10:00 (*JST)
• execution_date: 2019-11-20 01:00:00 (UTC)
• ds: 2019-11-20
> The Airflow scheduler triggers the task soon after start_date + schedule_interval has passed
• first run: 2019/11/23 at 10:00 (*JST)
• execution_date: 2019-11-20 01:00:00 (UTC)
• ds: 2019-11-20
> Airflow stores task execution metadata in the SQL database in UTC
* Assuming default_timezone = system and the host is in the JST time zone
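The arithmetic behind these numbers can be sketched with the standard library. The DAG definitions are not in the transcript, so the `start_date` of 2019-11-20 10:00 JST and the two intervals (1 day and 3 days) are assumptions chosen because they reproduce the values on the slide.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))
UTC = timezone.utc


def first_run(start_date, schedule_interval):
    """Return (trigger_time, execution_date, ds) for the first DAG run.

    The run covering the period that begins at start_date is triggered
    once that period has ended, i.e. at start_date + schedule_interval.
    Its execution_date is the *start* of the period, stored in UTC.
    """
    trigger_time = start_date + schedule_interval
    execution_date = start_date.astimezone(UTC)
    ds = execution_date.strftime("%Y-%m-%d")
    return trigger_time, execution_date, ds


# Assumed start_date; not stated in the transcript.
start = datetime(2019, 11, 20, 10, 0, tzinfo=JST)

daily = first_run(start, timedelta(days=1))       # assumed daily schedule
three_day = first_run(start, timedelta(days=3))   # assumed 3-day schedule

for trigger_time, execution_date, ds in (daily, three_day):
    print(trigger_time.isoformat(), execution_date.isoformat(), ds)
```

In both cases `execution_date` and `ds` stay at 2019-11-20; only the moment the scheduler actually triggers the run moves with the interval.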

Notifications

Success | failure | retry event callbacks
Notification of task state events
> Task level callbacks for specific task
> DAG level callbacks for all tasks in DAG
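A callback of this kind might look like the sketch below. Airflow passes a context dictionary to `on_success_callback`, `on_failure_callback` and `on_retry_callback`; the sketch reads the `task_instance`, `ds` and (on failure) `exception` entries from it. The message format and the use of `print` in place of a real notifier are assumptions, and the fake context at the bottom exists only so the example runs without Airflow.

```python
from types import SimpleNamespace


def build_message(event, context):
    """Format a one-line notification from an Airflow-style context dict."""
    ti = context["task_instance"]
    text = f"[{event}] dag={ti.dag_id} task={ti.task_id} ds={context['ds']}"
    exc = context.get("exception")
    if exc is not None:
        text += f" error={exc}"
    return text


def on_failure(context):
    # Stand-in for a real sender (chat bot, e-mail, ...); hypothetical.
    print(build_message("failure", context))


def on_retry(context):
    print(build_message("retry", context))


def on_success(context):
    print(build_message("success", context))


# Fake context for demonstration; dag_id and task_id are made up.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="recommendation", task_id="training"),
    "ds": "2019-10-24",
    "exception": ValueError("loss is nan"),
}
message = build_message("failure", fake_context)
on_failure(fake_context)
```

For task level notification, such functions would be passed to a single operator; for DAG level coverage, one common approach is to put them in the DAG's `default_args` so every task inherits them.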

Monitoring training & inference stages

Monitor training stage of pipeline
• Training and validation loss monitoring
• If loss is invalid (e.g. inf, nan), then
• Restart training task with readjusted parameters (e.g. decrease learning rate)
> Training key metrics
[Diagram: recommendation pipeline stages, Generating input tables through Stream output, with "Invalid loss" marked]
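The restart rule above can be sketched as a small loop. This is an illustration under stated assumptions, not the speaker's implementation: `train_fn` is a hypothetical function that trains with a given learning rate and returns the recorded loss values, and the halving factor and restart limit are arbitrary.

```python
import math


def losses_are_valid(losses):
    """True if every recorded loss is a finite number (no inf, no nan)."""
    return all(math.isfinite(loss) for loss in losses)


def train_with_restarts(train_fn, learning_rate, max_restarts=3, decay=0.5):
    """Run train_fn, restarting with a smaller learning rate on invalid loss.

    Returns (learning_rate_used, losses) for the first valid attempt.
    """
    for _ in range(max_restarts + 1):
        losses = train_fn(learning_rate)
        if losses_are_valid(losses):
            return learning_rate, losses
        learning_rate *= decay  # readjust the parameter and try again
    raise RuntimeError("loss still invalid after all restarts")


def fake_train(learning_rate):
    """Toy stand-in for training: diverges when the learning rate is too high."""
    if learning_rate > 0.06:
        return [1.2, float("nan")]
    return [0.9, 0.7, 0.6]


final_lr, final_losses = train_with_restarts(fake_train, learning_rate=0.2)
print(final_lr, final_losses)
```

Inside an Airflow DAG, the same effect could also be reached by failing the task on invalid loss and letting the task's retry mechanism rerun it with adjusted parameters.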

Monitor and validate inference results
• min/max/average of classification scores
• Validation of target users/items (e.g. duplicates or abnormal coverage)
• KL-div of distribution of recommended items (e.g. comparison with weekly average)
> Monitor inference metrics
[Diagram: recommendation pipeline stages, Generating input tables through Stream output]
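The three checks listed above could be sketched as follows, using only the standard library. The function names, the sample data and the small smoothing constant `eps` (added so items missing from one distribution do not cause a division by zero) are all assumptions made for illustration.

```python
import math
from collections import Counter


def score_summary(scores):
    """min / max / average of classification scores."""
    return {
        "min": min(scores),
        "max": max(scores),
        "avg": sum(scores) / len(scores),
    }


def find_duplicates(ids):
    """Return the sorted ids that occur more than once."""
    return sorted(key for key, count in Counter(ids).items() if count > 1)


def kl_divergence(today_counts, baseline_counts, eps=1e-9):
    """KL(today || baseline) over recommended-item counts, with smoothing."""
    items = set(today_counts) | set(baseline_counts)
    p_total = sum(today_counts.values()) + eps * len(items)
    q_total = sum(baseline_counts.values()) + eps * len(items)
    kl = 0.0
    for item in items:
        p = (today_counts.get(item, 0) + eps) / p_total
        q = (baseline_counts.get(item, 0) + eps) / q_total
        kl += p * math.log(p / q)
    return kl


summary = score_summary([0.2, 0.4, 0.9])
duplicates = find_duplicates(["u1", "u2", "u1", "u3"])
same = kl_divergence({"a": 5, "b": 5}, {"a": 5, "b": 5})
shifted = kl_divergence({"a": 9, "b": 1}, {"a": 5, "b": 5})
print(summary, duplicates, same, shifted)
```

A validation task could compare `shifted` against a threshold derived from past runs (for example the weekly average mentioned on the slide) and fail the task when the distribution drifts too far.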

Propagate model signature to future tasks
ffm.TH.201910241414_20191023
> Get context -> task instance
• Push by key
> Get context -> task instance
• Pull by key and task_id of operator that pushed
[Diagram: recommendation pipeline stages, Generating input tables through Stream output]
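The push/pull steps above correspond to Airflow's XCom mechanism: a task obtains its task instance from the context and calls `xcom_push(key=..., value=...)`, and a later task calls `xcom_pull(task_ids=..., key=...)`. The sketch below shows that flow. The key name `model_signature` and the task ids are hypothetical, and `FakeTaskInstance` is a tiny in-memory stand-in so the example runs without Airflow; it is not Airflow's class.

```python
class FakeTaskInstance:
    """In-memory stand-in for the XCom methods of a task instance (demo only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the metastore

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def training(**context):
    """Training task: publish the signature of the model it produced."""
    signature = "ffm.TH.201910241414_20191023"  # example value from the slide
    context["ti"].xcom_push(key="model_signature", value=signature)


def inference(**context):
    """Inference task: look up the signature pushed by the training task."""
    return context["ti"].xcom_pull(task_ids="training", key="model_signature")


store = {}
training(ti=FakeTaskInstance("training", store))
pulled = inference(ti=FakeTaskInstance("inference", store))
print(pulled)
```

Passing the signature through XCom, rather than recomputing it from the current time in each task, means every downstream task refers to exactly the model that the training task produced in the same DAG run.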

Thanks!


LINE DevDay 2019
