Slide 1

Slide 1 text

2019 DevDay
Managing ML Pipelines With Airflow
> Khalid Huseynov
> LINE Plus AI Services Lab, Data Scientist/Software Engineer

Slide 2

Slide 2 text

Agenda
> Introduction & architecture
> ML pipeline & DAG scheduling
> Notifications
> Monitoring training & inference stages

Slide 3

Slide 3 text

Introduction & architecture

Slide 4

Slide 4 text

Workflow as code
> Tasks
> DAG (Directed Acyclic Graph) of tasks
> Setting dependencies
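Airflow operators overload `>>` so that dependencies between tasks can be declared in code. The sketch below imitates that idea with a toy `Task` class and the standard library only; `Task` and `run_order` are hypothetical helpers for illustration, not the Airflow API, and the task names are borrowed from the scripts shown later in the deck.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Task:
    """Toy stand-in for an Airflow operator (hypothetical, not the Airflow API)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # task_ids that must finish before this task

    def __rshift__(self, other):
        # `a >> b` records that b runs after a; returning b lets chains
        # such as `a >> b >> c` work left to right.
        other.upstream.add(self.task_id)
        return other


def run_order(tasks):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    graph = {task.task_id: task.upstream for task in tasks}
    return list(TopologicalSorter(graph).static_order())


preprocess = Task("preprocess")
train = Task("train")
predict = Task("predict")

# Setting dependencies: preprocess first, then train, then predict.
preprocess >> train >> predict

order = run_order([preprocess, train, predict])
```

Because the graph must be acyclic, a dependency such as `predict >> preprocess` added on top of the chain would make `run_order` raise instead of returning an order.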

Slide 5

Slide 5 text

Airflow multi-node architecture and HA
[Diagram: two master nodes (Active and Standby), each running the Airflow web server and Airflow scheduler; Slave node 1, Slave node 2, …, each running a Celery worker; RabbitMQ queue; SQL metastore; Ansible server that deploys DAGs/apps]
> Deployment of DAGs and apps is done through Ansible
> To provide HA
• The health of the active server is monitored
• If the active server goes down, the standby server is activated
• Backup metastore and queues are also maintained

Slide 6

Slide 6 text

ML pipeline & DAG scheduling

Slide 7

Slide 7 text

Recommendation pipeline in use
Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
[Figures: System view, DAG view]

Slide 8

Slide 8 text

Multiple environments in one DAG
[Figure labels: Branch operator; prod phase: flow for prod environment, common part; beta phase: common part, flow for beta environment]
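In Airflow, a branch operator such as `BranchPythonOperator` runs a Python callable and follows only the task whose `task_id` the callable returns. A minimal sketch of such a callable is shown below. The task ids `prod_flow_start` and `beta_flow_start` are hypothetical, and how the phase value reaches the callable is an assumption, since the slide does not show it.

```python
# Hypothetical mapping from phase name to the first task of each flow.
FLOW_ENTRY_TASKS = {
    "prod": "prod_flow_start",
    "beta": "beta_flow_start",
}


def select_phase(phase):
    """Return the task_id of the flow to follow for the given phase."""
    if phase not in FLOW_ENTRY_TASKS:
        # Fail loudly rather than silently running the wrong environment.
        raise ValueError(f"unknown phase: {phase!r}")
    return FLOW_ENTRY_TASKS[phase]


chosen = select_phase("beta")
```

Keeping the selection in one small function means both environments share a single DAG definition, and only the returned `task_id` differs.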

Slide 9

Slide 9 text

Frequent usage of dates
./preprocess.sh 20191024 …
./train.sh 20191024 …
./predict.sh 20191024 …
[Figure: recommendation pipeline stages]
Airflow facilitates date/time usage and statelessness of tasks through Jinja templates

Slide 10

Slide 10 text

Dealing with dates through templates
./train.sh 2019-10-24
./train.sh 20191024
./train.sh 24-10-2019
./train.sh 2019-10-21
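The template source for these four commands is not preserved in the slide text. They are consistent with Airflow's `{{ ds }}` and `{{ ds_nodash }}` variables and the `macros.ds_format` and `macros.ds_add` helpers, but that mapping is an assumption. The sketch below reproduces the same four date strings with the standard library so the transformations are explicit; `date_variants` and its dictionary keys are hypothetical names.

```python
from datetime import datetime, timedelta


def date_variants(ds):
    """Derive the date strings shown on the slide from a YYYY-MM-DD date."""
    day = datetime.strptime(ds, "%Y-%m-%d")
    return {
        "ds": ds,                                   # 2019-10-24 style
        "ds_nodash": day.strftime("%Y%m%d"),        # 20191024 style
        "day_first": day.strftime("%d-%m-%Y"),      # 24-10-2019 style
        # Three days before the given date, back in YYYY-MM-DD form.
        "three_days_earlier": (day - timedelta(days=3)).strftime("%Y-%m-%d"),
    }


variants = date_variants("2019-10-24")
commands = [f"./train.sh {value}" for value in variants.values()]
```

Since every string is derived from the run's own date rather than from the wall clock, rerunning a past date yields the same commands, which is what keeps the tasks stateless.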

Slide 11

Slide 11 text

DAG scheduling details
• first run: 2019/11/21 at 10:00 (*JST)
• execution_date: 2019-11-20 01:00:00 (UTC)
• ds: 2019-11-20
> The Airflow scheduler triggers the task soon after start_date + schedule_interval has passed
• first run: 2019/11/23 at 10:00 (*JST)
• execution_date: 2019-11-20 01:00:00 (UTC)
• ds: 2019-11-20
> Airflow stores task execution metadata in the SQL database in UTC
* Assuming default_timezone = system and the host is in the JST time zone
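The DAG definitions behind the two examples are not preserved in the slide text. The numbers are consistent with a `start_date` of 2019-11-20 10:00 JST combined with a one-day and a three-day `schedule_interval`; both values are assumptions inferred from the listed dates. Under those assumptions, the arithmetic can be sketched with the standard library (`first_run` is a hypothetical helper, not an Airflow function):

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def first_run(start_date, schedule_interval):
    """Return (trigger time, execution_date in UTC, ds) for the first DAG run."""
    # The first run is triggered once the first full interval has elapsed.
    trigger_time = start_date + schedule_interval
    # execution_date marks the start of that interval and is stored in UTC.
    execution_date = start_date.astimezone(timezone.utc)
    return trigger_time, execution_date, execution_date.strftime("%Y-%m-%d")


start = datetime(2019, 11, 20, 10, 0, tzinfo=JST)  # assumed start_date
daily = first_run(start, timedelta(days=1))        # assumed daily schedule
three_day = first_run(start, timedelta(days=3))    # assumed 3-day schedule
```

In both cases `execution_date` and `ds` stay at the start of the interval (2019-11-20), while the wall-clock trigger time moves with the interval length.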

Slide 12

Slide 12 text

Notifications

Slide 13

Slide 13 text

Success | failure | retry event callbacks
Notification of task state events
> Task-level callbacks for a specific task
> DAG-level callbacks for all tasks in the DAG
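Airflow operators accept `on_success_callback`, `on_failure_callback`, and `on_retry_callback`, each called with the task's context dictionary. Setting one on a single operator gives a task-level callback; one common way to cover every task in a DAG is to place it in the DAG's `default_args`. The sketch below shows what such a callback might look like. The message format, `make_callback`, and the `send` function are hypothetical, and a `SimpleNamespace` stands in for the real task instance so the example runs without Airflow.

```python
from types import SimpleNamespace


def format_event(event, context):
    """Build a one-line notification from an Airflow-style context dict."""
    ti = context["task_instance"]
    return f"[{event}] dag={ti.dag_id} task={ti.task_id} ds={context['ds']}"


def make_callback(event, send):
    """Return a callback(context) that sends a message for the given event."""
    def callback(context):
        send(format_event(event, context))
    return callback


# Demo: collect messages in a list instead of calling a real messenger.
sent = []
on_failure = make_callback("failure", sent.append)

fake_context = {
    "task_instance": SimpleNamespace(dag_id="recommendation", task_id="training"),
    "ds": "2019-10-24",
}
on_failure(fake_context)
```

Building the callbacks from one factory keeps success, failure, and retry messages consistent, and swapping `send` is enough to change the notification channel.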

Slide 14

Slide 14 text

Monitoring training & inference stages

Slide 15

Slide 15 text

Monitor training stage of pipeline
> Training key metrics
• Training and validation loss monitoring
• If the loss is invalid (e.g. inf, nan), then
• Restart the training task with readjusted parameters (e.g. decrease learning rate)
[Figure: recommendation pipeline stages; label: Invalid loss]
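The check-and-restart rule above can be sketched as a small loop. The slide does not say how the restart is wired into Airflow or by how much the parameters are readjusted, so `train_with_restarts`, the halving factor, and the restart limit are all assumptions made for illustration.

```python
import math


def train_with_restarts(train_fn, learning_rate, max_restarts=3, decay=0.5):
    """Run train_fn(learning_rate); on an invalid loss, retry with a lower rate.

    train_fn is any callable returning the final loss as a float.
    Returns (loss, learning_rate) of the first run with a finite loss.
    """
    for _ in range(max_restarts + 1):
        loss = train_fn(learning_rate)
        if math.isfinite(loss):  # False for inf, -inf, and nan
            return loss, learning_rate
        learning_rate *= decay   # readjust before restarting
    raise RuntimeError("loss stayed invalid after all restarts")


def fake_train(learning_rate):
    """Stand-in trainer: diverges when the learning rate is too high."""
    return float("inf") if learning_rate > 0.03 else 0.42


loss, final_rate = train_with_restarts(fake_train, learning_rate=0.1)
```

Bounding the number of restarts matters in a scheduled pipeline: a model that never converges should surface as a failed task rather than loop indefinitely.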

Slide 16

Slide 16 text

Monitor and validate inference results
> Monitor inference metrics
• min/max/average of classification scores
• Validation of target users/items (e.g. duplicates or abnormal coverage)
• KL divergence of the distribution of recommended items (e.g. comparison with weekly average)
[Figure: recommendation pipeline stages]
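The three checks listed above can be sketched with the standard library. The function names, the input shapes (a list of scores, a list of ids, and per-item recommendation counts), and the `eps` floor used to avoid division by zero are assumptions; the slide does not describe the actual implementation or any alert thresholds.

```python
import math
from collections import Counter


def score_summary(scores):
    """Min, max, and mean of classification scores."""
    return {"min": min(scores), "max": max(scores), "mean": sum(scores) / len(scores)}


def duplicates(ids):
    """Sorted list of ids that appear more than once."""
    return sorted(key for key, count in Counter(ids).items() if count > 1)


def kl_divergence(today_counts, baseline_counts, eps=1e-9):
    """KL(today || baseline) over item-recommendation counts.

    Baseline probabilities are floored at eps, a simplification that keeps
    the result finite when an item is missing from the baseline.
    """
    p_total = sum(today_counts.values())
    q_total = sum(baseline_counts.values())
    kl = 0.0
    for item, count in today_counts.items():
        p = count / p_total
        q = baseline_counts.get(item, 0) / q_total
        if p > 0:
            kl += p * math.log(p / max(q, eps))
    return kl


summary = score_summary([0.2, 0.8, 0.5])
repeated = duplicates(["u1", "u2", "u1"])
same = kl_divergence({"a": 2, "b": 2}, {"a": 1, "b": 1})
shifted = kl_divergence({"a": 3, "b": 1}, {"a": 1, "b": 1})
```

A divergence near zero means today's recommended-item distribution matches the baseline (for example, a weekly average); a large value suggests the inference output has drifted and deserves inspection before it is streamed out.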

Slide 17

Slide 17 text

Propagate model signature to future tasks
ffm.TH.201910241414_20191023
> Get context -> task instance
• Push by key
> Get context -> task instance
• Pull by key and task_id of the operator that pushed
[Figure: recommendation pipeline stages]
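The push and pull steps above match Airflow's XCom mechanism, where a task instance exposes `xcom_push(key, value)` and `xcom_pull(task_ids, key)`. The sketch below shows the two callables under stated assumptions: the key `model_signature` and the task id `training` are hypothetical, and `FakeTaskInstance` is an in-memory stand-in so the example runs without Airflow.

```python
class FakeTaskInstance:
    """In-memory stand-in for Airflow's TaskInstance (illustration only)."""

    def __init__(self, store, task_id):
        self._store = store
        self.task_id = task_id

    def xcom_push(self, key, value):
        # Values are stored per (pushing task, key) pair.
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def train(**context):
    """Training task: publish the signature of the model it produced."""
    signature = "ffm.TH.201910241414_20191023"  # example value from the slide
    context["ti"].xcom_push(key="model_signature", value=signature)


def predict(**context):
    """Inference task: look up the signature pushed by the training task."""
    return context["ti"].xcom_pull(task_ids="training", key="model_signature")


store = {}
train(ti=FakeTaskInstance(store, "training"))
pulled = predict(ti=FakeTaskInstance(store, "inference"))
```

Passing the signature through the task instance rather than recomputing it downstream ensures that inference uses exactly the model the training task produced in the same run.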

Slide 18

Slide 18 text

Thanks!