Managing ML pipelines with Airflow

Khalid Huseynov
LINE Plus AI Services Lab Data scientist/Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/S1-19

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay Managing ML Pipelines With Airflow
    > Khalid Huseynov
    > LINE Plus AI Services Lab Data scientist/Software Engineer
  2. Agenda
    > Introduction & architecture
    > ML pipeline & DAG scheduling
    > Notifications
    > Monitoring training & inference stages
  3. Introduction & architecture
  4. Workflow as code
    Tasks
    DAG (Directed Acyclic Graph) of Tasks
    Setting dependencies
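The code shown on this slide is not part of the transcript. As a rough illustration of the idea only, the sketch below models tasks, a DAG of tasks, and dependency setting in plain Python using the standard-library `graphlib` module (Python 3.9+). The `Task` class and the task names are hypothetical stand-ins, not Airflow's API; in Airflow itself, dependencies between operators are commonly declared with the same `>>` notation.

```python
from graphlib import TopologicalSorter


class Task:
    """Hypothetical stand-in for an Airflow operator: a named unit of work."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # task_ids that must finish before this one

    def __rshift__(self, other):
        # `a >> b` declares that b depends on a; returning b lets chains work.
        other.upstream.add(self.task_id)
        return other


def execution_order(tasks):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    graph = {task.task_id: task.upstream for task in tasks}
    return list(TopologicalSorter(graph).static_order())


preprocess = Task("preprocess")
train = Task("train")
predict = Task("predict")

# Setting dependencies: preprocess runs first, then train, then predict.
preprocess >> train >> predict

order = execution_order([predict, train, preprocess])
print(order)
```

Because the graph must be acyclic, a declaration such as `a >> b` followed by `b >> a` cannot be ordered and is rejected.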
  5. Airflow multi-node architecture and HA
    [Diagram: two master nodes (Active / Standby), each with Airflow web server and Airflow scheduler; Slave node 1, Slave node 2, … each with a Celery worker; RabbitMQ queue; SQL Metastore; Ansible server: Deploy DAGS/Apps]
    > Deployment of DAGs and apps is done through Ansible
    > To provide HA
    • Health check of active server is monitored
    • If active server is down then standby server is activated
    • Also backup meta store and queues are maintained
  6. ML pipeline & DAG scheduling
  7. Recommendation pipeline in use
    Pipeline stages: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
    System view
    DAG view
  8. Multiple environments in one DAG
    Branch operator
    prod phase: Common part, Flow for prod environment
    beta phase: Common part, Flow for beta environment
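The branching code itself is not in the transcript. A minimal sketch of the decision function, assuming the phase arrives as a plain string: in Airflow, a callable like this can be handed to `BranchPythonOperator`, which follows the returned task_id and skips the other branch. The task ids `prod_flow_start` and `beta_flow_start` are hypothetical names.

```python
def choose_phase_branch(phase):
    """Return the task_id of the first task in the flow for this phase."""
    branches = {
        "prod": "prod_flow_start",  # hypothetical task_id
        "beta": "beta_flow_start",  # hypothetical task_id
    }
    if phase not in branches:
        # Fail loudly rather than silently running the wrong environment.
        raise ValueError(f"unknown phase: {phase!r}")
    return branches[phase]


print(choose_phase_branch("beta"))
```

Keeping the common part outside the branches means both environments share one definition of those tasks, and only the environment-specific flow differs.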
  9. Frequent usage of dates
    ./preprocess.sh 20191024 …
    ./train.sh 20191024 …
    ./predict.sh 20191024 …
    Pipeline stages: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
    Airflow facilitates date/time usage and statelessness of tasks through Jinja templates
  10. Dealing with dates through templates
    ./train.sh 2019-10-24
    ./train.sh 20191024
    ./train.sh 24-10-2019
    ./train.sh 2019-10-21
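The template expressions that produced these four commands are not in the transcript. Airflow does provide the `{{ ds }}` and `{{ ds_nodash }}` variables and the `macros.ds_format` and `macros.ds_add` helpers, and those would yield exactly these values, but that the slide used them is an assumption. The sketch below reimplements the two helpers locally in plain Python so the date arithmetic can be checked without Airflow.

```python
from datetime import datetime, timedelta


def ds_format(ds, input_format, output_format):
    """Local stand-in for macros.ds_format: re-express a date string."""
    return datetime.strptime(ds, input_format).strftime(output_format)


def ds_add(ds, days):
    """Local stand-in for macros.ds_add: shift a YYYY-MM-DD string by days."""
    shifted = datetime.strptime(ds, "%Y-%m-%d") + timedelta(days=days)
    return shifted.strftime("%Y-%m-%d")


ds = "2019-10-24"  # the value Airflow would supply as {{ ds }} for this run

commands = [
    f"./train.sh {ds}",
    f"./train.sh {ds.replace('-', '')}",  # same value as {{ ds_nodash }}
    f"./train.sh {ds_format(ds, '%Y-%m-%d', '%d-%m-%Y')}",
    f"./train.sh {ds_add(ds, -3)}",  # three days before the run date
]

for command in commands:
    print(command)
```

Since every date is derived from the run's own `ds`, rerunning a past DAG run regenerates the same commands, which is what keeps the tasks stateless.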
  11. DAG scheduling details
    • first run: 2019/11/21 at 10:00 (*JST)
    • execution_date: 2019-11-20 01:00:00 (UTC)
    • ds: 2019-11-20
    > The Airflow scheduler triggers the task soon after start_date + schedule_interval has passed
    • first run: 2019/11/23 at 10:00 (*JST)
    • execution_date: 2019-11-20 01:00:00 (UTC)
    • ds: 2019-11-20
    > Airflow stores task execution metadata in the SQL database in UTC
    * Assuming default_timezone = system and host is in JST time zone
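The DAG definitions behind these two examples are not in the transcript. The sketch below works through the arithmetic under stated assumptions: a `start_date` of 2019-11-20 10:00 JST in both cases, with a `schedule_interval` of one day for the first example and three days for the second. These values are chosen because they reproduce the numbers on the slide; they are not confirmed by the source.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def first_run(start_date, schedule_interval):
    """Return (trigger time, execution_date in UTC, ds) for the first run."""
    # The run is triggered once the first full interval has elapsed.
    trigger_time = start_date + schedule_interval
    # execution_date marks the start of that interval, stored in UTC.
    execution_date = start_date.astimezone(timezone.utc)
    ds = execution_date.strftime("%Y-%m-%d")
    return trigger_time, execution_date, ds


start = datetime(2019, 11, 20, 10, 0, tzinfo=JST)  # assumed start_date

daily = first_run(start, timedelta(days=1))       # assumed interval
three_days = first_run(start, timedelta(days=3))  # assumed interval

for trigger_time, execution_date, ds in (daily, three_days):
    print(trigger_time.isoformat(), execution_date.isoformat(), ds)
```

Under these assumptions the interval changes only when the first run fires; `execution_date` and `ds` stay tied to the start of the interval.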
  12. Notifications
  13. Notification of task state events
    Success | failure | retry event callbacks
    > Task level callbacks for specific task
    > DAG level callbacks for all tasks in DAG
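The callback code is not in the transcript. Airflow operators accept `on_success_callback`, `on_failure_callback`, and `on_retry_callback`, each called with the task's context dictionary. The sketch below shows one way such callbacks might be built; the message format and the `send` function are hypothetical, and applying callbacks to all tasks through `default_args` is one common approach rather than necessarily the speaker's. A `SimpleNamespace` stands in for the real task instance so the example runs without Airflow.

```python
from types import SimpleNamespace


def build_message(event, context):
    """Format a one-line notification from an Airflow-style context dict."""
    ti = context["task_instance"]
    message = f"[{event}] {ti.dag_id}.{ti.task_id} for {context['ds']}"
    if context.get("exception") is not None:
        message += f": {context['exception']}"
    return message


def make_callback(event, send=print):
    """Return a callback for one event; `send` is a hypothetical notifier."""
    def callback(context):
        send(build_message(event, context))
    return callback


# DAG level: callbacks placed in default_args apply to every task in the DAG.
default_args = {
    "on_failure_callback": make_callback("failure"),
    "on_retry_callback": make_callback("retry"),
}
# Task level: pass on_success_callback=make_callback("success") to one operator.

# Simulated failure event, with a fake context in place of Airflow's.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="recommendation", task_id="training"),
    "ds": "2019-10-24",
    "exception": ValueError("loss is nan"),
}
default_args["on_failure_callback"](fake_context)
```

Swapping `print` for a function that posts to a chat or paging service changes the delivery channel without touching the DAG definition.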
  14. Monitoring training & inference stages
  15. Monitor training stage of pipeline
    • Training and validation loss monitoring
    • If loss is invalid (e.g. inf, nan), then
    • Restart training task with readjusted parameters (e.g. decrease learning rate)
    > Training key metrics
    Invalid loss
    Pipeline stages: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
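The restart rule in the bullets above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: `train_fn`, the halving factor, the restart limit, and the toy model are all assumptions introduced for the example.

```python
import math


def loss_is_valid(loss):
    """A loss is valid only if it is a finite number (not inf, not nan)."""
    return math.isfinite(loss)


def train_with_restarts(train_fn, learning_rate, max_restarts=3, decay=0.5):
    """Run train_fn; on an invalid loss, restart with a smaller learning rate.

    train_fn is a hypothetical callable: learning_rate -> per-epoch losses.
    Returns (learning_rate, losses) for the first run whose losses are valid.
    """
    for _ in range(max_restarts + 1):
        losses = train_fn(learning_rate)
        if all(loss_is_valid(loss) for loss in losses):
            return learning_rate, losses
        learning_rate *= decay  # readjust the parameter before restarting
    raise RuntimeError("training kept producing invalid losses")


def toy_train(learning_rate):
    # Toy model for illustration: diverges when the learning rate is too high.
    return [float("nan")] if learning_rate > 0.1 else [0.9, 0.7, 0.6]


lr, losses = train_with_restarts(toy_train, learning_rate=0.4)
print(lr, losses)
```

Bounding the number of restarts matters: a loss that stays invalid at any learning rate points to a data or code problem that retrying will not fix.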
  16. Monitor and validate inference results
    • min/max/average of classification scores
    • Validation of target users/items (e.g. duplicates or abnormal coverage)
    • KL-div of distribution of recommended items (e.g. comparison with weekly average)
    > Monitor inference metrics
    Pipeline stages: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
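The three checks listed above can be sketched in plain Python. The function names, the smoothing constant `eps`, and the toy data are assumptions for illustration; the transcript does not say how the speaker computed these metrics or which thresholds were applied.

```python
import math
from collections import Counter


def score_summary(scores):
    """Return (min, max, average) of the classification scores."""
    return min(scores), max(scores), sum(scores) / len(scores)


def has_duplicates(ids):
    """True if any user or item id appears more than once."""
    return len(set(ids)) != len(ids)


def item_distribution(items, vocabulary, eps=1e-9):
    """Smoothed share of recommendations per item; no probability is zero."""
    counts = Counter(items)
    total = len(items) + eps * len(vocabulary)
    return {item: (counts[item] + eps) / total for item in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[key] * math.log(p[key] / q[key]) for key in p)


vocabulary = ["a", "b", "c"]
weekly = item_distribution(["a", "b", "c", "a", "b", "c"], vocabulary)
today = item_distribution(["a", "a", "a", "a", "b", "c"], vocabulary)

drift = kl_divergence(today, weekly)
print(score_summary([0.2, 0.4, 0.9]), has_duplicates(["u1", "u2", "u1"]), drift)
```

The smoothing term keeps the logarithm defined when an item that appears in the weekly baseline is missing from today's recommendations.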
  17. Propagate model signature to future tasks
    ffm.TH.201910241414_20191023
    > Get context -> task instance
    • Push by key
    > Get context -> task instance
    • Pull by key and task_id of operator that pushed
    Pipeline stages: Generating input tables -> Phase select. -> Feature preprocessing -> Training -> Inference -> Validation & Formatting -> Stream output
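The push and pull steps correspond to Airflow's XCom mechanism, where a task instance exposes `xcom_push(key=..., value=...)` and `xcom_pull(task_ids=..., key=...)`. The sketch below shows the two callables; the key `model_signature` and the task_id `training` are hypothetical names, and `FakeTaskInstance` is an in-memory stand-in so the example runs without Airflow.

```python
def push_model_signature(**context):
    """Training side: publish the model signature under a key."""
    signature = "ffm.TH.201910241414_20191023"  # example value from the slide
    context["task_instance"].xcom_push(key="model_signature", value=signature)


def pull_model_signature(**context):
    """Later task: read the signature by key and the pushing task's task_id."""
    return context["task_instance"].xcom_pull(
        task_ids="training", key="model_signature"
    )


class FakeTaskInstance:
    """In-memory stand-in for Airflow's TaskInstance, for illustration only."""

    store = {}  # shared by all instances, like the metadata database

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        self.store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self.store.get((task_ids, key))


push_model_signature(task_instance=FakeTaskInstance("training"))
pulled = pull_model_signature(task_instance=FakeTaskInstance("inference"))
print(pulled)
```

Passing only the signature string, rather than the model itself, keeps the exchanged value small; downstream tasks can use it to locate the model artifact.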
  18. Thanks!