Managing ML pipelines with Airflow

Khalid Huseynov
LINE Plus AI Services Lab Data scientist/Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/S1-19

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay: Managing ML Pipelines With Airflow

    > Khalid Huseynov
    > LINE Plus AI Services Lab, Data Scientist/Software Engineer
  2. Agenda

    > Introduction & architecture
    > ML pipeline & DAG scheduling
    > Notifications
    > Monitoring training & inference stages
  3. Airflow multi-node architecture and HA

    Diagram: two master nodes (active and standby), each running the Airflow web server and the Airflow scheduler; slave nodes 1, 2, … each running a Celery worker; a RabbitMQ queue; a SQL metastore; an Ansible server that deploys DAGs/apps.

    > Deployment of DAGs and apps is done through Ansible
    > To provide HA
      • The health of the active server is monitored
      • If the active server is down, the standby server is activated
      • Backup metastore and queues are also maintained
    …
  4. Recommendation pipeline in use

    Pipeline stages: Generating input tables, Phase select., Feature preprocessing, Training, Inference, Validation & Formatting, Stream output
    Figure panels: System view, DAG view
  5. Multiple environments in one DAG

    Diagram labels: Branch operator; prod phase; beta phase; Common part; Flow for prod environment; Flow for beta environment
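One way to express the branch on this slide is a small callable that returns the task_id of the flow to follow, which is the contract Airflow's BranchPythonOperator expects from its python_callable. The sketch below is plain Python and is not taken from the talk: the `phase` parameter and the task ids `prod_flow` and `beta_flow` are hypothetical names.

```python
# Hypothetical task ids for the two environment-specific flows.
BRANCH_TASK_IDS = {"prod": "prod_flow", "beta": "beta_flow"}


def choose_phase_branch(**context):
    """Return the task_id of the branch to run for the configured phase.

    Assumption for this sketch: the phase arrives through the DAG's params
    and defaults to "beta" when it is not set.
    """
    phase = context.get("params", {}).get("phase", "beta")
    if phase not in BRANCH_TASK_IDS:
        raise ValueError("unknown phase: {!r}".format(phase))
    return BRANCH_TASK_IDS[phase]


if __name__ == "__main__":
    print(choose_phase_branch(params={"phase": "prod"}))
```

In a DAG, a function like this would be passed as `python_callable` to a BranchPythonOperator placed after the common part, so that only the selected flow runs and the other branch is skipped.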
  6. Frequent usage of dates

    ./preprocess.sh 20191024 …
    ./train.sh 20191024 …
    ./predict.sh 20191024 …

    Airflow facilitates date/time usage and statelessness of tasks through Jinja templates.

    Diagram: recommendation pipeline stages, as on slide 4.
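To illustrate the templating idea, the sketch below replaces the hard-coded date with `{{ ds_nodash }}`, Airflow's template variable for the execution date in YYYYMMDD form. The `render_commands` helper is only a stand-in for Airflow's Jinja rendering so that the example runs without Airflow, and the arguments elided on the slide are left out.

```python
from datetime import datetime

# Command templates as they might appear in a BashOperator's bash_command.
# {{ ds_nodash }} is Airflow's template variable for the execution date (YYYYMMDD).
COMMAND_TEMPLATES = [
    "./preprocess.sh {{ ds_nodash }}",
    "./train.sh {{ ds_nodash }}",
    "./predict.sh {{ ds_nodash }}",
]


def render_commands(execution_date):
    """Stand-in for Airflow's Jinja rendering: substitutes ds_nodash only."""
    ds_nodash = execution_date.strftime("%Y%m%d")
    return [t.replace("{{ ds_nodash }}", ds_nodash) for t in COMMAND_TEMPLATES]


if __name__ == "__main__":
    for command in render_commands(datetime(2019, 10, 24)):
        print(command)
```

Because the date comes from the run's execution date rather than from the wall clock, re-running a past DAG run produces the same commands, which is what keeps the tasks stateless.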
  7. DAG scheduling details

    • first run: 2019/11/21 at 10:00 (*JST)
    • execution_date: 2019-11-20 01:00:00 (UTC)
    • ds: 2019-11-20

    > The Airflow scheduler triggers the task soon after start_date + schedule_interval has passed

    • first run: 2019/11/23 at 10:00 (*JST)
    • execution_date: 2019-11-20 01:00:00 (UTC)
    • ds: 2019-11-20

    > Airflow stores task execution metadata in the SQL database in UTC

    * Assuming default_timezone = system and the host is in the JST time zone
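The arithmetic behind these values can be checked with the standard library alone. The sketch below assumes a start_date of 2019-11-20 10:00 JST and schedule intervals of one day and three days; the DAG definitions from the slide are not in the transcript, so these inputs are assumptions chosen because they reproduce the values shown.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")


def first_run(start_date, schedule_interval):
    """Return (trigger_time, execution_date_utc, ds) for the first DAG run.

    The run is triggered once the first interval has fully passed, while
    execution_date marks the start of that interval and is stored in UTC.
    """
    trigger_time = start_date + schedule_interval
    execution_date_utc = start_date.astimezone(timezone.utc)
    ds = execution_date_utc.strftime("%Y-%m-%d")
    return trigger_time, execution_date_utc, ds


if __name__ == "__main__":
    start = datetime(2019, 11, 20, 10, 0, tzinfo=JST)  # assumed start_date
    for interval in (timedelta(days=1), timedelta(days=3)):
        trigger, execution_date, ds = first_run(start, interval)
        print(trigger.isoformat(), execution_date.isoformat(), ds)
```

In both cases execution_date and ds stay at the start of the interval; only the moment the scheduler triggers the run moves with the interval length.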
  8. Notification of task state events

    > Success | failure | retry event callbacks
    > Task-level callbacks for a specific task
    > DAG-level callbacks for all tasks in the DAG
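A minimal sketch of such callbacks follows. Airflow calls `on_success_callback`, `on_failure_callback` and `on_retry_callback` with a single context dictionary; everything else here is an assumption for illustration: the message format, the `SENT_MESSAGES` list standing in for a real notification channel, and the dag and task names in the demo.

```python
from types import SimpleNamespace

SENT_MESSAGES = []  # stand-in for a real notification channel (hypothetical)


def build_message(event, context):
    """Format a short notification from the Airflow task context."""
    ti = context["task_instance"]
    return "[{}] dag={} task={} execution_date={}".format(
        event, ti.dag_id, ti.task_id, context["execution_date"]
    )


def make_callback(event):
    """Create a callback that records a message for the given event type."""
    def callback(context):
        SENT_MESSAGES.append(build_message(event, context))
    return callback


notify_success = make_callback("success")
notify_failure = make_callback("failure")
notify_retry = make_callback("retry")

# Task level: pass a callback to one operator, e.g. on_failure_callback=notify_failure.
# DAG level: put the callbacks in default_args so every task in the DAG inherits them.
DEFAULT_ARGS = {
    "on_success_callback": notify_success,
    "on_failure_callback": notify_failure,
    "on_retry_callback": notify_retry,
}

if __name__ == "__main__":
    # Fake context so the sketch runs without Airflow.
    fake_context = {
        "task_instance": SimpleNamespace(dag_id="recommendation", task_id="training"),
        "execution_date": "2019-11-20T01:00:00+00:00",
    }
    notify_failure(fake_context)
    print(SENT_MESSAGES[-1])
```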
  9. Monitor training stage of pipeline

    > Training key metrics
      • Training and validation loss monitoring
      • If the loss is invalid (e.g. inf, nan), then
        • Restart the training task with readjusted parameters (e.g. decrease the learning rate)

    Figure label: Invalid loss
    Diagram: recommendation pipeline stages, as on slide 4.
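The restart rule can be sketched as a small loop around the training call. This is an illustration under assumptions, not the pipeline from the talk: `train_with_restarts`, the halving factor, the restart limit and the toy `fake_train` function are all hypothetical.

```python
import math


def loss_is_valid(loss):
    """A loss is treated as valid only if it is a finite number (not inf or nan)."""
    return math.isfinite(loss)


def train_with_restarts(train_fn, learning_rate, decay=0.5, max_restarts=3):
    """Run train_fn, restarting with a smaller learning rate on invalid loss.

    Returns (learning_rate_used, final_loss, number_of_restarts).
    """
    for restarts in range(max_restarts + 1):
        loss = train_fn(learning_rate)
        if loss_is_valid(loss):
            return learning_rate, loss, restarts
        learning_rate *= decay  # readjust the parameter before the next attempt
    raise RuntimeError("loss still invalid after {} restarts".format(max_restarts))


def fake_train(learning_rate):
    """Toy stand-in for a training job: diverges when the rate is too high."""
    return float("nan") if learning_rate > 0.15 else 0.42


if __name__ == "__main__":
    print(train_with_restarts(fake_train, 0.4))
```

Inside Airflow the same effect could be reached in other ways, for example by failing the task and letting a retry pick up adjusted parameters; the transcript does not say which approach the pipeline uses.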
  10. Monitor and validate inference results

    > Monitor inference metrics
      • min/max/average of classification scores
      • Validation of target users/items (e.g. duplicates or abnormal coverage)
      • KL divergence of the distribution of recommended items (e.g. comparison with the weekly average)

    Diagram: recommendation pipeline stages, as on slide 4.
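The checks listed on this slide can be sketched with the standard library. The function names, the smoothing constant and the sample data below are assumptions for illustration; the thresholds at which the real pipeline would raise an alert are not given in the transcript.

```python
import math
from collections import Counter


def score_summary(scores):
    """Return the min, max and average of a non-empty list of scores."""
    return {"min": min(scores), "max": max(scores), "avg": sum(scores) / len(scores)}


def has_duplicates(user_item_pairs):
    """True if any (user, item) pair appears more than once."""
    return len(user_item_pairs) != len(set(user_item_pairs))


def item_distribution(items, vocabulary, eps=1e-9):
    """Smoothed relative frequency of each vocabulary item among recommended items."""
    counts = Counter(items)
    total = len(items) + eps * len(vocabulary)
    return {item: (counts[item] + eps) / total for item in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


if __name__ == "__main__":
    vocabulary = ["a", "b", "c"]
    today = item_distribution(["a", "a", "b", "c"], vocabulary)
    weekly_average = item_distribution(["a", "b", "b", "c"], vocabulary)
    print(score_summary([0.25, 0.5, 0.75]))
    print(kl_divergence(today, weekly_average))
```

A large divergence from the weekly average, duplicated pairs, or scores outside the expected range would each be a reason to stop the pipeline before the results are streamed out.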
  11. Propagate model signature to future tasks

    Model signature: ffm.TH.201910241414_20191023

    > Get context -> task instance
      • Push by key
    > Get context -> task instance
      • Pull by key and task_id of the operator that pushed

    Diagram: recommendation pipeline stages, as on slide 4.
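The push and pull steps correspond to Airflow's XCom calls on the task instance taken from the context: `xcom_push(key=..., value=...)` and `xcom_pull(task_ids=..., key=...)`. In the sketch below the key name, the task ids and the `FakeTaskInstance` class are hypothetical; the fake class only exists so that the example runs without Airflow.

```python
MODEL_SIGNATURE_KEY = "model_signature"  # hypothetical XCom key
TRAINING_TASK_ID = "training"            # hypothetical task_id of the pushing task


def push_model_signature(signature, **context):
    """Training side: get the task instance from the context and push by key."""
    context["ti"].xcom_push(key=MODEL_SIGNATURE_KEY, value=signature)


def pull_model_signature(**context):
    """Downstream side: pull by key and by the task_id of the task that pushed."""
    return context["ti"].xcom_pull(task_ids=TRAINING_TASK_ID, key=MODEL_SIGNATURE_KEY)


class FakeTaskInstance:
    """In-memory stand-in for Airflow's task instance (illustration only)."""

    _store = {}

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        FakeTaskInstance._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return FakeTaskInstance._store.get((task_ids, key))


if __name__ == "__main__":
    push_model_signature("ffm.TH.201910241414_20191023", ti=FakeTaskInstance("training"))
    print(pull_model_signature(ti=FakeTaskInstance("inference")))
```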