[Architecture diagram: active/standby master nodes, each running an Airflow web server, Airflow scheduler, and SQL metastore; slave nodes running Celery workers; a RabbitMQ queue; an Ansible server deploying DAGs/apps]
> Deployment of DAGs and apps is done through Ansible
> To provide HA:
• The health of the active server is monitored
• If the active server goes down, the standby server is activated
• Backups of the metastore and queues are also maintained
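The failover logic above can be sketched in a few lines. This is a minimal illustration, not the actual deployment: the health endpoint URL, polling interval, and activation callback are all hypothetical placeholders for whatever the Ansible-managed setup uses.

```python
import time
import urllib.request

# Hypothetical health endpoint of the active Airflow webserver (illustrative only).
ACTIVE_HEALTH_URL = "http://airflow-active:8080/health"

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor(check, activate_standby, interval: float = 30.0, max_checks=None):
    """Poll the active server; on failure, activate the standby and stop."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if not check():
            activate_standby()  # e.g. promote the standby master node
            return "standby_activated"
        checks += 1
        time.sleep(interval)
    return "active_ok"
```

In practice the check would call `is_healthy(ACTIVE_HEALTH_URL)` and `activate_standby` would trigger the promotion (for example, an Ansible playbook run).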
> The Airflow scheduler triggers a task run soon after start_date + schedule_interval has passed
• first run: 2019/11/23 at 10:00 (JST*)
• execution_date: 2019-11-20 01:00:00 (UTC)
• ds: 2019-11-20
> Airflow stores task execution metadata in the SQL database in UTC
* Assuming default_timezone = system and the host is in the JST time zone
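The arithmetic behind these timestamps can be reproduced with plain `datetime`. A schedule_interval of 3 days is an assumption inferred from the example dates; the run is stamped with the start of its interval, while the actual trigger fires once the interval has elapsed.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # UTC+9, no DST

# Values from the example; the 3-day interval is an assumption.
start_date = datetime(2019, 11, 20, 1, 0, tzinfo=timezone.utc)  # 10:00 JST
schedule_interval = timedelta(days=3)

# execution_date marks the *start* of the scheduling interval (stored in UTC);
# the scheduler triggers the run soon after execution_date + schedule_interval.
execution_date = start_date
trigger_time = execution_date + schedule_interval

print(execution_date.strftime("%Y-%m-%d %H:%M:%S (UTC)"))  # 2019-11-20 01:00:00 (UTC)
print(execution_date.strftime("%Y-%m-%d"))                 # ds macro: 2019-11-20
print(trigger_time.astimezone(JST).strftime("%Y/%m/%d at %H:%M JST"))  # 2019/11/23 at 10:00 JST
```

This also illustrates why the metadata is kept in UTC: converting to the host's local zone (here JST) is a display-time concern only.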
> Key training metrics are monitored
> If the loss is invalid (e.g. inf, NaN):
• Restart the training task with readjusted parameters (e.g. a decreased learning rate)
[Pipeline diagram: Generating input tables → Phase selection → Feature preprocessing → Training → Inference → Validation & formatting → Stream output]
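The invalid-loss check and restart policy above can be sketched as follows. `train_step`, the retry limit, and the decay factor are all illustrative assumptions, not the actual training code.

```python
import math

def loss_is_invalid(loss: float) -> bool:
    """Treat NaN or infinite loss values as invalid."""
    return math.isnan(loss) or math.isinf(loss)

def train_with_retries(train_step, lr: float, max_retries: int = 3, decay: float = 0.1):
    """Run training; on an invalid loss, restart with a decreased learning rate.

    `train_step(lr)` is a hypothetical callable standing in for one full
    training run; it returns the final loss.
    """
    for _ in range(max_retries + 1):
        loss = train_step(lr)
        if not loss_is_invalid(loss):
            return lr, loss  # training converged with a valid loss
        lr *= decay  # readjust: decrease the learning rate before restarting
    raise RuntimeError("loss remained invalid after all retries")
```

In the pipeline above, this corresponds to the monitoring step restarting the Training phase with the readjusted parameters.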