

Using Airflow in production: the good, the bad and the ugly

Sharing our experiences in using Airflow in production at Platform Lunar.

Saulius Grigaliunas

April 11, 2017


Transcript

  1. Apache Airflow

    Airbnb, Agari, allegro.pl, AltX, Apigee, Astronomer, Auth0, BandwidthX, Bellhops, BlaBlaCar, Bloc, BlueApron, Blue Yonder, Celect, Change.org, Children's Hospital of Philadelphia Division of Genomic Diagnostics, City of San Diego, Clairvoyant, Clover Health, Chartboost, Cotap, Digital First Media, Easy Taxi, FreshBooks, Gentner Lab, Glassdoor, GovTech GDS, Gusto, Handshake, Handy, HBO, HelloFresh, Holimetrix, Hootsuite, IFTTT, iHeartRadio, ING, Jampp, Kiwi.com, Kogan.com, Lemann Foundation, LendUp, liligo, LingoChamp, Lucid, Lumos Labs, Lyft, Madrone, Markovian, Mercadoni, MiNODES, MFG Labs, mytaxi, Nerdwallet, OfferUp, OneFineStay, Open Knowledge International, PayPal, Postmates, Qubole, Scaleway, Sense360, Shopkick, Sidecar, SimilarWeb, SmartNews, Spotify, Stackspace, Stripe, Thumbtack, T2 Systems, Vente-Exclusive.com, Vnomics, WePay, WeTransfer, Whistle Labs, WiseBanyan, Wooga, Xoom, Yahoo!, Zapier, Zendesk, Zenly
  2. The setup

    • 22 DAGs/workflows
    • 4500 tasks run daily
    • Various intervals: hourly, daily, every 3 hours, every 6 hours, every 5 minutes
    • Airflow instance runs on AWS EC2, uses Postgres RDS db
    • Executes tasks: locally (Airflow instance), via Docker, via AWS Elastic Container Service, via AWS Elastic MapReduce
  3. The good: vast operator support

    • BashOperator • DockerOperator • EmailOperator • HiveOperator • HttpOperator • JdbcOperator • MssqlOperator • MysqlOperator • OracleOperator • PigOperator • PostgresOperator • SqliteOperator • BigQueryOperator • DatabricksOperator • EmrOperator • EcsOperator • JiraOperator • HipChatOperator • SqoopOperator • SshExecuteOperator • SlackOperator • VerticaOperator
  4. The good: other features

    • Task pools - limit the number of tasks running at once
    • Variables - set shared variables (or secrets) via the UI or environment variables, use them in DAGs later
    • Service level agreements - know when things did not run or took too long
  5. The bad & ugly: DAG deployment

    • Basically no API in 2017
    • Solution:
      • GitLab CI script pushes to S3 bucket
      • Airflow instance uses cron to pull from S3 bucket

      * * * * * aws --region=eu-central-1 s3 sync s3://bucket /var/lib/airflow/airflow/dags --exact-timestamps --exclude '*.pyc' --delete
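The pull side of this deployment can also be driven from Python instead of a raw crontab entry. The following is a minimal sketch that only assembles the `aws s3 sync` invocation shown on the slide; the function names `build_sync_command` and `as_cron_line` are hypothetical helpers introduced for illustration, and the bucket and path are the placeholder values from the slide.

```python
import shlex


def build_sync_command(bucket_url, dags_dir, region="eu-central-1"):
    """Return the argv list for the `aws s3 sync` call from the slide.

    Hypothetical helper: bucket_url and dags_dir are supplied by the caller.
    """
    return [
        "aws", f"--region={region}", "s3", "sync",
        bucket_url, dags_dir,
        "--exact-timestamps",   # re-download files whose timestamps differ
        "--exclude", "*.pyc",   # skip compiled Python files
        "--delete",             # remove local DAGs that were deleted in the bucket
    ]


def as_cron_line(argv, schedule="* * * * *"):
    """Render the command as a crontab entry, quoting arguments for the shell."""
    return f"{schedule} {' '.join(shlex.quote(arg) for arg in argv)}"
```

Passing the list to `subprocess.run(argv, check=True)` would run the sync once; `as_cron_line` reproduces the every-minute crontab entry above.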
  6. The bad & ugly - scheduler restarts

    Bug 1286825 - Airflow scheduler stopped working silently (https://bug623317.bugzilla.mozilla.org/show_bug.cgi?id=1286825)

    > Apparently we are not the only ones experiencing this issue. A workaround is to restart the scheduler "frequently".

    > It looks like the way this is typically handled is to set a limit on the number of runs the scheduler will process before stopping, then have some supervisor keep restarting it.

      */5 * * * * service airflow-scheduler restart
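The approach described in the quoted bug comment (let the scheduler exit after a fixed number of runs and have a supervisor start it again) can be sketched in a few lines. This is an illustration under assumptions, not what the deck's authors ran: `supervise` is a hypothetical helper, and in practice systemd or supervisord would usually play this role.

```python
import subprocess
import time


def supervise(argv, max_restarts=None, pause_seconds=5.0):
    """Run argv again every time it exits and collect the exit codes.

    max_restarts=None loops forever, which is what a real supervisor does;
    a finite value makes the loop easy to try out.
    """
    exit_codes = []
    while max_restarts is None or len(exit_codes) < max_restarts:
        exit_codes.append(subprocess.call(argv))  # blocks until the process exits
        time.sleep(pause_seconds)                 # brief pause before restarting
    return exit_codes
```

With Airflow 1.x this could be called as `supervise(["airflow", "scheduler", "--num_runs", "5"])`, where the value 5 is an arbitrary example. Unlike the cron-based restart, the scheduler is only replaced after it has exited on its own.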
  7. The bad & ugly - updated DAG runs, but UI does not reflect changes

    [AIRFLOW-276] Refresh stale dags - https://github.com/apache/incubator-airflow/pull/1621

    > Parsing and executing dag files can be slow, since they are python scripts. Hence, we cannot reload the dags folder on every request. The current code deals with this by only loading the dags folder once on startup, so it doesn't pick up new changes.

      worker_refresh_batch_size = 1
      worker_refresh_interval = 30
  8. The bad & ugly - my server load is 57 but I only have two CPU cores

      $ ps aux | grep airflow-scheduler | wc -l
      609
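The same diagnosis can be scripted. The sketch below assumes the `ps aux` output has already been captured as text; `count_processes` and `load_per_core` are hypothetical helper names. Note that a `ps | grep | wc -l` pipeline can count the `grep` process itself, so this version skips lines that mention `grep`.

```python
def count_processes(ps_output, needle="airflow-scheduler"):
    """Count lines of `ps aux` output that mention needle, ignoring grep itself."""
    count = 0
    for line in ps_output.splitlines():
        if needle in line and "grep" not in line:
            count += 1
    return count


def load_per_core(load_average, cpu_cores):
    """Normalise the load average by core count; values far above 1 mean overload."""
    return load_average / cpu_cores
```

For the numbers on the slide, `load_per_core(57, 2)` gives 28.5 runnable processes per core.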
  9. The bad & ugly - my server load is 57 but I only have two CPU cores

    Systemd service config:

      [Unit]
      Description=Airflow scheduler daemon
      After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
      Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

      [Service]
      EnvironmentFile=/etc/sysconfig/airflow
      User=airflow
      Group=airflow
      # Type=forking (replaced with simple)
      Type=simple
      ExecStart=/bin/airflow scheduler
      Restart=always
      RestartSec=5s

      [Install]
      WantedBy=multi-user.target
  10. The bad & ugly - my jobs are killed randomly

    Does Airflow restart affect current running jobs?

    > If you only restart the airflow webserver/scheduler processes then the running jobs are not affected. However restarting the worker process kills the job (killed as zombie - http://airflow.incubator.apache.org/concepts.html#zombies-undeads) and then it may or may not be retried accordingly to the dag/task rules.

    http://stackoverflow.com/questions/39021636/does-airflow-restart-affect-current-running-jobs

      $ airflow scheduler -n=1, --num_runs=1
      Set the number of runs to execute before exiting.

    5 scheduler instances +
  11. The bad & ugly - the deadlock

    • Elastic MapReduce (EMR) runs tasks sequentially
    • Airflow waits until submitted task finishes using a sensor
    • E.g.: EmrAddStepsOperator + EmrStepSensor
    • Split scheduling & execution - use distributed executor (CeleryExecutor)?
  12. The bad & ugly - LatestOnlyOperator +

    • LatestOnlyOperator - runs the last/latest DAG instance
    • What if you schedule for 5-minute intervals?
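Why short intervals are a problem becomes clearer from the time-window check behind the operator. The sketch below is a simplified model, assuming a fixed `timedelta` interval; the real operator derives the window from the DAG's schedule, and `is_latest_run` is a hypothetical name. A run counts as the latest only while the current time falls inside the interval that follows its scheduled start.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, interval, now):
    """Return True if `now` falls inside the window in which this run is the latest.

    Simplified model: the run for execution_date starts at execution_date + interval
    and stops being the latest one interval after that.
    """
    window_start = execution_date + interval
    window_end = window_start + interval
    return window_start < now <= window_end
```

With a 5-minute interval, a run with execution date 12:00 is the latest only between 12:05 and 12:10. If the task waits in the queue until 12:11, the check fails and downstream tasks are skipped, even though no newer run has done the work yet.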
  13. The bad & ugly - maintenance

    A series of DAGs/Workflows to help maintain the operation of Airflow

    • db-cleanup - A maintenance workflow that you can deploy into Airflow to periodically clean out the DagRun and TaskInstance DB entries to avoid having too much data in your Airflow MetaStore.
    • kill-halted-tasks - A maintenance workflow that you can deploy into Airflow to periodically kill off tasks that are running in the background that don't correspond to a running task in the DB.
    • log-cleanup - A maintenance workflow that you can deploy into Airflow to periodically clean out the task logs to avoid those getting too big.

    https://github.com/teamclairvoyant/airflow-maintenance-dags
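To give an idea of what a log-cleanup job does, here is a minimal standalone sketch. It is not the implementation from the airflow-maintenance-dags repository; `remove_old_logs` and its parameters are hypothetical, and the function simply deletes files under a directory whose modification time is older than a cutoff.

```python
import os
import time


def remove_old_logs(log_dir, max_age_days, now=None):
    """Delete files under log_dir older than max_age_days; return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Such a function could be wrapped in a task and scheduled daily, which is roughly the shape of the maintenance workflows listed above.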
  14. We are doing it wrong

    • Airflow is doing traditional ETL
    • Airflow manages task state for you
    • Airflow should be good for incremental processing, not batch rewrites
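One reading of the last point is that each run should handle only the slice of data belonging to its own schedule interval, rather than rewriting everything. The sketch below illustrates that idea with plain Python; `run_window`, `select_increment` and the `ts` field are hypothetical names, and the fixed `timedelta` interval is an assumption.

```python
from datetime import datetime, timedelta


def run_window(execution_date, interval):
    """Return the half-open [start, end) window that one run is responsible for."""
    return execution_date, execution_date + interval


def select_increment(rows, execution_date, interval):
    """Keep only the rows whose timestamp falls inside this run's window."""
    start, end = run_window(execution_date, interval)
    return [row for row in rows if start <= row["ts"] < end]
```

Because the window depends only on the execution date, re-running a failed task processes the same slice again, which keeps retries and backfills predictable.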