Description
Moving data through transformations and from one place to another is a big part of data science and data engineering. We’ve been using Airflow for several months at Clover Health and have learned a lot about its strengths and weaknesses. This talk offers a practical introduction to Airflow, with the information people need to decide whether Airflow is right for them and how to get started.
Abstract
Airflow is a popular pipeline orchestration tool for Python that allows users to configure complex (or simple!) multi-system workflows that are executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which tasks it can run in parallel and which must wait for others to finish. Airflow is written in Python, and users can add their own operators with custom functionality, doing anything Python can do.
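To make the dependency idea concrete, here is a minimal sketch in plain Python. It is not Airflow's API or its scheduler; it only illustrates how declared dependencies determine which tasks could run side by side. The task names and the `parallel_batches` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the scheduling logic.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_members": set(),
    "extract_claims": set(),
    "join_tables": {"extract_members", "extract_claims"},
    "build_report": {"join_tables"},
}


def parallel_batches(deps):
    """Group tasks into batches whose dependencies are already satisfied."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # Every task returned here has all of its upstream tasks done,
        # so the tasks within one batch could run in parallel.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

In this sketch the two extract tasks land in the first batch because neither depends on anything, while the join and the report each wait for the batch before them.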
At Clover Health, we’ve been pushing Airflow’s limits, digging into the source code, and contributing patches upstream. In this talk, we’ll cover the basics of Airflow so you can use what we’ve learned to start your Airflow journey on the right foot. This talk aims to answer questions such as: What is Airflow useful for? How do I get started? What do I need to know that’s not in the docs?
Bio
I have been a scientific Python developer since 2008. I’ve worked in atmospheric science, astronomy, urban planning, web applications, and healthcare. I maintain several open source Python libraries and am currently a data engineer at Clover Health.