Modern Data Pipelines with Apache Airflow (Momentum Dev Con 2018)

Original presentation on Google Slides - https://docs.google.com/presentation/d/1mCgDT7DEj2jsrr09Omm4lAspihmPXaA8r-dBuyPNT5U/edit?usp=sharing

---

Abstract:

Big data needs span both business users and developers across the world. Most classical ETL and BI tools tried to cater to this hybrid audience, resulting in cluttered GUI tools that were inflexible, inextensible, and frustrating to use. Apache Airflow takes a different approach by representing tasks and config as Python code.

Airflow is a platform to programmatically author, schedule and monitor workflows composed of arbitrary tasks run on regular schedules. It distributes task execution across complex workflows, which are represented as directed acyclic graphs (DAGs) defined in code.

Built on top of Airflow, Astronomer provides a containerized Airflow service on Kubernetes as well as a variety of Airflow components and integrations to promote code reuse, extensibility, and modularity. The core of our stack is available cloud-hosted and on-prem, and is also fully open source.

---

Bio: Taylor D. Edmiston

Taylor Edmiston is a senior software engineer experienced in designing and building backend systems, from web apps to APIs to platforms, for startups. He has software development experience at startups from multiple top accelerators including AngelPad, Techstars, and The Brandery. Currently, he's a developer on the core team at Astronomer.io working on the customer data platform that runs batch workflows via Airflow and clickstream pipelines via Kafka on top of Kubernetes. He's in the top 25% all time on Stack Overflow, having reached over 500k fellow software developers, and in the top 1% on Codewars.

On a personal note, he enjoys getting stamps in his passport and has traveled to 9 countries across 4 continents so far.

---

Bio: Andy Cooper

Andy Cooper is a Software Engineer who previously focused on ETL and Business Intelligence development. More recently he has applied those skills to building a Data Engineering Platform at Astronomer.

Outside of work Andy enjoys just about any outdoor activity, including climbing, hiking, biking and skiing.

Taylor Edmiston

April 19, 2018

Transcript

  1. Modern Data Pipelines with Apache Airflow
    Andy Cooper & Taylor Edmiston @ Astronomer.io
    Momentum Dev Con 2018

  2. About Us
    Taylor Edmiston
    ● Backend software engineer building the Airflow platform at Astronomer.io
    ● 9 years with Python, 6 years as a professional developer
    ● Top 20% all time on Stack Overflow with a reach of 750k developers
    ● Enjoys travel - 9 countries / 4 continents
    Andy Cooper
    ● Data Engineer
    ● 6 years of experience developing software and data pipelines
    ● Began career developing traditional data warehouses with Microsoft stack
    ● Using Airflow since 1.7

  3. What is Astronomer?
    ● Astronomer is a data engineering platform built on Apache Airflow and clickstream analytics
    ● Building tools that make data engineers' lives easier
    ● Seed-stage startup, founded ~3 years ago, located in Cincinnati (OTR)
    ● AngelPad #9 batch
    ● https://www.astronomer.io
    ● https://www.crunchbase.com/organization/astronomer

  4. What do we do?
    Airflow
    ● Astronomer Cloud (Managed Airflow)
    ○ Get up and running with Airflow quickly
    ● Astronomer Enterprise (docs)
    ○ Keep your data and workflows in your private cloud
    ○ Astronomer Spacecamp - Enterprise support & training available (https://www.astronomer.io/blog/announcing-astronomer-spacecamp/)
    ● Astronomer Open (docs)
    ○ The core of our platform is open source - try our Docker images on your machine
    Clickstream
    ● A clickstream analytics pipeline and router for user events
    ● Client-side (web, native mobile) or server-side
    ● Not an analytics service! We integrate with 50+
    ● Free tier
    ● astronomer.io/clickstream
    ● 2-min demo video - https://www.youtube.com/watch?v=ru7VMe5MXZk

  5. (~40 min) Outline
    ● (5 min) Intro
    ● (10 min) Part I - Airflow overview & concepts
    ● (10 min) Part II - Example DAGs
    ● Midpoint Q&A?
    ● (10 min) Part III - Getting started with Airflow + Astro CLI demo
    ● (5 min) Summary / Outro
    ● Q&A

  6. What We’ll Cover
    ● Airflow Concepts
    ● Getting Started with Airflow
    ● Astro CLI
    ● Preview and Discussion Of Airflow UI
    ● Q&A

  7. What is Apache Airflow?
    ● “Airflow is a platform to programmatically author, schedule and monitor workflows.”
    ● Open source, currently in the Apache Incubator phase
    ○ 7,500 stars
    ○ 4,000 commits
    ○ 400 contributors
    ● Written in Python
    ● Leverages Flask web framework

  8. Airflow Concepts

  9. What is a DAG?
    Directed Acyclic Graph
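
To make the term concrete, here is a minimal plain-Python sketch (not Airflow code) of a workflow as a directed acyclic graph. The task names are hypothetical. It assumes Python 3.9+, where the standard library's `graphlib.TopologicalSorter` yields an order in which every task runs after its upstream dependencies and rejects graphs that contain a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(graph):
    """Return one valid run order, or raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dependencies)
print(order)

# Making "extract" depend on "notify" closes a loop, so no valid order exists.
cyclic = dict(dependencies, extract={"notify"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

The "acyclic" part is what guarantees a scheduler can always find a task whose dependencies are already satisfied.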

  10. Define Your Pipelines in Code
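
As a rough illustration of what "pipelines in code" can look like, the toy class below overloads `>>` so that dependencies read left to right. Airflow operators support a similar `>>` syntax, but the `Task` class and `walk` helper here are hypothetical stand-ins written for this sketch, not the Airflow API.

```python
class Task:
    """Toy stand-in for an operator; not the Airflow API."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records that b runs after a, and returns b so chains compose.
        self.downstream.append(other)
        return other


extract = Task("extract")
transform = Task("transform")
load = Task("load")

# The whole pipeline is ordinary Python, so it can be reviewed, tested, and versioned.
extract >> transform >> load


def walk(task):
    """Follow the first downstream edge from a task and collect the task ids."""
    ids = [task.task_id]
    while task.downstream:
        task = task.downstream[0]
        ids.append(task.task_id)
    return ids


print(walk(extract))
```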

  11. A Centralized Web App for All Workflows

  12. Web App Features
    ● A quick look into DAG and task progress
    ● Error Logging
    ● Connections & Variables
    ● Connection Pooling

  13. Hooks and Operators

  14. Hooks
    ● An interface to an external system
    ● Often a wrapper for an API client
    ● Examples
    ○ DbApiHook
    ○ S3Hook
    ○ SlackHook
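
The "wrapper for an API client" idea can be sketched in plain Python as follows. The `get_conn` method name follows a convention used by Airflow's hooks; everything else (`FakeChatClient`, `ChatHook`, the token and channel) is hypothetical and exists only for this example.

```python
class FakeChatClient:
    """Hypothetical API client; a real one would make HTTP calls."""

    def __init__(self, token):
        self.token = token
        self.sent = []

    def post(self, channel, text):
        self.sent.append((channel, text))
        return {"ok": True}


class ChatHook:
    """Toy hook: owns connection details and exposes a small, task-friendly interface."""

    def __init__(self, token):
        self._token = token
        self._client = None

    def get_conn(self):
        # Create the client lazily and reuse it across calls.
        if self._client is None:
            self._client = FakeChatClient(self._token)
        return self._client

    def send_message(self, channel, text):
        return self.get_conn().post(channel, text)


hook = ChatHook(token="example-token")
response = hook.send_message("#data-alerts", "nightly load finished")
print(response, hook.get_conn().sent)
```

Keeping credentials and client setup inside the hook means tasks only call `send_message` and never touch connection details directly.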

  15. Operators
    ● Sensor Operators
    ○ S3KeySensor
    ○ S3PrefixSensor
    ○ HttpSensor
    ● Action Operators
    ○ BashOperator
    ○ PythonOperator
    ○ EmailOperator
    ● Transfer Operators
    ○ SalesforceToRedshiftSchemaSync
    ○ SalesforceToS3
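
Of the three operator kinds, sensors are the least obvious: they wait for an external condition rather than doing work. The sketch below shows the polling idea in plain Python. The names `poke`, `poke_interval`, and `timeout` mirror Airflow's sensor terminology, but the `FileArrivedSensor` class and the fake bucket listing are assumptions made for this example.

```python
import time


class FileArrivedSensor:
    """Toy sensor: succeeds once a key shows up in a (fake) bucket listing."""

    def __init__(self, list_keys, key, poke_interval=0.01, timeout=1.0):
        self.list_keys = list_keys          # callable returning the current keys
        self.key = key
        self.poke_interval = poke_interval  # seconds between checks
        self.timeout = timeout              # give up after this many seconds

    def poke(self):
        # One check of the external condition.
        return self.key in self.list_keys()

    def execute(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{self.key} did not arrive in time")
            time.sleep(self.poke_interval)
        return True


# Fake bucket whose listing gains the key on the third call.
calls = {"n": 0}


def list_keys():
    calls["n"] += 1
    return ["events/2018-04-19.json"] if calls["n"] >= 3 else []


sensor = FileArrivedSensor(list_keys, "events/2018-04-19.json")
arrived = sensor.execute()
print(arrived, "after", calls["n"], "pokes")
```

An action or transfer operator would sit downstream of a sensor like this and run only after it succeeds.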

  16. DAG Runs & Task Instances

  19. Dynamic DAGs
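
One common reading of "dynamic DAGs" is generating many similar pipelines from configuration instead of writing each by hand. The plain-Python sketch below shows only that generation step; the customer list, table names, and `build_pipeline` helper are hypothetical. In Airflow the same loop would typically build real DAG objects and expose them at module level so the scheduler can discover them, which is omitted here.

```python
# Hypothetical per-customer configuration, e.g. fetched from an API or a stored variable.
CUSTOMERS = [
    {"id": "acme", "tables": ["page_views", "clicks"]},
    {"id": "globex", "tables": ["orders"]},
]


def build_pipeline(customer):
    """Return a dict describing one pipeline: an id plus an ordered task list."""
    tasks = []
    for table in customer["tables"]:
        tasks.append(f"wait_for_{table}")
        tasks.append(f"load_{table}")
    return {"dag_id": f"clickstream_{customer['id']}", "tasks": tasks}


# One pipeline per customer, produced by a loop instead of copy-pasted definitions.
pipelines = {}
for customer in CUSTOMERS:
    pipeline = build_pipeline(customer)
    pipelines[pipeline["dag_id"]] = pipeline

for dag_id, pipeline in sorted(pipelines.items()):
    print(dag_id, pipeline["tasks"])
```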

  20. Executors & Scaling

  21. Executors
    ● SequentialExecutor
    ● LocalExecutor
    ○ No additional dependencies
    ○ Runs tasks in parallel (via multiprocessing) out of the box
    ● CeleryExecutor
    ● MesosExecutor
    ● KubernetesExecutor (future)
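
The practical difference between a sequential executor and a parallel one can be sketched with the standard library. This is an analogy, not how Airflow's executors are implemented: a thread pool stands in for the worker processes or remote workers a real executor would use, and `run_task` and the task ids are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Pretend task: sleep briefly, then report completion."""
    time.sleep(0.05)
    return f"{task_id}: done"


task_ids = ["load_a", "load_b", "load_c", "load_d"]


def run_sequentially(ids):
    # One task at a time, in order.
    return [run_task(t) for t in ids]


def run_in_pool(ids, workers=4):
    # Up to `workers` tasks at once; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, ids))


start = time.perf_counter()
sequential_results = run_sequentially(task_ids)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
pooled_results = run_in_pool(task_ids)
pooled_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, pooled: {pooled_seconds:.2f}s")
```

Both runs produce the same results; only the wall-clock time differs, which is the trade-off the executor choice controls.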

  22. Plugins

  23. What can a plugin do?
    ● Extend the Airflow API
    ● Build new dashboards
    ● Create custom Hooks and Operators
    ● Astronomer maintains the most comprehensive collection of Airflow Plugins
    ○ github.com/airflow-plugins
    ● Code reuse, composition, good software engineering practices, etc.
    ● Examples
    ○ Salesforce To Redshift Plugin
    ○ airflow-api-plugin
    ○ Airflow DAG Creation Manager Plugin

  24. Example DAGs

  25. DAG Examples
    ● GitHub stats DAG
    ● Clickstream Redshift loader DAG
    ○ ~200 million events per month from customer apps
    ○ ~2 million Airflow task instances per month
    ● https://github.com/airflow-plugins/Example-Airflow-DAGs

  26. GitHub Issue and Commit Tracking Example

  27. Clickstream Redshift DAG

  28. Clickstream Redshift DAG
    ● Your Website → Astronomer Clickstream → S3 → [S3 sensor → Redshift copy via Apache Spark]
    ● Dynamic DAGs configured via API → Scheduler (cached) → Variable

  32. Astro CLI
    The fastest way to get started with Airflow

  33. How can I get started with Airflow?
    ● Source Code
    ○ https://github.com/astronomerio/astro-cli
    ● Install CLI
    ○ $ curl -sL https://install.astronomer.io | sudo bash
    ● Start a Project
    ○ $ mkdir test-project && cd test-project
    ○ $ astro airflow init
    ○ $ astro airflow start

  34. Takeaway
    ● Part I - Airflow overview & concepts
    ● Part II - Example DAGs
    ● Part III - Getting started with Airflow + Astro CLI demo

  35. Resources
    ● Official
    ○ https://github.com/apache/incubator-airflow
    ○ https://airflow.apache.org
    ○ Airflow Dev Mailing List
    ○ Apache Airflow meetups
    ● Community
    ○ https://github.com/airflow-plugins
    ○ https://soundcloud.com/the-airflow-podcast
    ○ https://github.com/jghoman/awesome-apache-airflow
    ● Related Talks
    ○ https://blog.tedmiston.com/talks/

  36. Contact Info
    ● Andy
    ○ https://twitter.com/andscoop
    ○ https://www.linkedin.com/in/andscoop/
    ○ https://andscoop.com/
    ● Taylor
    ○ https://twitter.com/kicksopenminds
    ○ https://www.linkedin.com/in/tedmiston/
    ○ https://blog.tedmiston.com
