
Automating data pipelines with Apache Airflow


Data orchestration is the process of taking siloed data from multiple storage locations, combining and organizing it, and making it available to your developers, data engineers, and data scientists. This enables businesses to automate and streamline data-driven decision making. Apache Airflow is an open source orchestration tool that lets you programmatically create workflows in Python to run, schedule, monitor, and manage data engineering pipelines. No more manually managing those cron jobs! In this session, we will look at the architecture of Apache Airflow and then show you how to create and deploy a typical workflow. You will see how you can use the open source provider libraries to simplify your workflows when creating an end-to-end data pipeline. Expect lots of code and demos in this session. [15 min talk/presentation, 20-30 min demo]

Ricardo Sueiras

June 03, 2022

Transcript

  1. © 2022, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Automating data pipelines with Apache Airflow Ricardo Sueiras Developer Advocate for Open Source Amazon Web Services
  2. What is in your data pipeline? Move Clean Combine Filter Store Secure
  3. orchestration /ɔːkɪˈstreɪʃ(ə)n/ noun 1. the arrangement or scoring of music for orchestral performance. 2. the planning or coordination of the elements of a situation to produce a desired effect
  4. Introducing Apache Airflow
  5. workflow /ˈwəːkfləʊ/ noun 1. the sequence of steps (tasks) involved in moving from the beginning to the end of a working process
  6. DAG from airflow import DAG from airflow.utils.dates import days_ago default_args = { 'owner': 'airflow', 'depends_on_past': False, 'email': ['[email protected]'], 'email_on_failure': False, 'email_on_retry': False } DAG_ID = 'daily_dw_ingest' dag = DAG( dag_id=DAG_ID, default_args=default_args, description='First Apache Airflow DAG', schedule_interval=None, start_date=days_ago(2), tags=['devcon','demo'], ) dag_id=daily_dw_ingest
  7. Task move_file = BashOperator( task_id='move_current_file', bash_command="cd {work_dir} && mv {source_file} {destination_file}", dag=dag ) task_id=move_current_file
  8. Task move_file = BashOperator( task_id='move_current_file', bash_command="cd {work_dir} && mv {source_file} {destination_file}", dag=dag ) import BashOperator PythonOperator DummyOperator … Operators
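Under the hood, a BashOperator task boils down to rendering the templated command and handing it to a shell. The stdlib-only sketch below imitates that for the `move_current_file` task above; `run_bash_task` is a hypothetical stand-in (with `str.format` standing in for Airflow's real Jinja templating), not Airflow's actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

def run_bash_task(bash_command: str, **params) -> int:
    """Toy stand-in for what a BashOperator does: render the
    templated command, then run it in a shell (not Airflow's code)."""
    rendered = bash_command.format(**params)  # crude stand-in for Jinja templating
    result = subprocess.run(rendered, shell=True, check=True)
    return result.returncode

# Mirror the slide's move_current_file task in a temp directory.
work_dir = Path(tempfile.mkdtemp())
(work_dir / "data.csv").write_text("a,b\n1,2\n")

run_bash_task(
    "cd {work_dir} && mv {source_file} {destination_file}",
    work_dir=work_dir,
    source_file="data.csv",
    destination_file="data_raw.csv",
)
print((work_dir / "data_raw.csv").exists())  # True
```

The real operator additionally handles logging, environment variables, and retry behaviour, but the render-then-execute core is the same idea.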
  9. Operators BashOperator PythonOperator DummyOperator … Operators Sensors Hooks Use Airflow Connections
  10. Operators • Airbyte • Alibaba • Amazon • Apache Beam • Apache Cassandra • Apache Drill • Apache Druid • Apache HDFS • Apache Hive • Apache Kylin • Apache Livy • Apache Pig • Apache Pinot • Apache Spark • Apache Sqoop • Asana • Celery • IBM Cloudant • Kubernetes • Databricks • Datadog • Dingding • Discord • Docker • Elasticsearch • Exasol • Facebook • File Transfer Protocol (FTP) • GitHub • Google • gRPC • Hashicorp • Hypertext Transfer Protocol (HTTP) • InfluxDB • Internet Message Access Protocol (IMAP) • Java Database Connectivity (JDBC) • Jenkins • Jira • Microsoft Azure • Microsoft PowerShell Remoting Protocol (PSRP) • Microsoft SQL Server (MSSQL) • Windows Remote Management (WinRM) • MongoDB • MySQL • Neo4j • ODBC • OpenFaaS • Opsgenie • Oracle • Pagerduty • Papermill • Plexus • PostgreSQL • Presto • Qubole • Redis • Salesforce • Samba • Segment • Sendgrid • SFTP • Singularity • Slack • Snowflake • SQLite • SSH • Tableau • Telegram • Trino • Vertica • Yandex • Zendesk
  11. Task (one per task) from airflow import DAG from datetime import datetime, timedelta from airflow.providers.amazon.aws.operators.ecs import ECSOperator default_args = { 'owner': 'ubuntu', 'start_date': datetime(2019, 8, 14), 'retry_delay': timedelta(seconds=60*60) } .. .. .. task_id=copy_data task_id=store_raw_data task_id=clean_data task_id=process_data task_id=move_to_datawarehouse DAG from airflow import DAG default_args = { 'owner': 'airflow', 'depends_on_past': False, 'email': ['[email protected]'], 'email_on_failure': False, 'email_on_retry': False } DAG_ID = 'daily_dw_ingest' dag = DAG( dag_id=DAG_ID, default_args=default_args, description='First Apache Airflow DAG', schedule_interval=None, start_date=days_ago(2), tags=['devcon','demo'], ) Import python libraries Define standard settings Define workflow settings Define workflow name
  12. Control flow Task (one per task) from airflow import DAG from datetime import datetime, timedelta from airflow.providers.amazon.aws.operators.ecs import ECSOperator default_args = { 'owner': 'ubuntu', 'start_date': datetime(2019, 8, 14), 'retry_delay': timedelta(seconds=60*60) } .. .. .. task_id=copy_data task_id=store_raw_data task_id=clean_data task_id=process_data task_id=move_to_datawarehouse copy_data >> store_raw_data copy_data >> clean_data >> process_data >> move_to_datawarehouse copy_data.set_downstream([store_raw_data]) copy_data.set_downstream([clean_data,store_raw_data,move_to_datawarehouse]) Defining task dependencies
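The bitshift syntax on this slide works because Airflow tasks overload Python's `>>` operator. The toy `Task` class below is a hypothetical stand-in (not Airflow's `BaseOperator`) that shows the mechanics: each `>>` records a downstream edge and returns its right-hand side so that chains like `a >> b >> c` compose.

```python
class Task:
    """Toy task illustrating Airflow's bitshift dependency syntax
    (an illustration, not the real Airflow API)."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def set_downstream(self, tasks):
        self.downstream.extend(tasks)

    def __rshift__(self, other):
        # a >> b means "b runs after a"
        self.downstream.append(other)
        return other  # returning the right side makes a >> b >> c chain

copy_data = Task("copy_data")
store_raw_data = Task("store_raw_data")
clean_data = Task("clean_data")
process_data = Task("process_data")
move_to_datawarehouse = Task("move_to_datawarehouse")

# Same dependencies as on the slide
copy_data >> store_raw_data
copy_data >> clean_data >> process_data >> move_to_datawarehouse

print([t.task_id for t in copy_data.downstream])  # ['store_raw_data', 'clean_data']
```

This is why the `>>` form and the `set_downstream()` form on the slide are interchangeable: both just append edges to the same dependency graph.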
  13. Deploying your DAGs
  14. Apache Airflow Scheduler and Workers
  15. Scheduling our Workflows (DAGs) dag = DAG( dag_id="daily_dw_ingest", schedule_interval=None, start_date=datetime.datetime(2022, 2, 1), catchup=False, tags=["example"], ) schedule_interval schedule_interval="*/10 * * * *" – every 10 min schedule_interval="0 */2 * * *" – every 2 hours schedule_interval="0 */1 * * *" – every hour schedule_interval="*/5 * * * *" – every 5 min **New** To provide more scheduling flexibility, determining when a DAG should run is now done with Timetables.
  16. Understanding how workflows (DAGs) run dag = DAG( dag_id="daily_dw_ingest", schedule_interval="0 */1 * * *", start_date=datetime.datetime(2022, 2, 1), catchup=True, tags=["example"], ) Execution Date: 1st Feb, 2022, 01:00 Execution Date: 1st Feb, 2022, 02:00 Execution Date: 1st Feb, 2022, 03:00 …
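With catchup=True, Airflow backfills one run per schedule interval between start_date and the present. The stdlib-only sketch below approximates that calculation for the hourly cron above; hourly_execution_dates is a made-up helper for illustration, and real Airflow derives intervals from cron expressions and Timetables rather than a fixed timedelta.

```python
from datetime import datetime, timedelta

def hourly_execution_dates(start_date, until, interval=timedelta(hours=1)):
    """Simplified sketch of the runs catchup=True would backfill for an
    hourly schedule ("0 */1 * * *"); not Airflow's actual scheduler logic."""
    runs = []
    current = start_date
    while current + interval <= until:
        # A run fires once its interval has fully elapsed, and is
        # stamped with the start of that interval (Airflow convention).
        runs.append(current)
        current += interval
    return runs

runs = hourly_execution_dates(datetime(2022, 2, 1), datetime(2022, 2, 1, 4))
print(runs[0])   # 2022-02-01 00:00:00
print(len(runs)) # 4
```

The key point the slide makes survives the simplification: pausing a DAG with catchup enabled means the scheduler will create every missed run when it resumes.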
  17. Apache Airflow Metadata
  18. Building our data pipeline Source and copy data Merge and consolidate data Clean data Ingest into data lake Copy into our data warehouse
  19. Building our data pipeline AWS Cloud Internet Development
  20. Building our data pipeline AWS Cloud Internet Development
  21. Building our data pipeline AWS Cloud Internet Development
  22. Open source your way Directly Through AWS services Via AWS Partner solutions
  23. Self-managed Deploy Apache Airflow on Amazon EC2 instances or on AWS container orchestrators such as Amazon ECS or Amazon EKS
  24. Managed service Use Amazon Managed Workflows for Apache Airflow (MWAA) – a fully managed service for Apache Airflow VPC AWS Cloud Amazon Redshift Cluster / VPC
  25. AWS Marketplace Find AWS Partners who can help you deploy Apache Airflow or who provide Apache Airflow managed services
  26. Developing locally aws-mwaa-local-runner helps developers develop, debug, test, and run workflows locally https://github.com/aws/aws-mwaa-local-runner
  27. How AWS contributes to Apache Airflow No forks Supporting the project Upstream contributions
  28. Find out more about contributing The Apache Airflow community makes contributing to this project both a welcoming and straightforward experience. https://dev.to/aws
  29. Case studies https://aws.amazon.com/blogs/opensource/
  30. Keep up to date with open source at AWS @AWSOpen https://aws.amazon.com/blogs/opensource/ https://dev.to/aws/
  31. Thank you! Ricardo Sueiras Developer Advocate for Open Source Amazon Web Services https://www.linkedin.com/in/ricardosueiras @094459
  32. Deepen your skills with digital learning on demand Learn in-demand AWS Cloud skills AWS Skill Builder: Access 500+ free digital courses and Learning Plans; explore resources with a variety of skill levels and 16+ languages to meet your learning needs; train now. AWS Certifications: Earn an industry-recognized credential; receive Foundational, Associate, Professional, and Specialty certifications; join the AWS Certified community and get exclusive benefits; access new exam guides.
  33. Please complete the session survey