Slide 1

Slide 1 text

SUMMIT BERLIN

Slide 2

Slide 2 text

© 2022, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Automating data pipelines with Apache Airflow
Ricardo Sueiras, Developer Advocate for Open Source, Amazon Web Services

Slide 3

Slide 3 text

What is in your data pipeline?
Move, Clean, Combine, Filter, Store, Secure

Slide 4

Slide 4 text

orchestration /ɔːkɪˈstreɪʃ(ə)n/ noun
1. the arrangement or scoring of music for orchestral performance.
2. the planning or coordination of the elements of a situation to produce a desired effect.

Slide 5

Slide 5 text

Introducing Apache Airflow

Slide 6

Slide 6 text

workflow /ˈwəːkfləʊ/ noun
1. the sequence of steps (tasks) involved in moving from the beginning to the end of a working process.

Slide 7

Slide 7 text

DAG

from airflow import DAG
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'daily_dw_ingest'

dag = DAG(
    dag_id=DAG_ID,
    default_args=default_args,
    description='First Apache Airflow DAG',
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['devcon', 'demo'],
)

dag_id=daily_dw_ingest

Slide 8

Slide 8 text

Task

from airflow.operators.bash import BashOperator

move_file = BashOperator(
    task_id='move_current_file',
    bash_command="cd {work_dir} && mv {source_file} {destination_file}",
    dag=dag
)

task_id=move_current_file

Slide 9

Slide 9 text

Task

from airflow.operators.bash import BashOperator

move_file = BashOperator(
    task_id='move_current_file',
    bash_command="cd {work_dir} && mv {source_file} {destination_file}",
    dag=dag
)

Operators: BashOperator, PythonOperator, DummyOperator, …

Slide 10

Slide 10 text

Operators
• BashOperator
• PythonOperator
• DummyOperator
• …
Operators, Sensors, and Hooks use Airflow Connections
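
To make the connection idea concrete, here is a minimal sketch (not from the slides) of a sensor waiting for an object in Amazon S3 and a hook reading it afterwards, both resolving credentials through the 'aws_default' Airflow connection. The bucket and key names are hypothetical, and the import paths assume a recent release of the Amazon provider package.

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def read_file():
    # Hook: re-uses the same Airflow connection to fetch the object's contents
    hook = S3Hook(aws_conn_id="aws_default")
    content = hook.read_key(key="incoming/data.csv", bucket_name="my-demo-bucket")  # hypothetical names
    print(content[:200])


with DAG(
    dag_id="sensor_hook_demo",
    schedule_interval=None,
    start_date=days_ago(1),
    tags=["demo"],
) as dag:
    # Sensor: polls S3 until the key exists, using the 'aws_default' connection
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-demo-bucket",    # hypothetical bucket
        bucket_key="incoming/data.csv",  # hypothetical key
        aws_conn_id="aws_default",
        poke_interval=60,
    )

    read_it = PythonOperator(task_id="read_file", python_callable=read_file)

    wait_for_file >> read_it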

Slide 11

Slide 11 text

Operators
• Airbyte • Alibaba • Amazon • Apache Beam • Apache Cassandra • Apache Drill • Apache Druid • Apache HDFS • Apache Hive • Apache Kylin • Apache Livy • Apache Pig • Apache Pinot • Apache Spark • Apache Sqoop • Asana • Celery • IBM Cloudant • Kubernetes • Databricks • Datadog • Dingding • Discord • Docker • Elasticsearch • Exasol • Facebook • File Transfer Protocol (FTP) • Github • Google • gRPC • Hashicorp • Hypertext Transfer Protocol (HTTP) • Influx DB • Internet Message Access Protocol (IMAP) • Java Database Connectivity (JDBC) • Jenkins • Jira • Microsoft Azure • Microsoft PowerShell Remoting Protocol (PSRP) • Microsoft SQL Server (MSSQL) • Windows Remote Management (WinRM) • MongoDB • MySQL • Neo4J • ODBC • OpenFaaS • Opsgenie • Oracle • Pagerduty • Papermill • Plexus • PostgreSQL • Presto • Qubole • Redis • Salesforce • Samba • Segment • Sendgrid • SFTP • Singularity • Slack • Snowflake • SQLite • SSH • Tableau • Telegram • Trino • Vertica • Yandex • Zendesk
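
Each of these integrations ships as a separate provider package installed alongside Airflow itself. As a small illustration (a sketch, assuming the Amazon provider package is installed), its operators, sensors, and hooks become importable under the airflow.providers.amazon.aws namespace:

# Providers are installed as separate pip packages, for example:
#   pip install apache-airflow-providers-amazon
from airflow.providers.amazon.aws.operators.ecs import ECSOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hooks wrap an Airflow connection (here the default AWS connection)
s3 = S3Hook(aws_conn_id="aws_default")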

Slide 12

Slide 12 text

Task (repeated for each of the five task_ids below)

from airflow import DAG
from datetime import datetime, timedelta
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

default_args = {
    'owner': 'ubuntu',
    'start_date': datetime(2019, 8, 14),
    'retry_delay': timedelta(seconds=60*60)
}
..
..
..

task_id=copy_data
task_id=store_raw_data
task_id=clean_data
task_id=process_data
task_id=move_to_datawarehouse

DAG

from airflow import DAG
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False
}

DAG_ID = 'daily_dw_ingest'

dag = DAG(
    dag_id=DAG_ID,
    default_args=default_args,
    description='First Apache Airflow DAG',
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['devcon', 'demo'],
)

Import Python libraries
Define standard settings
Define workflow settings
Define workflow name

Slide 13

Slide 13 text

Control flow

Task (repeated for each of the five task_ids below)

from airflow import DAG
from datetime import datetime, timedelta
from airflow.providers.amazon.aws.operators.ecs import ECSOperator

default_args = {
    'owner': 'ubuntu',
    'start_date': datetime(2019, 8, 14),
    'retry_delay': timedelta(seconds=60*60)
}
..
..
..

task_id=copy_data
task_id=store_raw_data
task_id=clean_data
task_id=process_data
task_id=move_to_datawarehouse

Defining task dependencies:

copy_data >> store_raw_data
copy_data >> clean_data >> process_data >> move_to_datawarehouse

copy_data.set_downstream([store_raw_data])
copy_data.set_downstream([clean_data, store_raw_data, move_to_datawarehouse])
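
Putting the pieces together, here is a minimal, self-contained sketch (not taken verbatim from the slides) of the same five tasks wired up with the >> operator; DummyOperator stands in for the real work in each step.

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
}

with DAG(
    dag_id='daily_dw_ingest',
    default_args=default_args,
    description='Pipeline control-flow sketch',
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['devcon', 'demo'],
) as dag:
    # Placeholder tasks; in a real pipeline each would be an operator doing work
    copy_data = DummyOperator(task_id='copy_data')
    store_raw_data = DummyOperator(task_id='store_raw_data')
    clean_data = DummyOperator(task_id='clean_data')
    process_data = DummyOperator(task_id='process_data')
    move_to_datawarehouse = DummyOperator(task_id='move_to_datawarehouse')

    # copy_data fans out: the raw copy is archived, while the cleaned branch continues on
    copy_data >> store_raw_data
    copy_data >> clean_data >> process_data >> move_to_datawarehouse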

Slide 14

Slide 14 text

Deploying your DAGs

Slide 15

Slide 15 text

Apache Airflow Scheduler and Workers

Slide 16

Slide 16 text

Scheduling our workflows (DAGs)

dag = DAG(
    dag_id="daily_dw_ingest",
    schedule_interval=None,
    start_date=datetime.datetime(2022, 2, 1),
    catchup=False,
    tags=["example"],
)

schedule_interval
schedule_interval="*/5 * * * *"  - every 5 minutes
schedule_interval="*/10 * * * *" - every 10 minutes
schedule_interval="0 */1 * * *"  - every hour
schedule_interval="0 */2 * * *"  - every 2 hours

New: To provide more scheduling flexibility, determining when a DAG should run is now done with Timetables.
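
As a quick sketch of the cron-style scheduling above (which Timetables generalise), the same DAG set to run every two hours would look like this; preset strings such as "@daily" are also accepted.

import datetime

from airflow import DAG

dag = DAG(
    dag_id="daily_dw_ingest",
    schedule_interval="0 */2 * * *",  # cron expression: at minute 0 of every 2nd hour
    # schedule_interval="@daily",     # presets like @hourly, @daily, @weekly also work
    start_date=datetime.datetime(2022, 2, 1),
    catchup=False,
    tags=["example"],
)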

Slide 17

Slide 17 text

Understanding how workflows (DAGs) run

dag = DAG(
    dag_id="daily_dw_ingest",
    schedule_interval="0 */1 * * *",
    start_date=datetime.datetime(2022, 2, 1),
    catchup=True,
    tags=["example"],
)

Execution date: 1st Feb 2022, 01:00
Execution date: 1st Feb 2022, 02:00
Execution date: 1st Feb 2022, 03:00
…
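
With catchup=True and an hourly interval, the scheduler backfills one DAG run per missed interval between start_date and now, and each run can read its own logical (execution) date through templating. A minimal sketch, assuming the standard {{ ts }} template variable:

import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_dw_ingest",
    schedule_interval="0 */1 * * *",
    start_date=datetime.datetime(2022, 2, 1),
    catchup=True,  # backfill every missed hourly interval since start_date
    tags=["example"],
) as dag:
    # {{ ts }} renders to the run's logical (execution) timestamp, so each
    # backfilled run processes the slice of data belonging to its own hour.
    report = BashOperator(
        task_id="print_execution_date",
        bash_command="echo 'processing data for {{ ts }}'",
    )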

Slide 18

Slide 18 text

Apache Airflow Metadata

Slide 19

Slide 19 text

Apache Airflow UI

Slide 20

Slide 20 text

Demo

Slide 21

Slide 21 text

Building our data pipeline
1. Source and copy data
2. Merge and consolidate data
3. Clean data
4. Ingest into data lake
5. Copy into our data warehouse
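
The final stage of a pipeline like this lands curated data in the warehouse. As one hedged illustration (the bucket, table, and connection names are hypothetical, and the import path assumes the Amazon provider package), a COPY from the data lake into Amazon Redshift can be expressed as a single transfer operator:

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="load_curated_data",
    schedule_interval=None,
    start_date=days_ago(1),
    tags=["demo"],
) as dag:
    # Issues a Redshift COPY of the curated objects into the target table
    move_to_datawarehouse = S3ToRedshiftOperator(
        task_id="move_to_datawarehouse",
        s3_bucket="my-data-lake-bucket",  # hypothetical bucket
        s3_key="curated/",                # hypothetical prefix
        schema="public",
        table="daily_ingest",             # hypothetical table
        copy_options=["CSV"],
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )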

Slide 22

Slide 22 text

Building our data pipeline (architecture diagram: Development, Internet, AWS Cloud)

Slide 23

Slide 23 text

Building our data pipeline (architecture diagram: Development, Internet, AWS Cloud)

Slide 24

Slide 24 text

Building our data pipeline (architecture diagram: Development, Internet, AWS Cloud)

Slide 25

Slide 25 text

Demo

Slide 26

Slide 26 text

Open source your way
• Directly
• Through AWS services
• Via AWS Partner solutions

Slide 27

Slide 27 text

Self-managed
Deploy Apache Airflow on Amazon EC2 instances, or on container orchestrators such as Amazon ECS or Amazon EKS

Slide 28

Slide 28 text

Managed service
Use Amazon Managed Workflows for Apache Airflow (MWAA), a fully managed service for Apache Airflow
(architecture diagram: AWS Cloud, VPC, Amazon Redshift cluster / VPC)
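
With MWAA the scheduler, workers, and metadata database are managed for you; deploying a workflow amounts to uploading the DAG file to the S3 location the environment watches. A minimal sketch using boto3, assuming an existing environment named 'my-mwaa-env' (the bucket and DAG path are looked up from the environment rather than hard-coded):

import boto3

mwaa = boto3.client("mwaa")
s3 = boto3.client("s3")

# Look up where the environment reads its DAGs from
env = mwaa.get_environment(Name="my-mwaa-env")["Environment"]  # hypothetical environment name
bucket = env["SourceBucketArn"].split(":::")[-1]               # arn:aws:s3:::my-bucket -> my-bucket
dag_prefix = env.get("DagS3Path", "dags")

# Upload the DAG file; MWAA picks it up and it appears in the Airflow UI shortly after
s3.upload_file("daily_dw_ingest.py", bucket, f"{dag_prefix}/daily_dw_ingest.py")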

Slide 29

Slide 29 text

AWS Marketplace
Find AWS Partners who can help you deploy Apache Airflow or who provide Apache Airflow managed services

Slide 30

Slide 30 text

Developing locally
aws-mwaa-local-runner helps developers develop, debug, test, and run workflows locally
https://github.com/aws/aws-mwaa-local-runner

Slide 31

Slide 31 text

How AWS contributes to Apache Airflow
• No forks
• Supporting the project
• Upstream contributions

Slide 32

Slide 32 text

Find out more about contributing
The Apache Airflow community makes contributing to this project both a welcoming and straightforward experience.
https://dev.to/aws

Slide 33

Slide 33 text

Resources

Slide 34

Slide 34 text

Case studies
https://aws.amazon.com/blogs/opensource/

Slide 35

Slide 35 text

Code resources

Slide 36

Slide 36 text

Keep up to date with open source at AWS
@AWSOpen
https://aws.amazon.com/blogs/opensource/
https://dev.to/aws/

Slide 37

Slide 37 text

Thank you!
Ricardo Sueiras, Developer Advocate for Open Source, Amazon Web Services
https://www.linkedin.com/in/ricardosueiras
@094459

Slide 38

Slide 38 text

Learn in-demand AWS Cloud skills
Deepen your skills with digital learning on demand

AWS Skill Builder
• Access 500+ free digital courses and Learning Plans
• Explore resources with a variety of skill levels and 16+ languages to meet your learning needs
• Train now

AWS Certifications
• Earn an industry-recognized credential
• Receive Foundational, Associate, Professional, and Specialty certifications
• Join the AWS Certified community and get exclusive benefits
• Access new exam guides

Slide 39

Slide 39 text

Please complete the session survey