Amazon Managed Workflows for Apache Airflow

• MWAA ENVIRONMENT SETUP
• DAG CODE OVERVIEW
• SCHEDULING ORDER + ITEM JOIN DAG
• SCHEDULER ALTERNATIVES
• AIRFLOW OVERVIEW
• Developed by Airbnb and open-sourced in 2015
• Part of the Apache Software Foundation since 2016
• Several Airflow SaaS providers, incl. AWS
• An Airflow workflow is represented as a Directed Acyclic Graph (DAG)
• Users design DAGs programmatically in Python (configuration as code)
1. Task: ingest data into Hudi tables (S3 raw-data bucket)
2. Sensor: wait for ingestion to complete
3. Task: join data via an EMR job and store the result in the Hudi table “joined”
4. Sensor: wait for join to complete
..

Let’s jump to the actual code of this DAG.
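As a library-independent warm-up, the four steps listed above can be modelled as a dependency graph in plain Python. This is only a sketch built on the standard-library `graphlib` module. It is not Airflow's API and not the DAG code from this deck, and the task names are hypothetical. In Airflow itself, the same chain would be declared with operators and sensors linked by the `>>` operator.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the four steps listed above.
# Each key maps to the set of tasks that must finish before it may start.
DEPENDENCIES = {
    "ingest_to_hudi": set(),
    "wait_for_ingestion": {"ingest_to_hudi"},
    "join_via_emr": {"wait_for_ingestion"},
    "wait_for_join": {"join_via_emr"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError when the graph contains
    a cycle, which is exactly what "acyclic" rules out.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, task in enumerate(execution_order(DEPENDENCIES), start=1):
        print(position, task)
```

Because each step depends only on the one before it, there is a single valid order here: task, sensor, task, sensor. A scheduler only has freedom to reorder or parallelize tasks once the graph branches.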
AIRFLOW - MWAA
• Quite new, since 2020
• Big open-source community (annual Airflow conference)
• Lots of already implemented operators
• Driven by Python scripts
• Incremental loads via execution date and variables

GLUE WORKFLOWS
• Quite new feature, since 2019
• Tightly integrated with Glue
• AWS proprietary tool
• Driven by Python scripts plus
• Incremental load via Glue Bookmarks

OTHERS
• AWS Step Functions
  • Driven by Amazon States Language (JSON)
  • Hard to persist user state in a state machine
• Dagster
  • Driven by Python scripts and YAML configuration files
  • Similar concept to Airflow