Slide 1

Slide 1 text

AWS Airflow + EMR
Airflow, Hudi, Spark, Glue Catalog, S3
Alexey Novakov, Data Architect
CONFIDENTIAL | © 2023 EPAM Systems, Inc.

Slide 2

Slide 2 text

Agenda
• Airflow Overview
• MWAA (Amazon Managed Workflows for Apache Airflow) Environment Setup
• DAG Code Overview
• Scheduling Order + Item Join DAG
• Scheduler Alternatives

Slide 3

Slide 3 text

AIRFLOW

Slide 4

Slide 4 text

AIRFLOW IS AN ORCHESTRATOR FOR COMPLEX WORKFLOWS AND DATA PIPELINES.

Slide 5

Slide 5 text

Airflow: Quick Facts
• Developed by Airbnb and open-sourced in 2015
• Part of the Apache Software Foundation since 2016
• Several Airflow SaaS providers, including AWS
• An Airflow workflow is represented as a Directed Acyclic Graph (DAG)
• Users design DAGs programmatically in Python (configuration as code)
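To make the DAG abstraction concrete, here is a minimal sketch in plain Python (standard library only, not Airflow code) of how dependencies between tasks determine a valid run order. The task names are hypothetical, and Airflow's own scheduler is considerably more involved than this.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of upstream tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if the graph is not acyclic."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """A workflow graph is only a DAG if no task depends on itself transitively."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


print(execution_order(dependencies))
```

The "acyclic" part matters: a graph such as `{"a": {"b"}, "b": {"a"}}` has no valid order, which is why a scheduler has to reject it.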

Slide 6

Slide 6 text

Features
• DAGs as Python scripts
• Monitoring (logs, status, execution time, etc.)
• Scalable
• Smart scheduling (cron, backfilling)
• Dependency management (upstream, downstream)
• Resilience (retries)
• Alerting
• Service Level Agreement (SLA) timeout notifications
• Rich user interface
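Backfilling is easier to picture with a small example. The sketch below is a simplified model in plain Python, not Airflow's scheduler: it lists the logical dates of all intervals that have fully elapsed since a start date, on the assumption that a run becomes due only once its interval has ended. The function name and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def missed_runs(start_date, now, interval):
    """List the logical dates of every fully elapsed interval between start_date and now."""
    runs = []
    logical_date = start_date
    # A run for an interval is due only after that interval has ended.
    while logical_date + interval <= now:
        runs.append(logical_date)
        logical_date += interval
    return runs


# Daily schedule, enabled 3.5 days after the start date: three runs are owed.
runs = missed_runs(datetime(2023, 1, 1), datetime(2023, 1, 4, 12), timedelta(days=1))
for run in runs:
    print(run.isoformat())
```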

Slide 7

Slide 7 text

Graph View

Slide 8

Slide 8 text

Code View

Slide 9

Slide 9 text

Airflow API
Some of the important APIs:
- Connections
- Variables (mutable)
- XCom (inter-task communication)
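As a rough illustration of XCom, the sketch below shows two task callables that exchange a value through `xcom_push` and `xcom_pull`, the method names used on Airflow's task instance. To keep the example runnable without an Airflow installation, `InMemoryTaskInstance` is a hypothetical stand-in written for this sketch; the task names and the `row_count` key are also made up.

```python
class InMemoryTaskInstance:
    """Hypothetical stand-in for Airflow's task instance, for local experiments only."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the metadata database

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key="return_value"):
        return self._store.get((task_ids, key))


def count_rows(ti):
    """Upstream task: publish a small value for downstream tasks."""
    ti.xcom_push(key="row_count", value=3)


def report(ti):
    """Downstream task: read the value published by the 'count_rows' task."""
    row_count = ti.xcom_pull(task_ids="count_rows", key="row_count")
    return f"ingested {row_count} rows"


store = {}
count_rows(InMemoryTaskInstance("count_rows", store))
message = report(InMemoryTaskInstance("report", store))
print(message)
```

XCom values are meant to be small (identifiers, counts, paths); larger datasets are better written to storage such as S3 and referenced by location.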

Slide 10

Slide 10 text

Architecture
1. Scheduler
2. Executor
3. Webserver
4. DAGs folder
5. Metadata database

Slide 11

Slide 11 text

Workload
• Operators (pre-defined)
• Sensors
• Custom Python functions

Slide 12

Slide 12 text

Operators => Tasks

    # Simplified: required arguments such as task_id are omitted for brevity
    with DAG("my-dag") as dag:
        ping = SimpleHttpOperator(endpoint="http://example.com/update/")
        email = EmailOperator(to="[email protected]", subject="Update complete")
        ping >> email  # ping runs first, then email

Slide 13

Slide 13 text

Custom Python Functions

    import json

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    dag = DAG(
        dag_id="example_template_as_python_object",
        schedule_interval=None,
        start_date=days_ago(2),
        # Render templated fields as native Python objects instead of strings
        render_template_as_native_obj=True,
    )

    def extract():
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        return json.loads(data_string)

    def transform(order_data):
        print(type(order_data))
        total_order_value = 0
        for value in order_data.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, dag=dag
    )

    transform_task = PythonOperator(
        task_id="transform",
        # Pull the dict returned by the "extract" task from XCom
        op_kwargs={"order_data": "{{ti.xcom_pull('extract')}}"},
        python_callable=transform,
        dag=dag,
    )

    extract_task >> transform_task

Each function becomes a task, and tasks may run on different Airflow workers.

Slide 14

Slide 14 text

Operators
+67 packages as of August 2021

Slide 15

Slide 15 text

MWAA ENVIRONMENT SETUP

Slide 16

Slide 16 text

S3 Buckets

Slide 17

Slide 17 text

Networking

Slide 18

Slide 18 text

Networking

Slide 19

Slide 19 text

Environment Class

Slide 20

Slide 20 text

Permissions
IAM role that lets DAGs access other AWS services (EMR, S3, etc.)
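As an illustration of what such a role might grant, the sketch below builds a small IAM policy document as a Python dict and serializes it to JSON. The bucket name, account ID, region, and cluster ID are hypothetical placeholders, and the statement list is an assumption about what a DAG that submits EMR steps and reads or writes S3 would need. It is not a complete MWAA execution role, which also needs permissions for the environment itself.

```python
import json

# Hypothetical placeholders: substitute real bucket, region, account, and cluster values.
RAW_DATA_BUCKET = "example-raw-data-bucket"
EMR_CLUSTER_ARN = "arn:aws:elasticmapreduce:us-east-1:111122223333:cluster/j-EXAMPLE"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Submit steps to the EMR cluster and poll their status
            "Effect": "Allow",
            "Action": ["elasticmapreduce:AddJobFlowSteps", "elasticmapreduce:DescribeStep"],
            "Resource": EMR_CLUSTER_ARN,
        },
        {
            # Listing applies to the bucket itself
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{RAW_DATA_BUCKET}",
        },
        {
            # Reading and writing apply to the objects inside the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{RAW_DATA_BUCKET}/*",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```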

Slide 21

Slide 21 text


Slide 22

Slide 22 text

DEMO DAG

Slide 23

Slide 23 text

EMR-Hudi DAG
1. Task: ingest data into Hudi tables (S3 raw-data bucket)
2. Sensor: wait for ingestion to complete
3. Task: join data via an EMR job and store the result in the Hudi table "joined"
4. Sensor: wait for the join to complete
..
Let's jump to the actual code of this DAG.
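The "sensor" steps above amount to polling until the EMR step reaches a terminal state. The sketch below shows that poke loop in plain Python; it is not the code of the demo DAG or of any Airflow sensor. `wait_for_step` and its parameters are hypothetical names, and `get_state` stands in for whatever call returns the current step state (the state names follow the EMR step states such as PENDING, RUNNING, COMPLETED, and FAILED).

```python
import time

# EMR step states that mean the step will never complete successfully
TERMINAL_FAILURE_STATES = {"FAILED", "CANCELLED", "INTERRUPTED"}


def wait_for_step(get_state, poke_interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_state() until the step completes; raise on failure or timeout."""
    waited = 0.0
    while True:
        state = get_state()
        if state == "COMPLETED":
            return state
        if state in TERMINAL_FAILURE_STATES:
            raise RuntimeError(f"EMR step ended in state {state}")
        if waited >= timeout:
            raise TimeoutError(f"step still {state} after {waited} seconds")
        sleep(poke_interval)
        waited += poke_interval


# Simulated sequence of states instead of a real EMR call
states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
result = wait_for_step(lambda: next(states), poke_interval=0.0)
print(result)
```

Separating the task that submits work from the sensor that waits for it keeps each task small and lets each one be retried on its own.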

Slide 24

Slide 24 text

DAG Scheduling, Triggering

    from datetime import timedelta

    from airflow import DAG

    # args and user_defined_macros are defined elsewhere in the DAG file
    dag = DAG(
        "spark_emr_hudi",
        schedule_interval=None,  # or, for example: '0/10 * * * * *'
        dagrun_timeout=timedelta(minutes=60),
        default_args=args,
        user_defined_macros=user_defined_macros,
        max_active_runs=1,
        tags=["emr", "hudi"],
    )

Option 1: Trigger from the CLI (double quotes so that the shell expands the variables)

    $ airflow dags trigger --exec-date "$executionDate" "$dagName" -c "$conf"

Option 2: Trigger from UI
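Because the `-c`/`--conf` value is a JSON string, quoting is the usual stumbling block when triggering from a script. The sketch below assembles the same `airflow dags trigger` invocation as an argument list, assuming the Airflow 2.x flags shown on the slide. The helper name, the execution date, and the `cluster_id` key are hypothetical.

```python
import json
import shlex


def build_trigger_command(dag_name, execution_date, conf):
    """Build the argument list for `airflow dags trigger` with a JSON conf payload."""
    return [
        "airflow", "dags", "trigger",
        "--exec-date", execution_date,
        "--conf", json.dumps(conf),  # conf must be passed as a single JSON string
        dag_name,
    ]


args = build_trigger_command(
    "spark_emr_hudi",
    "2023-01-01T00:00:00",
    {"cluster_id": "j-EXAMPLE"},  # hypothetical conf key
)

# shlex.join produces a shell-safe command line with the JSON correctly quoted
print(shlex.join(args))
```

Passing the list to `subprocess.run(args)` avoids shell quoting altogether; the printed form is handy for logs or for pasting into a terminal.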

Slide 25

Slide 25 text

DAG Result

Slide 26

Slide 26 text

DAG Result

Slide 27

Slide 27 text

SCHEDULER ALTERNATIVES

Slide 28

Slide 28 text

Scheduler Options

AIRFLOW - MWAA
• Quite new, since 2020
• Big open-source community (annual Airflow conference)
• Lots of already implemented operators
• Driven by Python scripts
• Incremental loads via execution date and variables

GLUE WORKFLOWS
• Quite new feature, since 2019
• Tightly integrated with Glue
• AWS proprietary tool
• Driven by Python scripts plus
• Incremental load via Glue Bookmarks

OTHERS
• AWS Step Functions
  • Driven by Amazon States Language (JSON)
  • Hard to persist user state in a state machine
• Dagster
  • Driven by Python scripts and YAML configuration files
  • Similar concept to Airflow
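For comparison with Python-defined DAGs, the sketch below shows roughly what a two-step workflow looks like in Amazon States Language, built as a Python dict and serialized to JSON. The state names and Lambda ARNs are hypothetical, and the small checker is only an illustration of how transitions replace Airflow's `>>` dependencies, not a full validator.

```python
import json

# Hypothetical two-step state machine: ingest, then join
state_machine = {
    "Comment": "Ingest, then join",
    "StartAt": "Ingest",
    "States": {
        "Ingest": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ingest",
            "Next": "Join",
        },
        "Join": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:join",
            "End": True,
        },
    },
}


def dangling_transitions(definition):
    """Return transition targets (StartAt or Next) that name a state that does not exist."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [state["Next"] for state in states.values() if "Next" in state]
    return [target for target in targets if target not in states]


definition_json = json.dumps(state_machine, indent=2)
print(definition_json)
print(dangling_transitions(state_machine))
```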