Using Airflow and EMR for Data Lake Architecture
AWS Airflow + EMR
Airflow, Hudi, Spark, Glue Catalog, S3
Alexey Novakov, Data Architect
CONFIDENTIAL | © 2023 EPAM Systems, Inc.
Agenda
• Airflow overview
• MWAA environment setup
• DAG code overview
• Scheduling Order + Item join DAG
• Scheduler alternatives
Amazon Managed Workflows for Apache Airflow (MWAA)
AIRFLOW
Airflow is an orchestrator for complex workflows and data pipelines.
Airflow: Quick Facts
• Developed by Airbnb and open-sourced in 2015
• Part of the Apache Software Foundation since 2016
• Several Airflow SaaS providers, incl. AWS
• An Airflow workflow is represented as a Directed Acyclic Graph (DAG) abstraction
• Users design DAGs programmatically in Python (configuration as code)
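To make the DAG abstraction concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+), not Airflow itself. The task names are hypothetical. It shows how a set of dependencies determines a valid run order, and why the graph must be acyclic.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True when the dependency graph is not a DAG."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

A scheduler needs such an order to exist, which is the reason a workflow with a circular dependency cannot be run.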
Features
• DAGs as Python scripts
• Monitoring (logs, status, execution time, etc.)
• Scalable
• Smart scheduling (CRON, back-filling)
• Dependency management (upstream, downstream)
• Resilience (retries)
• Alerting
• Service Level Agreement timeout notifications
• Rich user interface
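The "Resilience (retries)" feature can be pictured with a small standard-library sketch. This is not Airflow's implementation. The parameter names `retries` and `retry_delay` mirror the arguments Airflow operators accept, and the `flaky` task is a hypothetical stand-in for a job with transient failures.

```python
import time


def run_with_retries(task, retries=3, retry_delay=0.0):
    """Call task(); on failure, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky():
    """Hypothetical task that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky, retries=3)
print(result, calls["count"])
```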
Graph View
Code View
Airflow API
Some of the important APIs:
- Connections
- Variables (mutable)
- XCom (inter-task communication)
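As a rough mental model of XCom, the toy class below keeps values in a dictionary keyed by task id. It is an assumption-laden illustration, not the Airflow API: the class `ToyXCom` is hypothetical, and real XComs are stored in the meta database. The default key `return_value` matches the key Airflow uses for a task's return value.

```python
class ToyXCom:
    """In-memory stand-in for inter-task communication (illustration only)."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, value, key="return_value"):
        # A task publishes a small value under its own task id.
        self._store[(task_id, key)] = value

    def pull(self, task_id, key="return_value"):
        # A downstream task reads the value by naming the upstream task.
        return self._store.get((task_id, key))


xcom = ToyXCom()
xcom.push("extract", {"1001": 301.27})
pulled = xcom.pull("extract")
missing = xcom.pull("transform")
print(pulled, missing)
```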
Architecture
1. Scheduler
2. Executor
3. Webserver
4. DAGs folder
5. Meta database
Workload
• Operators (pre-defined)
• Sensors
• Custom Python functions
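A sensor is essentially a task that polls a condition until it holds or a timeout expires. The sketch below shows that loop in plain Python. It is not Airflow code. The names `poke_interval` and `timeout` mirror the arguments Airflow sensors accept, and the sequence of job states is invented for the example.

```python
import time


def wait_for(poke, poke_interval=0.0, timeout=5.0):
    """Poll poke() until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before timeout")
        time.sleep(poke_interval)


# Hypothetical job states reported by an external system on successive polls.
states = iter(["PENDING", "RUNNING", "COMPLETED"])
finished = wait_for(lambda: next(states) == "COMPLETED")
print(finished)
```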
Operators => Tasks
    # task_id and other required operator arguments are omitted for brevity
    with DAG("my-dag") as dag:
        ping = SimpleHttpOperator(endpoint="http://example.com/update/")
        email = EmailOperator(to="[email protected]", subject="Update complete")
        ping >> email
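The `>>` in `ping >> email` is ordinary Python operator overloading. The toy class below is a hypothetical illustration of the idea, not Airflow's implementation: `__rshift__` records the right-hand task as downstream and returns it so that chains read left to right.

```python
class ToyTask:
    """Minimal task object that records upstream/downstream links."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []
        self.downstream = []

    def __rshift__(self, other):
        # self >> other: `other` runs after `self`.
        self.downstream.append(other)
        other.upstream.append(self)
        return other  # returning `other` enables a >> b >> c


ping = ToyTask("ping")
email = ToyTask("email")
done = ToyTask("done")
ping >> email >> done

print([t.task_id for t in ping.downstream])
print([t.task_id for t in email.upstream], [t.task_id for t in email.downstream])
```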
Custom Python Functions
    import json

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    dag = DAG(
        dag_id="example_template_as_python_object",
        schedule_interval=None,
        start_date=days_ago(2),
        render_template_as_native_obj=True,  # render templates to native Python objects
    )

    def extract():
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        return json.loads(data_string)

    def transform(order_data):
        print(type(order_data))
        total_order_value = 0  # initialise the accumulator before summing
        for value in order_data.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    transform_task = PythonOperator(
        task_id="transform",
        op_kwargs={"order_data": "{{ti.xcom_pull('extract')}}"},
        python_callable=transform,
        dag=dag,
    )
    extract_task >> transform_task
Functions are tasks that may run on different Airflow workers.
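The role of `render_template_as_native_obj=True` can be seen without Airflow. A template normally renders to text, so `transform` would receive a string rather than a dict. The sketch below imitates that with `str()` and recovers the dict with `ast.literal_eval`. This is an approximation of native rendering used here as an assumption for illustration, not the code path Airflow runs.

```python
import ast
import json


def extract():
    data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
    return json.loads(data_string)


def transform(order_data):
    total_order_value = 0
    for value in order_data.values():
        total_order_value += value
    return {"total_order_value": total_order_value}


order_data = extract()

# Without native rendering, the templated argument arrives as text.
rendered_as_text = str(order_data)

# With native rendering, the text is turned back into a Python object.
restored = ast.literal_eval(rendered_as_text)

result = transform(restored)
print(type(rendered_as_text).__name__, type(restored).__name__)
print(round(result["total_order_value"], 2))
```

Passing `rendered_as_text` straight to `transform` would fail, because a string has no `.values()` method.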
Operators
67+ packages as of August 2021
MWAA ENVIRONMENT SETUP
S3 Buckets
Networking
Networking (continued)
Environment Class
Permissions
IAM role that lets DAGs access other AWS services (EMR, S3, etc.)
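For orientation, the sketch below builds an IAM policy document of the kind such a role might carry. It is illustrative only: the bucket name is hypothetical, the action list is an assumption about what an EMR and S3 DAG typically needs, and it is neither complete nor least-privilege.

```python
import json

RAW_BUCKET = "example-raw-data-bucket"  # hypothetical bucket name


def build_policy(bucket):
    """Return an illustrative policy document allowing EMR step calls and S3 access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EmrSteps",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:DescribeStep",
                    "elasticmapreduce:DescribeCluster",
                ],
                "Resource": "*",
            },
            {
                "Sid": "DataBucket",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
        ],
    }


policy = build_policy(RAW_BUCKET)
print(json.dumps(policy, indent=2))
```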
DEMO DAG
EMR-Hudi DAG
1. Task: ingest data to Hudi tables (S3 raw-data bucket)
2. Sensor: wait for ingestion to complete
3. Task: join data via EMR job and store Hudi table "joined"
4. Sensor: wait for join to complete
Let's jump to the actual code of this DAG.
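The two task steps above are submitted to EMR as step definitions. The sketch below builds such definitions as plain dictionaries in the shape the EMR AddJobFlowSteps API accepts (`command-runner.jar` running `spark-submit`). The script locations and step names are hypothetical, and the demo DAG's real arguments are not shown on the slides.

```python
def spark_step(name, script_s3_uri, extra_args=()):
    """Build one EMR step that runs a Spark script through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                script_s3_uri,
                *extra_args,
            ],
        },
    }


# Hypothetical script locations for the ingest and join jobs.
ingest = spark_step("ingest", "s3://example-code-bucket/ingest.py")
join = spark_step(
    "join", "s3://example-code-bucket/join.py", ["--target-table", "joined"]
)
steps = [ingest, join]
print([s["Name"] for s in steps])
```

In a DAG, each step would be followed by a sensor that polls the step's state, matching items 2 and 4 of the list.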
DAG Scheduling, Triggering
    from datetime import timedelta

    from airflow import DAG

    # args and user_defined_macros are defined elsewhere in the DAG file
    dag = DAG(
        "spark_emr_hudi",
        schedule_interval=None,  # or, for example: '0/10 * * * * *'
        dagrun_timeout=timedelta(minutes=60),
        default_args=args,
        user_defined_macros=user_defined_macros,
        max_active_runs=1,
        tags=["emr", "hudi"],
    )
Option 1: trigger from the CLI
    $ airflow dags trigger --exec-date $executionDate $dagName -c "$conf"
Option 2: trigger from the UI
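Quoting the JSON passed with `-c` is easy to get wrong in a shell. The sketch below assembles the trigger command shown above as an argument list and lets `shlex` handle the quoting. The flags are the ones from the slide; the execution date and the `table` key in the conf payload are hypothetical values.

```python
import json
import shlex


def trigger_command(dag_name, execution_date, conf):
    """Build the `airflow dags trigger` argument list with conf serialized as JSON."""
    return [
        "airflow",
        "dags",
        "trigger",
        "--exec-date",
        execution_date,
        dag_name,
        "-c",
        json.dumps(conf),
    ]


conf = {"table": "orders"}  # hypothetical run configuration
cmd = trigger_command("spark_emr_hudi", "2023-01-01T00:00:00", conf)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` avoids shell quoting altogether.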
DAG Result
DAG Result (continued)
SCHEDULER ALTERNATIVES
Scheduler Options
AIRFLOW - MWAA
• Quite new, since 2020
• Big open-source community (annual Airflow conference)
• Lots of already implemented operators
• Driven by Python scripts
• Incremental loads via execution date and variables
GLUE WORKFLOWS
• Quite new feature, since 2019
• Tightly integrated with Glue
• AWS proprietary tool
• Driven by Python scripts plus …
• Incremental load via Glue Bookmarks
OTHERS
• AWS Step Functions
- Driven by Amazon States Language (JSON)
- Hard to persist user state in a state machine
• Dagster
- Driven by Python scripts and YAML configuration files
- Similar concept to Airflow
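"Incremental loads via execution date" means each run processes only the slice of data belonging to its own schedule interval. A minimal sketch, assuming a daily schedule and a hypothetical `dt=YYYY-MM-DD` partition layout in S3 (neither is stated in the slides):

```python
from datetime import datetime, timedelta


def load_window(execution_date, interval):
    """Return the [start, end) time window a run is responsible for."""
    return execution_date, execution_date + interval


def partition_prefix(bucket, table, day):
    """Build the S3 prefix for one daily partition (hypothetical layout)."""
    return f"s3://{bucket}/{table}/dt={day:%Y-%m-%d}/"


execution_date = datetime(2023, 1, 1)
start, end = load_window(execution_date, timedelta(days=1))
prefix = partition_prefix("example-raw-data-bucket", "orders", start)
print(start, end, prefix)
```

Because the window is derived from the run's date rather than the wall clock, re-running or back-filling an old run reads the same slice again.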