
AWS Airflow and EMR


Using Airflow and EMR for Data Lake Architecture

Alexey Novakov

August 02, 2021

Transcript

  1. AWS Airflow + EMR: Airflow, Hudi, Spark, Glue Catalog, S3

    Alexey Novakov, Data Architect
  2. Agenda

    • Airflow Overview
    • MWAA Environment Setup (MWAA = Amazon Managed Workflows for Apache Airflow)
    • DAG Code Overview
    • Scheduling
    • Order + Item Join DAG
    • Scheduler Alternatives
  3. AIRFLOW IS AN ORCHESTRATOR FOR COMPLEX WORKFLOWS AND DATA PIPELINES.
  4. Airflow: Quick Facts

    • Developed by Airbnb and open-sourced in 2015
    • In the Apache Software Foundation since 2016
    • Several Airflow SaaS providers, incl. AWS
    • An Airflow workflow is represented as a Directed Acyclic Graph (DAG) abstraction
    • Users design DAGs programmatically in Python (configuration as code)
  5. Features

    • DAGs as Python scripts
    • Monitoring (logs, status, execution time, etc.)
    • Scalable
    • Smart scheduling (CRON, backfilling)
    • Dependency management (upstream, downstream)
    • Resilience (retries; see the default_args sketch below)
    • Alerting
    • Service Level Agreement (SLA) timeout notifications
    • Rich user interface
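    Retries, alerting and SLA timeouts are typically configured through a DAG's
    default_args. A minimal sketch with hypothetical values, assuming Airflow 2.x:

    from datetime import timedelta

    default_args = {
        "owner": "data-team",                # hypothetical owner
        "retries": 3,                        # resilience: retry failed tasks
        "retry_delay": timedelta(minutes=5),
        "email": ["alerts@example.com"],     # hypothetical alert address
        "email_on_failure": True,            # alerting on task failure
        "sla": timedelta(hours=1),           # SLA timeout notification
    }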
  6. Airflow API

    Some of the important APIs (usage sketched below):
    • Connections
    • Variables (mutable)
    • XCom (inter-task communication)
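    A minimal usage sketch of Variables and XCom inside Python callables that
    would be wrapped in PythonOperator tasks (variable name and values are
    hypothetical; in Airflow 2.x the ti task-instance argument is injected from
    the task context):

    from airflow.models import Variable

    def push_row_count(ti):
        bucket = Variable.get("raw_data_bucket")  # Variables are mutable, managed via UI/CLI/API
        print(f"reading from {bucket}")
        ti.xcom_push(key="row_count", value=42)   # XCom: pass small values between tasks

    def read_row_count(ti):
        count = ti.xcom_pull(task_ids="push_row_count", key="row_count")
        print(count)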
  7. Architecture

    1. Scheduler
    2. Executor
    3. Webserver
    4. DAGs folder
    5. Meta database
  8. Workload

    • Operators (pre-defined)
    • Sensors
    • Custom Python functions
  9. Operators => Tasks

    from airflow import DAG
    from airflow.operators.email import EmailOperator
    from airflow.providers.http.operators.http import SimpleHttpOperator

    # A simple pipeline: HTTP ping task, then an email notification task
    with DAG("my-dag") as dag:
        ping = SimpleHttpOperator(task_id="ping", endpoint="http://example.com/update/")
        email = EmailOperator(task_id="email", to="[email protected]",
                              subject="Update complete", html_content="Update complete")
        ping >> email
  10. Custom Python Functions

    import json

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    dag = DAG(
        dag_id="example_template_as_python_object",
        schedule_interval=None,
        start_date=days_ago(2),
        render_template_as_native_obj=True,
    )

    def extract():
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        return json.loads(data_string)

    def transform(order_data):
        print(type(order_data))
        total_order_value = 0  # accumulator initialized before summing
        for value in order_data.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, dag=dag
    )
    transform_task = PythonOperator(
        task_id="transform",
        op_kwargs={"order_data": "{{ ti.xcom_pull('extract') }}"},
        python_callable=transform,
        dag=dag,
    )
    extract_task >> transform_task

    Functions become tasks that may run on different Airflow workers.
  11. Permissions

    IAM role used by DAGs to access other AWS services (EMR, S3, etc.); see the
    sketch below.
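    On MWAA, task code inherits the credentials of the environment's execution
    role, so plain boto3 calls need no explicit keys. A minimal sketch (bucket
    and prefix are hypothetical; the role must grant the matching S3 actions):

    import boto3

    def list_raw_files():
        # Credentials come from the MWAA execution role attached to the environment
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket="raw-data-bucket", Prefix="orders/")
        return [obj["Key"] for obj in resp.get("Contents", [])]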
  12. EMR-Hudi DAG

    1. Task: ingest data to Hudi tables (S3 raw-data bucket)
    2. Sensor: wait for ingestion to complete
    3. Task: join data via EMR job and store Hudi table “joined”
    4. Sensor: wait for join to complete
    ..
    Let’s jump to the actual code of this DAG
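    Not the deck's actual code, but a minimal sketch of that ingest/wait/join/wait
    pattern, assuming a recent apache-airflow-providers-amazon package (import
    paths vary by version), an already running EMR cluster whose id is kept in an
    Airflow Variable, and placeholder Spark step definitions (INGEST_SPARK_STEPS,
    JOIN_SPARK_STEPS); dag is the DAG object from the next slide:

    from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    ingest = EmrAddStepsOperator(
        task_id="ingest_to_hudi",
        job_flow_id="{{ var.value.emr_cluster_id }}",  # hypothetical Variable holding the cluster id
        steps=INGEST_SPARK_STEPS,                      # spark-submit steps writing Hudi tables to S3
        dag=dag,
    )
    wait_ingest = EmrStepSensor(
        task_id="wait_ingest",
        job_flow_id="{{ var.value.emr_cluster_id }}",
        # EmrAddStepsOperator pushes the submitted step ids to XCom
        step_id="{{ task_instance.xcom_pull('ingest_to_hudi')[0] }}",
        dag=dag,
    )
    join = EmrAddStepsOperator(
        task_id="join_order_item",
        job_flow_id="{{ var.value.emr_cluster_id }}",
        steps=JOIN_SPARK_STEPS,                        # EMR job joining orders and items into the "joined" Hudi table
        dag=dag,
    )
    wait_join = EmrStepSensor(
        task_id="wait_join",
        job_flow_id="{{ var.value.emr_cluster_id }}",
        step_id="{{ task_instance.xcom_pull('join_order_item')[0] }}",
        dag=dag,
    )
    ingest >> wait_ingest >> join >> wait_join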
  13. DAG Scheduling, Triggering

    dag = DAG(
        "spark_emr_hudi",
        schedule_interval=None,  # or, for example: '0/10 * * * * *'
        dagrun_timeout=timedelta(minutes=60),
        default_args=args,
        user_defined_macros=user_defined_macros,
        max_active_runs=1,
        tags=["emr", "hudi"],
    )

    Option 1: trigger from the CLI
    $ airflow dags trigger --exec-date $executionDate $dagName -c '$conf'

    Option 2: trigger from the UI
  14. Scheduler Options

    AIRFLOW - MWAA
    • Quite new, since 2020
    • Big open-source community (annual Airflow conference)
    • Lots of already implemented operators
    • Driven by Python scripts
    • Incremental loads via execution dates and Variables

    GLUE WORKFLOWS
    • Quite new feature, since 2019
    • Tightly integrated with Glue
    • AWS proprietary tool
    • Driven by Python scripts plus …
    • Incremental load via Glue Bookmarks

    OTHERS
    • AWS Step Functions
      - Driven by Amazon States Language (JSON)
      - Hard to persist user state in a State Machine
    • Dagster
      - Driven by Python scripts and YAML configuration files
      - Similar concept to Airflow