AWS Airflow and EMR

Using Airflow and EMR for Data Lake Architecture

Alexey Novakov

August 02, 2021

Transcript

  1. AWS Airflow + EMR
    Airflow, Hudi, Spark, Glue Catalog, S3
    CONFIDENTIAL | © 2023 EPAM Systems, Inc.
    Alexey Novakov, Data Architect

  2. Agenda
    MWAA ENVIRONMENT SETUP
    DAG CODE OVERVIEW
    SCHEDULING ORDER + ITEM JOIN DAG
    SCHEDULER ALTERNATIVES
    AIRFLOW OVERVIEW
    Amazon Managed Workflows for Apache Airflow

  3. AIRFLOW

  4. AIRFLOW IS AN ORCHESTRATOR FOR COMPLEX WORKFLOWS AND DATA PIPELINES.

  5. Airflow: Quick Facts
    • Developed by Airbnb and open-sourced in 2015
    • Part of the Apache Software Foundation since 2016
    • Several Airflow SaaS providers, incl. AWS
    • An Airflow workflow is represented as a Directed Acyclic Graph (DAG)
    • Users design DAGs programmatically in Python (configuration as code)
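
    The DAG abstraction can be illustrated without Airflow itself. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+) to derive a valid run order from task dependencies and to reject a cyclic graph. The task names are hypothetical, and this is not how Airflow schedules tasks internally; it only shows why the graph must be acyclic.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "ingest": set(),
    "wait_for_ingest": {"ingest"},
    "join": {"wait_for_ingest"},
    "wait_for_join": {"join"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contains a cycle."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """True when the dependency graph can be ordered, i.e. it is a DAG."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
```

    A graph such as `{"a": {"b"}, "b": {"a"}}` has no valid order, which is why a workflow with circular dependencies cannot be expressed as a DAG.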

  6. Features
    • DAGs as Python Scripts
    • Monitoring (logs, status, execution time, etc.)
    • Scalable
    • Smart Scheduling (CRON, backfilling)
    • Dependency Management (upstream, downstream)
    • Resilience (retries)
    • Alerting
    • Service Level Agreement Timeout Notifications
    • Rich User Interface
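
    Backfilling can be pictured as enumerating the schedule intervals that have already ended between a start date and now. The following is a plain-Python sketch of that idea, not Airflow's scheduler code; the function name `missed_runs` and the dates are made up for illustration, and it assumes a run is created only once its interval is complete.

```python
from datetime import datetime, timedelta


def missed_runs(start, now, interval):
    """Return the start of every interval that has fully elapsed by `now`."""
    runs = []
    current = start
    # A run is counted only when its whole interval lies in the past.
    while current + interval <= now:
        runs.append(current)
        current += interval
    return runs


# Daily schedule starting 1 August, evaluated at noon on 4 August.
runs = missed_runs(datetime(2021, 8, 1), datetime(2021, 8, 4, 12), timedelta(days=1))
for run in runs:
    print(run.date())
```

    With these inputs the interval that starts on 4 August is not listed, because it has not finished yet.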

  7. Graph View

  8. Code View

  9. Airflow API
    Some of the important APIs:
    - Connections
    - Variables (mutable)
    - XCom (inter-task communication)

  10. Architecture
    1. Scheduler
    2. Executor
    3. Webserver
    4. DAGs folder
    5. Meta database

  11. Workload
    Operators (pre-defined)
    Sensors
    Custom Python Functions

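  12. Operators => Tasks
    from airflow import DAG
    from airflow.operators.email import EmailOperator
    from airflow.providers.http.operators.http import SimpleHttpOperator

    with DAG("my-dag") as dag:
        # task_id is mandatory; start_date is left out here for brevity
        ping = SimpleHttpOperator(task_id="ping", endpoint="http://example.com/update/")
        email = EmailOperator(task_id="email", to="[email protected]", subject="Update complete", html_content="Update complete")
        ping >> email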
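
    The `>>` syntax is ordinary Python operator overloading. The toy `Task` class below is a hypothetical stand-in, not Airflow's operator base class; it sketches how `__rshift__` can record an upstream/downstream relationship and return the right-hand task so that chains read left to right.

```python
class Task:
    """Toy task that only tracks its neighbours in the graph."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def __rshift__(self, other):
        # `self >> other` means: other runs after self.
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)
        return other  # returning the right-hand side enables a >> b >> c


ping = Task("ping")
email = Task("email")
archive = Task("archive")

ping >> email >> archive
print(email.upstream, email.downstream)
```

    Airflow's real implementation does more (lists of tasks, `<<`, cross-DAG checks), but the reading of `ping >> email` as "email depends on ping" is the same.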

  13. Custom Python Functions
    import json

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.dates import days_ago

    dag = DAG(
        dag_id="example_template_as_python_object",
        schedule_interval=None,
        start_date=days_ago(2),
        # Render templates to native Python objects instead of strings
        render_template_as_native_obj=True,
    )

    def extract():
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        return json.loads(data_string)

    def transform(order_data):
        print(type(order_data))
        total_order_value = 0  # running total of all order values
        for value in order_data.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract,
        dag=dag,
    )
    transform_task = PythonOperator(
        task_id="transform",
        # Pull the return value of the extract task from XCom
        op_kwargs={"order_data": "{{ti.xcom_pull('extract')}}"},
        python_callable=transform,
        dag=dag,
    )
    extract_task >> transform_task

    Functions are tasks that may run on different Airflow workers.
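
    Because the two functions may run on different workers, the value returned by `extract` cannot be handed over in memory; it travels through XCom. The sketch below models that hand-over in plain Python with a dictionary standing in for the XCom storage. The helpers `xcom_push` and `xcom_pull` here are simplified stand-ins written for this example, not Airflow's API, and the JSON round trip is an assumption that mirrors the usual restriction to serialisable values.

```python
import json

# Toy stand-in for the storage that backs XCom.
xcom_store = {}


def xcom_push(task_id, value, key="return_value"):
    # Serialise the value, as it has to leave the worker process.
    xcom_store[(task_id, key)] = json.dumps(value)


def xcom_pull(task_id, key="return_value"):
    return json.loads(xcom_store[(task_id, key)])


def extract():
    return {"1001": 301.27, "1002": 433.21, "1003": 502.22}


def transform(order_data):
    return {"total_order_value": sum(order_data.values())}


# "Worker 1" runs extract and publishes its return value.
xcom_push("extract", extract())
# "Worker 2" pulls that value and runs transform on it.
result = transform(xcom_pull("extract"))
print(result)
```

    The same two callables can also be called directly in a unit test, without a scheduler or a worker.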

  14. Operators
    +67 packages as of August 2021

  15. MWAA ENVIRONMENT SETUP

  16. S3 Buckets

  17. Networking

  18. Networking

  19. Environment Class

  20. Permissions
    IAM role that lets DAGs access other AWS services (EMR, S3, etc.)
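
    As a rough illustration of what such a role could allow, the sketch below builds an IAM policy document as a Python dictionary. The bucket name is hypothetical and the list of actions is an assumption chosen for an EMR plus S3 use case; it is not the policy used in the talk and is likely incomplete for a real environment.

```python
import json

# Hypothetical bucket name, used only for illustration.
RAW_DATA_BUCKET = "example-raw-data-bucket"


def build_dag_policy(bucket):
    """Return an IAM policy document (as a dict) for EMR and S3 access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Start clusters, submit steps and poll their status.
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:RunJobFlow",
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:DescribeStep",
                ],
                "Resource": "*",
            },
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing applies to the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = build_dag_policy(RAW_DATA_BUCKET)
print(json.dumps(policy, indent=2))
```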

  22. DEMO DAG

  23. EMR-Hudi DAG
    1. Task: ingest data into Hudi tables (S3 raw-data bucket)
    2. Sensor: wait for ingestion to complete
    3. Task: join data via an EMR job and store the result in the Hudi table "joined"
    4. Sensor: wait for the join to complete
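
    The task/sensor pairing above relies on polling: a sensor repeatedly checks a condition until it holds or a timeout expires. The sketch below models that loop in plain Python. The names `wait_for` and `step_completed` are invented for this example, the sequence of states is simulated, and this is not Airflow's sensor implementation.

```python
import time


def wait_for(poke, poke_interval=0.0, timeout=5.0, sleep=time.sleep):
    """Call poke() until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the timeout")
        sleep(poke_interval)


# Simulated sequence of states that a submitted step might report.
states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])


def step_completed():
    return next(states) == "COMPLETED"


done = wait_for(step_completed, poke_interval=0.01)
print(done)
```

    In the DAG each sensor plays the role of `wait_for`, so the downstream task starts only after the upstream EMR work has finished.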
    Let's jump to the actual code of this DAG.

  24. DAG Scheduling, Triggering
    from datetime import timedelta

    from airflow import DAG

    # args and user_defined_macros are defined elsewhere in the DAG file
    dag = DAG(
        "spark_emr_hudi",
        schedule_interval=None,  # or a cron expression, for example '*/10 * * * *'
        dagrun_timeout=timedelta(minutes=60),
        default_args=args,
        user_defined_macros=user_defined_macros,
        max_active_runs=1,
        tags=["emr", "hudi"],
    )

    Option 1: trigger from the CLI
    $ airflow dags trigger --exec-date "$executionDate" "$dagName" -c "$conf"
    Option 2: trigger from the UI
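
    The `-c` argument of the trigger command expects a JSON string, which is easy to get wrong when quoting by hand. The sketch below assembles the same command from Python; the helper name `build_trigger_command`, the date and the configuration keys are hypothetical, and only the command-line flags shown on the slide are used.

```python
import json
import shlex


def build_trigger_command(dag_name, execution_date, conf):
    """Return the argument list for `airflow dags trigger` with a JSON conf."""
    return [
        "airflow", "dags", "trigger",
        "--exec-date", execution_date,
        dag_name,
        "-c", json.dumps(conf),  # conf must be passed as a JSON string
    ]


cmd = build_trigger_command(
    "spark_emr_hudi",
    "2021-08-02T00:00:00",
    {"table": "orders"},  # hypothetical run configuration
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

    Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether; the joined string is only for display.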

  25. DAG Result

  26. DAG Result

  27. SCHEDULER ALTERNATIVES

  28. Scheduler Options
    AIRFLOW - MWAA
    • Quite new, since 2020
    • Big open-source community (Annual Airflow Conference)
    • Lots of already implemented operators
    • Driven by Python scripts
    • Incremental loads via execution date and variables
    GLUE WORKFLOWS
    • Quite new feature, since 2019
    • Tightly integrated with Glue
    • AWS proprietary tool
    • Driven by Python scripts plus
    • Incremental load via Glue Bookmarks
    OTHERS
    • AWS Step Functions
      • Driven by Amazon States Language (JSON)
      • Hard to persist user state in a State Machine
    • Dagster
      • Driven by Python scripts and YAML configuration files
      • Similar concept to Airflow
