Airflow - DAG e Dintorni

Airflow is a job orchestrator useful for running ETL tasks or custom MapReduce jobs. We will see how it works and how to configure it on AWS with an EMR cluster running on EC2 instances.

Francesco Marchitelli

December 22, 2022

Transcript

  1. AIRFLOW
    DAG E DINTORNI

  2. Ing. FRANCESCO MARCHITELLI
    https://francescomarchitelli.com

  3. Agenda
    • Data Pipeline
    • Data Architecture
    • DAG
    • Airflow
    • EMR
    • Let’s Code

  4. What is a data pipeline?
    A series of steps or actions that move and combine data from
    various sources for analysis or visualization.

  5. Orchestration vs. Workflow
    Orchestration: the planning or coordination of the elements of
    a situation to produce a desired effect.
    Workflow: the sequence of steps (tasks) involved in moving from
    the beginning to the end of a working process.

  6. Traditional Data Architecture
    Characteristics:
    • Schema-on-write
    • ETL
    • High-cost storage

  7. Modern Data Architecture
    Characteristics:
    • ELT
    • Distributed computing
    • Schema-on-read
    • Lower-cost storage

  8. Data Lake vs. Data Warehouse
    Data Lake:
    • Centralized repository
    • Raw format
    • Schemaless
    • Any scale
    Data Warehouse:
    • Centralized repository
    • Transformed
    • Single consistent schema
    • Minimal data growth

  9. Data Platform Architecture

  10. Airflow: a Python-based workflow management framework that
    automates scripts in order to perform tasks.
    It is extensible and provides good monitoring.

  11. DAG
    Directed Acyclic Graph

  12. DAG Overview

  13. DAG Run View

  14. Task Duration

  15. DAG as Code
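The slide shows a DAG defined as Python code. A minimal sketch of what that looks like (Airflow 2.x; the dag_id, task ids, and echo commands are illustrative, not taken from the slide):

```python
# Minimal DAG-as-code sketch; requires apache-airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",
    start_date=datetime(2022, 12, 1),
    schedule_interval="0 */2 * * *",  # cron expression: every 2 hours
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The >> operator draws the edge of the directed acyclic graph.
    extract >> load
```

Placing the file under `$AIRFLOW_HOME/dags` is enough for the scheduler to pick it up.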

  16. Airflow Operator Support
    • BashOperator
    • DockerOperator
    • EmailOperator
    • HiveOperator
    • HttpOperator
    • JdbcOperator
    • MssqlOperator
    • MysqlOperator
    • OracleOperator
    • PigOperator
    • PostgresOperator
    • SqliteOperator
    • BigQueryOperator
    • DatabricksOperator
    • EmrOperator
    • EcsOperator
    • JiraOperator
    • HipChatOperator
    • SqoopOperator
    • SshExecuteOperator
    • SlackOperator
    • VerticaOperator

  17. Other Features
    • Task pools: limit the number of concurrently running tasks
    • Variables: set shared variables (or secrets) via the UI or environment variables, then use them in DAGs
    • Service level agreements (SLAs): know when things did not run or took too long
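As a sketch of the Variables feature, a DAG file can read a shared value like this (the variable name "slack_token" is an assumed example; requires apache-airflow):

```python
# Sketch: reading an Airflow Variable inside a DAG file.
# The key "slack_token" is illustrative; set it via the UI,
# `airflow variables set`, or an AIRFLOW_VAR_SLACK_TOKEN env var.
from airflow.models import Variable

token = Variable.get("slack_token", default_var=None)
```

Because Variables are fetched from the metadata database, it is good practice to call `Variable.get` inside task code rather than at the top level of the DAG file.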

  18. Airflow Execution Schema

  19. Schedule Interval
    • schedule_interval="*/10 * * * *" – every 10 minutes
    • schedule_interval="0 */2 * * *" – every 2 hours
    • schedule_interval="0 */1 * * *" – every hour
    • schedule_interval="*/5 * * * *" – every 5 minutes

  20. AWS EMR Overview
    Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as
    Apache Hadoop and Apache Spark.
    Storage:
    • S3
    • HDFS
    • Local Disk
    Cluster Resource Management:
    • YARN
    Data Processing Framework:
    • MapReduce
    • Spark

  21. What does EMR do?
    The name EMR is an amalgamation of Elastic and MapReduce. Elastic refers
    to the Elastic Compute Cloud, better known as EC2.
    Apache MapReduce is both a programming paradigm and a set of Java SDKs,
    in particular these two Java classes:
    1. org.apache.hadoop.mapreduce.Mapper;
    2. org.apache.hadoop.mapreduce.Reducer;

  22. MapReduce
    The concept of Map is
    common to most
    programming languages.
    It means to run some
    function over a collection
    of data.
    Reduce means to count, sum,
    or otherwise aggregate that
    mapped data into a smaller
    result.
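The map and reduce ideas can be illustrated in plain Python with a word count, the classic MapReduce example (the word list here is an assumed sample, not from the slides):

```python
# Map/reduce illustration: word count in plain Python.
from functools import reduce

words = ["spark", "hadoop", "spark", "emr"]

# Map: emit a (word, 1) pair for every input word.
mapped = list(map(lambda w: (w, 1), words))

# Reduce: fold the pairs into per-word totals.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, mapped, {})
# counts == {"spark": 2, "hadoop": 1, "emr": 1}
```

Hadoop and Spark apply the same two phases, but distribute the map and reduce work across a cluster.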

  23. What does EMR do?
    The Mapper and Reducer classes run MapReduce operations and then
    optionally save the results to the Apache Hadoop Distributed File
    System (HDFS).
    MapReduce is a little old-fashioned, since Apache Spark does the same
    thing as that Hadoop-centric approach, but more efficiently. That’s
    probably why EMR offers both products.

  24. EMR Components

  25. EMR Operator and Sensors
    EmrCreateJobFlowOperator creates the EMR cluster (job flow).
    EmrStepSensor monitors a step on the cluster until it completes.
    EmrTerminateJobFlowOperator removes the cluster.
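Wired together, the three look roughly like this (a sketch, assuming the apache-airflow-providers-amazon package; JOB_FLOW_OVERRIDES is a placeholder for your own cluster spec, and the step_id is left elided because it normally comes from an EmrAddStepsOperator via XCom):

```python
# Sketch only: requires apache-airflow and apache-airflow-providers-amazon.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {"Name": "demo-cluster"}  # assumed: add instances, steps, etc.

with DAG("emr_demo", start_date=datetime(2022, 12, 1), schedule_interval=None) as dag:
    create = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    watch = EmrStepSensor(
        task_id="watch_step",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
        step_id="...",  # normally pulled via XCom from an EmrAddStepsOperator
    )
    remove = EmrTerminateJobFlowOperator(
        task_id="remove_cluster",
        job_flow_id="{{ ti.xcom_pull(task_ids='create_cluster') }}",
        trigger_rule="all_done",  # tear the cluster down even if the step failed
    )
    create >> watch >> remove
```

Terminating with trigger_rule="all_done" keeps a failed step from leaving an idle cluster running and billing.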

  26. How to install Airflow
    Airflow is easy to install. EMR takes more steps, which is one reason
    why you might want to use Airflow.
    Beyond the initial setup, however, Amazon makes EMR cluster creation
    easier the second time you use it by saving a script that you can run
    with the Amazon command line interface (CLI).

  27. Airflow Setup
    You basically source a Python environment (e.g., source
    py372/bin/activate, if using virtualenv) then run this to install Airflow,
    which is nothing more than a Python package:
    export AIRFLOW_HOME=~/airflow
    pip install apache-airflow
    airflow db init

  28. Airflow Setup
    Then you create a user.
    airflow users create \
    --username fmar \
    --firstname francesco \
    --lastname marchitelli \
    --role Admin \
    --email [email protected]
    Then you start the web server
    interface, using any available
    port.
    airflow webserver --port 7777
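One detail the slide leaves implicit: the webserver only serves the UI, so a scheduler process must also run for DAGs to actually be triggered.

```shell
# In a second terminal, with the same AIRFLOW_HOME and virtualenv active:
airflow scheduler
```

In production both processes are usually run as managed services rather than in foreground terminals.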

  29. LET’S CODE

  30. Ing. FRANCESCO MARCHITELLI
    https://francescomarchitelli.com
