Airflow - DAGs and Beyond

Airflow is a job orchestrator useful for performing ETL tasks or custom MapReduce jobs. We will see how it works and how to configure it on AWS with an EMR cluster running on EC2 instances.

Francesco Marchitelli

December 22, 2022

Transcript

  1. AIRFLOW
    DAGS AND BEYOND


  2. Eng. FRANCESCO MARCHITELLI
    https://francescomarchitelli.com


  3. Agenda
    • Data Pipeline
    • Data Architecture
    • DAG
    • Airflow
    • EMR
    • Let’s Code


  4. What is a data pipeline?
    A series of steps or actions that move and combine data from
    various sources for analysis or visualization.


  5. Orchestration vs. Workflow
    Orchestration: the planning or coordination of the elements of a
    situation to produce a desired effect.
    Workflow: the sequence of steps (tasks) involved in moving from the
    beginning to the end of a working process.


  6. Traditional Data Architecture
    Characteristics:
    • Schema-on-write
    • ETL
    • High-cost storage


  7. Modern Data Architecture
    Characteristics:
    • ELT
    • Distributed computing
    • Schema-on-read
    • Lower-cost storage


  8. Data Lake vs. Data Warehouse
    Data Lake:
    • Centralized repository
    • Raw format
    • Schemaless
    • Any scale
    Data Warehouse:
    • Centralized repository
    • Transformed
    • Single consistent schema
    • Minimum data growth


  9. Data Platform Architecture


  10. Airflow
    A Python-based workflow management framework to automate scripts
    in order to perform tasks. It is extensible and provides good
    monitoring.


  11. DAG
    Directed Acyclic Graph


  12. DAG Overview


  13. DAG View


  14. DAG Run View


  15. Task Duration


  16. DAG as Code

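    A minimal sketch of a DAG defined in code, assuming Airflow 2.x and
    the built-in BashOperator; the dag_id and commands are illustrative:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_etl",  # illustrative name
        start_date=datetime(2022, 12, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> load  # dependency: extract runs before load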

  17. Airflow Operator Support
    • BashOperator
    • DockerOperator
    • EmailOperator
    • HiveOperator
    • HttpOperator
    • JdbcOperator
    • MssqlOperator
    • MysqlOperator
    • OracleOperator
    • PigOperator
    • PostgresOperator
    • SqliteOperator
    • BigQueryOperator
    • DatabricksOperator
    • EmrOperator
    • EcsOperator
    • JiraOperator
    • HipChatOperator
    • SqoopOperator
    • SshExecuteOperator
    • SlackOperator
    • VerticaOperator

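    For illustration, a sketch combining two operators from this list,
    assuming the postgres provider package is installed and SMTP is
    configured; the connection id, SQL, and addresses are illustrative:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.email import EmailOperator
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    with DAG(
        dag_id="operators_demo",
        start_date=datetime(2022, 12, 1),
        schedule_interval=None,
    ) as dag:
        refresh = PostgresOperator(
            task_id="refresh_table",
            postgres_conn_id="pg_default",  # assumed connection id
            sql="REFRESH MATERIALIZED VIEW sales_summary;",  # illustrative SQL
        )
        notify = EmailOperator(
            task_id="notify",
            to="team@example.com",  # illustrative address
            subject="Refresh complete",
            html_content="The sales_summary view was refreshed.",
        )
        refresh >> notify  # notify only after the refresh succeeds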

  18. Other Features
    • Task pools: limit the number of tasks running concurrently
    • Variables: set shared variables (or secrets) via the UI or environment variables, then use them in DAGs
    • Service level agreements (SLAs): know when things did not run or took too long

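    A brief sketch of these three features together, assuming Airflow
    2.x; the variable name, pool name, and SLA are illustrative, and the
    pool and Variable must first be created via the UI or CLI:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.bash import BashOperator

    # Variable: a shared value set via the UI or an environment variable
    bucket = Variable.get("target_bucket", default_var="my-default-bucket")

    with DAG(
        dag_id="features_demo",
        start_date=datetime(2022, 12, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        copy = BashOperator(
            task_id="copy_to_bucket",
            bash_command=f"echo copying to {bucket}",
            pool="etl_pool",         # task pool: caps concurrent tasks in this pool
            sla=timedelta(hours=1),  # SLA: flag the run if it takes over an hour
        )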

  19. Airflow Execution Schema


  20. Schedule Interval
    • schedule_interval="*/10 * * * *" – every 10 minutes
    • schedule_interval="0 */2 * * *" – every 2 hours
    • schedule_interval="0 */1 * * *" – every hour
    • schedule_interval="*/5 * * * *" – every 5 minutes

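    Besides cron strings, schedule_interval also accepts preset aliases;
    a minimal sketch (the dag_id is illustrative):

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="hourly_demo",
        start_date=datetime(2022, 12, 1),
        schedule_interval="@hourly",  # preset alias, equivalent to "0 * * * *"
        catchup=False,
    )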

  21. AWS EMR Overview
    Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as
    Apache Hadoop and Apache Spark.
    Storage:
    • S3
    • HDFS
    • Local Disk
    Cluster Resource Management:
    • YARN
    Data Processing Framework:
    • MapReduce
    • Spark


  22. What does EMR do?
    The name EMR is an amalgamation of Elastic and MapReduce. Elastic
    refers to the Elastic Compute Cloud, better known as EC2.
    Hadoop MapReduce is both a programming paradigm and a set of Java
    SDKs, in particular these two Java classes:
    1. org.apache.hadoop.mapreduce.Mapper;
    2. org.apache.hadoop.mapreduce.Reducer;


  23. MapReduce
    The concept of Map is common to most programming languages: it
    means to run some function over some collection of data.
    Reduce means to count, sum, or otherwise combine that mapped data
    into a smaller result.

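    A toy word count in plain Python makes the idea concrete; real
    MapReduce distributes these two phases across a cluster:

    from functools import reduce

    words = ["spark", "hadoop", "spark", "emr"]

    # Map phase: emit a (word, 1) pair for every word
    pairs = map(lambda w: (w, 1), words)

    # Reduce phase: sum the counts per word
    def combine(acc, pair):
        word, count = pair
        acc[word] = acc.get(word, 0) + count
        return acc

    counts = reduce(combine, pairs, {})
    print(counts)  # {'spark': 2, 'hadoop': 1, 'emr': 1}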

  24. What does EMR do?
    These classes run MapReduce operations and then optionally save the
    results to the Hadoop Distributed File System (HDFS).
    MapReduce is a little old-fashioned, since Apache Spark does the
    same thing as that Hadoop-centric approach in a more efficient way.
    That is probably why EMR offers both products.


  25. EMR Components


  26. EMR Operators and Sensors
    EmrCreateJobFlowOperator creates the job flow (the EMR cluster).
    EmrStepSensor monitors a step until it completes or fails.
    EmrTerminateJobFlowOperator removes the cluster.

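    A hedged sketch of how these fit together, following the usual
    pattern for the Amazon provider package; JOB_FLOW_OVERRIDES and
    SPARK_STEPS are placeholders you fill in with your own cluster
    configuration and step definitions:

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
        EmrTerminateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    JOB_FLOW_OVERRIDES = {"Name": "demo-cluster"}  # illustrative cluster config
    SPARK_STEPS = []  # illustrative list of EMR step definitions

    with DAG(
        dag_id="emr_demo",
        start_date=datetime(2022, 12, 1),
        schedule_interval=None,
    ) as dag:
        create = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            job_flow_overrides=JOB_FLOW_OVERRIDES,
        )
        add_steps = EmrAddStepsOperator(
            task_id="add_steps",
            job_flow_id=create.output,  # cluster id passed via XCom
            steps=SPARK_STEPS,
        )
        watch = EmrStepSensor(
            task_id="watch_step",
            job_flow_id=create.output,
            step_id="{{ ti.xcom_pull(task_ids='add_steps')[0] }}",
        )
        terminate = EmrTerminateJobFlowOperator(
            task_id="terminate_cluster",
            job_flow_id=create.output,
        )
        create >> add_steps >> watch >> terminate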

  27. How to install Airflow
    Airflow is easy to install. EMR takes more steps, which is one reason
    why you might want to use Airflow.
    Beyond the initial setup, however, Amazon makes EMR cluster creation
    easier the second time you use it by saving a script that you can
    run with the AWS command line interface (CLI).


  28. Airflow Setup
    You basically activate a Python environment (e.g., source
    py372/bin/activate if using virtualenv), then run the following to
    install Airflow, which is nothing more than a Python package:
    export AIRFLOW_HOME=~/airflow
    pip install apache-airflow
    airflow db init


  29. Airflow Setup
    Then you create a user.
    airflow users create \
    --username fmar \
    --firstname francesco \
    --lastname marchitelli \
    --role Admin \
    --email [email protected]
    Then you start the web server
    interface, using any available
    port.
    airflow webserver --port 7777
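
    For tasks to actually run, the scheduler must also be started (in a
    separate terminal):

    airflow scheduler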


  30. LET’S CODE


  31. Eng. FRANCESCO MARCHITELLI
    https://francescomarchitelli.com
