Airflow - DAG e Dintorni

Ing. FRANCESCO MARCHITELLI https://francescomarchitelli.com

Agenda • Data Pipeline • Data Architecture • DAG • Airflow • EMR • Let’s Code

What is a data pipeline? A series of steps or actions that move and combine data from various sources for analysis or visualization.
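To make the definition concrete, here is a minimal sketch of a pipeline that pulls from two sources, combines them, and produces a figure for analysis. The sources, field names, and values are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Hypothetical sources: a CSV export of orders and a JSON feed of customers.
ORDERS_CSV = "order_id,customer_id,amount\n1,10,25.0\n2,11,40.0\n3,10,15.5\n"
CUSTOMERS_JSON = '[{"customer_id": 10, "name": "Ada"}, {"customer_id": 11, "name": "Linus"}]'


def extract_orders(text):
    """Step 1a: parse the CSV source into typed records."""
    return [
        {
            "order_id": int(row["order_id"]),
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def extract_customers(text):
    """Step 1b: parse the JSON source into a customer_id -> name lookup."""
    return {c["customer_id"]: c["name"] for c in json.loads(text)}


def combine(orders, customers):
    """Step 2: join each order with the name of its customer."""
    return [{**order, "name": customers[order["customer_id"]]} for order in orders]


def total_per_customer(rows):
    """Step 3: aggregate the combined rows for analysis."""
    totals = {}
    for row in rows:
        totals[row["name"]] = totals.get(row["name"], 0.0) + row["amount"]
    return totals


totals = total_per_customer(
    combine(extract_orders(ORDERS_CSV), extract_customers(CUSTOMERS_JSON))
)
print(totals)
```

Each function is one step; an orchestrator such as Airflow would run steps like these as separate tasks in a defined order.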

Orchestration: the planning or coordination of the elements of a situation to produce a desired effect.
Workflow: the sequence of steps (tasks) involved in moving from the beginning to the end of a working process.

Traditional Data Architecture Characteristics: • Schema-on-write • ETL • High-cost storage

Modern Data Architecture Characteristics: • ELT • Distributed computing • Schema-on-read • Lower-cost storage
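The difference between schema-on-write and schema-on-read can be sketched in a few lines. This is an illustration under assumptions, not how any particular warehouse or lake is implemented: the schema, the in-memory lists standing in for storage, and all function names are hypothetical.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "amount": float}

warehouse = []  # stands in for schema-on-write storage
lake = []       # stands in for schema-on-read storage


def write_validated(record):
    """Schema-on-write: reject a record unless it matches the schema exactly."""
    same_fields = set(record) == set(SCHEMA)
    if not same_fields or not all(isinstance(record[k], t) for k, t in SCHEMA.items()):
        raise ValueError(f"record does not match schema: {record}")
    warehouse.append(record)


def write_raw(line):
    """Schema-on-read: store the raw text untouched, whatever it contains."""
    lake.append(line)


def read_with_schema(schema):
    """Schema-on-read: interpret the raw data only when it is queried."""
    rows = []
    for line in lake:
        record = json.loads(line)
        try:
            rows.append({k: t(record[k]) for k, t in schema.items()})
        except (KeyError, ValueError, TypeError):
            continue  # skip records that cannot be read under this schema
    return rows


write_validated({"user_id": 1, "amount": 9.5})
write_raw('{"user_id": "2", "amount": "3.25", "extra": "x"}')
write_raw('{"note": "no amount here"}')
rows = read_with_schema(SCHEMA)
print(rows)
```

The write path of the first store is strict and its read path is trivial; the second store accepts anything and pushes the interpretation cost to read time.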

Data Lake • Centralized repository • Raw format • Schemaless • Any scale
Data Warehouse • Centralized repository • Transformed • Single consistent schema • Minimum data growth

Data Platform Architecture

Airflow is a Python-based workflow management framework for automating the scripts that perform tasks. It is extensible and provides good monitoring.

DAG Directed Acyclic Graph
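As a small illustration of what "directed" and "acyclic" mean for a workflow, the sketch below models task dependencies with Python's standard `graphlib` module (Python 3.9+). It does not use Airflow itself, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# Making "extract" depend on "report" closes a loop, so no run order exists.
cyclic = dict(dependencies, extract={"report"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the edges are directed, every task knows what must finish before it starts; because the graph is acyclic, a scheduler can always find an order in which to run everything.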

DAG Overview

DAG View

DAG Run View

Task Duration

DAG as a code

Airflow Operator Support • BashOperator • DockerOperator • EmailOperator • HiveOperator • HttpOperator • JdbcOperator • MssqlOperator • MysqlOperator • OracleOperator • PigOperator • PostgresOperator • SqliteOperator • BigQueryOperator • DatabricksOperator • EmrOperator • EcsOperator • JiraOperator • HipChatOperator • SqoopOperator • SshExecuteOperator • SlackOperator • VerticaOperator

Other Features
• Task pools: limit the number of tasks running at the same time
• Variables: set shared variables (or secrets) via the UI or environment variables, then use them in DAGs later
• Service level agreements (SLA): know when things did not run or took too long
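The idea behind a task pool can be pictured with a semaphore that hands out a fixed number of slots. This is an analogy in plain Python threads, not Airflow's implementation; the pool size, task names, and sleep duration are assumptions chosen for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SLOTS = 2  # hypothetical pool size

slots = threading.Semaphore(POOL_SLOTS)
lock = threading.Lock()
stats = {"running": 0, "peak": 0}


def task(name):
    """Run only while holding one of the pool's slots."""
    with slots:
        with lock:
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
        time.sleep(0.05)  # stand-in for real work
        with lock:
            stats["running"] -= 1
    return name


# Six tasks are submitted at once, but at most POOL_SLOTS run concurrently.
names = [f"task_{i}" for i in range(6)]
with ThreadPoolExecutor(max_workers=6) as executor:
    results = list(executor.map(task, names))

print(stats["peak"])
```

All six tasks finish, but the peak concurrency never exceeds the number of slots, which is the guarantee a pool gives to a shared resource such as a database.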

Airflow Execution Schema

Schedule Interval
• schedule_interval="*/10 * * * *" - every 10 minutes
• schedule_interval="0 */2 * * *" - every 2 hours
• schedule_interval="0 */1 * * *" - every hour
• schedule_interval="*/5 * * * *" - every 5 minutes
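To see how these cron expressions translate into run times, the sketch below implements a deliberately simplified matcher. It is not Airflow's scheduler and not a full cron parser: it only understands `*`, `*/n`, and plain integers in each of the five fields, which is enough for the expressions above. The function names and the chosen time window are assumptions.

```python
from datetime import datetime, timedelta

# Smallest legal value of each field: minute, hour, day of month, month, day of week.
FIELD_MIN = (0, 0, 1, 1, 0)


def field_matches(expr, value, minimum):
    """Match one cron field; supports '*', '*/n', and a plain integer only."""
    if expr == "*":
        return True
    if expr.startswith("*/"):
        return (value - minimum) % int(expr[2:]) == 0
    return int(expr) == value


def cron_matches(cron, dt):
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    # Cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    values = (dt.minute, dt.hour, dt.day, dt.month, (dt.weekday() + 1) % 7)
    return all(field_matches(e, v, m) for e, v, m in zip(fields, values, FIELD_MIN))


def fire_times(cron, start, end):
    """Matching minutes in [start, end); start is assumed to be minute-aligned."""
    times = []
    current = start
    while current < end:
        if cron_matches(cron, current):
            times.append(current)
        current += timedelta(minutes=1)
    return times


start = datetime(2024, 1, 1, 0, 0)
end = start + timedelta(hours=4)
every_10_min = fire_times("*/10 * * * *", start, end)
every_2_hours = fire_times("0 */2 * * *", start, end)
print(len(every_10_min), [t.strftime("%H:%M") for t in every_2_hours])
```

Over a four-hour window the first expression matches six times per hour, while the second matches only at minute 0 of every even hour.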

AWS EMR Overview
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark.
Storage: • S3 • HDFS • Local Disk
Cluster Resource Management: • YARN
Data Processing Framework: • MapReduce • Spark

What does EMR do?
The name EMR is an amalgamation of Elastic and MapReduce. Elastic refers to Amazon Elastic Compute Cloud, better known as EC2. Apache MapReduce is both a programming paradigm and a set of Java SDKs, in particular these two Java classes:
1. org.apache.hadoop.mapreduce.Mapper
2. org.apache.hadoop.mapreduce.Reducer

MapReduce
The concept of Map is common to most programming languages. It means to run some function over some collection of data. Reduce means to count, sum, or otherwise aggregate the mapped data into a smaller result.
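The two ideas can be shown with Python's built-in `map` and `functools.reduce` on a word count, the usual introductory example. This is a single-process sketch of the paradigm, not Hadoop's distributed implementation, and the input lines are made up.

```python
from functools import reduce

lines = ["spark and hadoop", "hadoop and yarn"]


def mapper(line):
    """Map step: turn one line into a list of (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def reducer(counts, pair):
    """Reduce step: fold one (word, n) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


# Map: run the function over every element of the collection.
mapped = map(mapper, lines)

# Flatten the per-line lists into a single stream of pairs.
pairs = [pair for chunk in mapped for pair in chunk]

# Reduce: collapse the stream into one aggregated result.
word_counts = reduce(reducer, pairs, {})
print(word_counts)
```

In a real cluster the map calls run in parallel on different nodes and the pairs are grouped by key before reducing; the shape of the computation is the same.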

What does EMR do?
The Mapper and Reducer classes run MapReduce operations and then optionally save the results to the Apache Hadoop Distributed File System (HDFS). MapReduce is a little old-fashioned, since Apache Spark does the same job as that Hadoop-centric approach, but more efficiently. That’s probably why EMR offers both.

EMR Components

EMR Operator and Sensors
EmrCreateJobFlowOperator creates the job flow (the cluster). EmrStepSensor monitors a step by polling its state until it finishes. EmrTerminateJobFlowOperator terminates the cluster.
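One way to picture what a sensor does is a polling loop that checks a state until it becomes terminal. The sketch below is plain Python, not the Airflow or boto3 API: `wait_for_step`, `get_state`, the interval and timeout values, and the state names (modeled loosely on EMR step states) are all assumptions for illustration.

```python
import time

# Hypothetical terminal states, modeled loosely on EMR step states.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_step(get_state, poke_interval=0.01, timeout=1.0):
    """Call get_state() repeatedly until it returns a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"step still in state {state!r} after {timeout}s")
        time.sleep(poke_interval)


# Stand-in for a real status call: a step that finishes on the fourth check.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final_state = wait_for_step(lambda: next(fake_states))
print(final_state)
```

In a typical create, wait, terminate sequence, downstream tasks such as cluster termination would run only after a check like this returns.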

How to install Airflow Airflow is easy to install. EMR
takes more steps, which is one reason why you might want to use Airflow. Beyond the initial setup, however, Amazon makes EMR cluster creation easier the second time you use it by saving a script that you can run with the Amazon command line interface (CLI).

Airflow Setup
Activate a Python environment (e.g., source py372/bin/activate, if using virtualenv), then run the following to install Airflow, which is nothing more than a Python package:
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow db init

Airflow Setup
Then you create a user:
airflow users create \
    --username fmar \
    --firstname francesco \
    --lastname marchitelli \
    --role Admin \
    --email [email protected]
Then you start the web server interface, using any available port:
airflow webserver --port 7777

LET’S CODE
