Airflow - DAG e Dintorni

Airflow is a job orchestrator useful for performing ETL tasks or custom MapReduce jobs. We will see how it works and how to configure it with AWS on EMR EC2 instances.

Francesco Marchitelli

December 22, 2022

Transcript

  1. What is a data pipeline? A series of steps or actions to move and combine data from various sources for analysis or visualization.
  2. Orchestration and Workflow. Orchestration: the planning or coordination of the elements of a situation to produce a desired effect. Workflow: the sequence of steps (tasks) involved in moving from the beginning to the end of a working process.
  3. Data Lake vs. Data Warehouse
    Data Lake: • Centralized repository • Raw format • Schemaless • Any scale
    Data Warehouse: • Centralized repository • Transformed • Single consistent schema • Minimum data growth
  4. Airflow is a Python-based workflow management framework to automate scripts in order to perform tasks. It is extensible and provides good monitoring.
  5. Airflow Operator Support • BashOperator • DockerOperator • EmailOperator • HiveOperator • HttpOperator • JdbcOperator • MssqlOperator • MysqlOperator • OracleOperator • PigOperator • PostgresOperator • SqliteOperator • BigQueryOperator • DatabricksOperator • EmrOperator • EcsOperator • JiraOperator • HipChatOperator • SqoopOperator • SshExecuteOperator • SlackOperator • VerticaOperator
    A minimal DAG built around one of these operators is sketched below.
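
    To make the operator idea concrete, here is a minimal DAG sketch using BashOperator; the DAG id, dates, and command are illustrative assumptions, not taken from the deck:

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.bash import BashOperator

      # Hypothetical example DAG; all names here are made up.
      with DAG(
          dag_id="example_bash_dag",
          start_date=datetime(2022, 12, 1),
          schedule_interval="@daily",
          catchup=False,
      ) as dag:
          hello = BashOperator(
              task_id="say_hello",
              bash_command="echo 'hello from Airflow'",
          )
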
  6. Other Features • Task pools: limit the number of tasks running at once • Variables: set shared variables (or secrets) via the UI or environment variables and use them in DAGs later (see the sketch below) • Service level agreements (SLAs): know when things did not run or took too long
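
    A minimal sketch of how Variables and a task-level SLA might look in practice; the variable name, DAG id, and timings are assumptions:

      from datetime import datetime, timedelta

      from airflow import DAG
      from airflow.models import Variable
      from airflow.operators.bash import BashOperator

      # "api_key" is a hypothetical Variable, set earlier via the UI or
      # the AIRFLOW_VAR_API_KEY environment variable.
      api_key = Variable.get("api_key", default_var="missing")

      with DAG(
          dag_id="example_variables_sla",
          start_date=datetime(2022, 12, 1),
          schedule_interval="@hourly",
          catchup=False,
      ) as dag:
          fetch = BashOperator(
              task_id="fetch_data",
              bash_command=f"echo 'using key {api_key}'",
              sla=timedelta(minutes=10),  # flag the task if it runs too long
          )
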
  7. Schedule Interval • schedule_interval="*/10 * * * *" – every 10 minutes • schedule_interval="0 */2 * * *" – every 2 hours • schedule_interval="0 */1 * * *" – every hour • schedule_interval="*/5 * * * *" – every 5 minutes
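
    The cron string goes straight into the DAG definition; the five fields are minute, hour, day of month, month, and day of week. A minimal sketch (DAG id and dates are assumptions):

      from datetime import datetime

      from airflow import DAG

      # "*/10 * * * *" means: at every 10th minute, every hour, every day.
      with DAG(
          dag_id="every_ten_minutes",
          start_date=datetime(2022, 12, 1),
          schedule_interval="*/10 * * * *",
          catchup=False,
      ) as dag:
          pass  # tasks would go here
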
  8. AWS EMR Overview
    Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark.
    Storage: • S3 • HDFS • Local Disk
    Cluster Resource Management: • YARN
    Data Processing Frameworks: • MapReduce • Spark
  9. What does EMR do? The name EMR is an amalgamation of Elastic and MapReduce. Elastic refers to Elastic Compute Cloud, better known as EC2. Apache MapReduce is both a programming paradigm and a set of Java SDKs, in particular these two Java classes: 1. org.apache.hadoop.mapreduce.Mapper; 2. org.apache.hadoop.mapreduce.Reducer.
  10. MapReduce. The concept of Map is common to most programming languages: it means to run some function over a collection of data. Reduce means to count, sum, or otherwise collapse that mapped data into a smaller result.
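
    As a toy illustration of the idea in plain Python (this is not the Hadoop Java API, and all names here are made up):

      from functools import reduce

      lines = ["to be or not to be", "to do is to be"]

      # Map: emit a (word, 1) pair for every word in every line.
      pairs = [(word, 1) for line in lines for word in line.split()]

      # Reduce: sum the counts per word into a smaller result.
      def combine(counts, pair):
          word, n = pair
          counts[word] = counts.get(word, 0) + n
          return counts

      word_counts = reduce(combine, pairs, {})
      print(word_counts)  # {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
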
  11. What does EMR do? These run MapReduce operations and then optionally save the results to the Apache Hadoop Distributed File System (HDFS). MapReduce is a little old-fashioned: Apache Spark does the same thing as that Hadoop-centric approach, but in a more efficient way. That is probably why EMR offers both products.
  12. EMR Operators and Sensors. EmrCreateJobFlowOperator creates the job flow (the EMR cluster). EmrStepSensor monitors a step as it runs; progress can also be followed on the EMR web page. EmrTerminateJobFlowOperator removes the cluster. A sketch wiring these together follows.
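
    A hedged sketch of how these pieces might be wired together, based on the Amazon provider package; the cluster config, step definition, S3 path, and the use of EmrAddStepsOperator (not named on the slide) are all assumptions:

      from datetime import datetime

      from airflow import DAG
      from airflow.providers.amazon.aws.operators.emr import (
          EmrAddStepsOperator,
          EmrCreateJobFlowOperator,
          EmrTerminateJobFlowOperator,
      )
      from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

      # Hypothetical cluster and step definitions; tune for your account.
      JOB_FLOW_OVERRIDES = {"Name": "demo-cluster", "ReleaseLabel": "emr-6.7.0"}
      SPARK_STEPS = [
          {
              "Name": "demo-step",
              "ActionOnFailure": "CONTINUE",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["spark-submit", "s3://my-bucket/job.py"],  # made-up path
              },
          }
      ]

      with DAG(
          dag_id="emr_job_flow",
          start_date=datetime(2022, 12, 1),
          schedule_interval=None,
          catchup=False,
      ) as dag:
          create = EmrCreateJobFlowOperator(
              task_id="create_cluster",
              job_flow_overrides=JOB_FLOW_OVERRIDES,
          )
          add_steps = EmrAddStepsOperator(
              task_id="add_steps",
              job_flow_id=create.output,
              steps=SPARK_STEPS,
          )
          watch = EmrStepSensor(
              task_id="watch_step",
              job_flow_id=create.output,
              step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
          )
          terminate = EmrTerminateJobFlowOperator(
              task_id="terminate_cluster",
              job_flow_id=create.output,
              trigger_rule="all_done",  # clean up even if the step fails
          )
          create >> add_steps >> watch >> terminate
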
  13. How to install Airflow. Airflow is easy to install. EMR takes more steps, which is one reason why you might want to use Airflow. Beyond the initial setup, however, Amazon makes EMR cluster creation easier the second time you use it by saving a script that you can run with the AWS command line interface (CLI), as sketched below.
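
    For reference, the saved script typically wraps a call to the AWS CLI along these lines; every value below is a placeholder, not from the deck:

      aws emr create-cluster \
        --name "demo-cluster" \
        --release-label emr-6.7.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles
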
  14. Airflow Setup. You basically source a Python environment (e.g., source py372/bin/activate if using virtualenv), then run the following to install Airflow, which is nothing more than a Python package:

      export AIRFLOW_HOME=~/airflow
      pip install apache-airflow
      airflow db init
  15. Airflow Setup. Then you create a user:

      airflow users create \
        --username fmar \
        --firstname francesco \
        --lastname marchitelli \
        --role Admin \
        --email [email protected]

    Then you start the web server interface, using any available port:

      airflow webserver --port 7777