
Building Data Pipeline with Apache Airflow (101)

Punsiri Boonyakiat
November 27, 2024


The session demos Apache Airflow, an open-source tool for orchestrating workflows. It allows you to create and manage complex data pipelines with scheduling and monitoring features. Airflow uses Directed Acyclic Graphs (DAGs) to define the sequence of tasks. Each task in a DAG represents a step in the workflow. This makes it ideal for automating ETL processes and data engineering tasks.


Transcript

  1. Agenda

    What is a data engineer? What is a data pipeline? What is a traditional pipeline, and why doesn't it work? What is Apache Airflow? Why Apache Airflow? Core components of Airflow. What is a DAG? What are tasks, operators, and dependencies? The user interface. Let's get hands-on.
  2. WHAT IS A DATA ENGINEER?
  3. WHAT IS A DATA PIPELINE?

    A data pipeline is a means of moving data from one place (the source) to a destination (such as a data warehouse). Along the way, data is transformed and optimized, arriving in a state that can be analyzed and used to develop business insights.
  4. ELEMENTS OF A DATA PIPELINE

    SOURCE → PROCESSING → DESTINATION
  5. ESSENTIAL ELEMENTS OF A DATA PIPELINE

    SOURCE → PROCESSING → DESTINATION
  6. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain); a data engineer runs Python/Bash scripts on a server]
  7. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] CEO: "Could I get a daily sales report for each product?" Marketing Team: "Could I see the report at 9 a.m. every day?"
  8. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain)]
  9. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] Pipeline scheduling: daily
  10. TRADITIONAL DATA PIPELINE

    5:00 Extract (SOURCE) → 7:00 Transform (PROCESSING) → 8:00 Load (DESTINATION) → CONSUME. Pipeline scheduling: daily
  11. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] The pipeline can fail at any stage!
  12. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] What if the connection to the source database fails?
  13. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] What if the data transformation fails?
  14. TRADITIONAL DATA PIPELINE

    [Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME] What if data is lost at the destination?
  15. TRADITIONAL DATA PIPELINE

    5:00 Extract (SOURCE) → 7:00 Transform (PROCESSING) → 8:00 Load (DESTINATION) → CONSUME. Pipeline scheduling: daily. "Today's data still hasn't arrived."
  16. ESSENTIAL ELEMENTS OF A DATA PIPELINE

    [Diagram: many overlapping SOURCE → PROCESSING → DESTINATION pipelines] Maintaining multiple data pipelines
  17. TRADITIONAL DATA PIPELINES

    Maintaining data pipeline versions
  18. THE PROBLEMS OF TRADITIONAL DATA PIPELINES

    Scalability of management – managing a large number of pipelines, both the scripts and the schedules that govern how often each pipeline runs. Scalability of processing – in terms of performance, what do you do when a task runs for a long time or fails, or when there are not enough resources to run it? How do you scale? Connecting to other systems – connecting to databases and services such as RDBMS, AWS, Hive, HDFS, etc., each with settings like host address, port, user ID, password, schema. How do you manage them all? Monitoring – how do you follow up when a run does not succeed? Re-running – how do you re-run only a specific step?
  19. LOOKING INTO DATA PROCESSING

    PROCESSING
  20. LOOKING INTO DATA SOURCES

    SOURCE
  21. WHAT IS APACHE AIRFLOW?

    • Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. • Created at Airbnb.
  22. WHAT APACHE AIRFLOW IS NOT

    https://k21academy.com/microsoft-azure/data-engineer/batch-processing-vs-stream-processing/ • Airflow was not built for infinitely running, event-based workflows; Airflow is not a streaming solution. • If you prefer clicking over coding, Airflow is probably not the right solution.
  23. WHAT IS APACHE AIRFLOW?

    Workflow as Code
  24. WHAT ARE DAGs? DATA PIPELINE AS A GRAPH

    Weather API → Fetch and Clean Data → Weather Dashboard
  25. WHAT ARE AIRFLOW'S CORE COMPONENTS?

    QUEUE, DAGS DIRECTORY, WEB SERVER, EXECUTOR, SCHEDULER, WORKERS, META DATABASE
  26. WHAT ARE AIRFLOW'S CORE COMPONENTS?

    DAGS DIRECTORY, WEB SERVER, META DATABASE, SCHEDULER, QUEUE, EXECUTOR, WORKERS. Users monitor DAG run results in the web server and write data pipelines as code: Python files in the DAGs directory define the Airflow DAGs. The metadata database keeps the DAG status and task status (passed/failed). The scheduler runs a heartbeat function to update "last_updated" and to run kill_zombies().
  27. WHAT ARE AIRFLOW'S KEY CONCEPTS?

    DAG (Directed Acyclic Graph) – the graphical representation of your data pipeline. Operator – describes a single task in your data pipeline. Task – an instance of an operator. Workflow – DAG + operators + tasks.
  28. WHAT IS A TASK?

    [Diagram: Weather API (task node) → Fetch and Clean Data (task dependency) → Weather Dashboard] DAG: Fetch weather forecast → Clean forecast data → Push data to dashboard
  29. WHAT ARE OPERATORS?

    Action operators: BashOperator, PythonOperator, DockerOperator, SparkSqlOperator, SparkSubmitOperator, HiveOperator, PostgresOperator, MySqlOperator, BigQueryOperator, EmailOperator, SlackOperator. Transfer operators: AzureBlobStorageToGCSOperator, GCSToGCSOperator, OracleToGCSOperator, GCSToSFTPOperator, SalesforceToGcsOperator, PostgresToGCSOperator, MSSQLToGCSOperator. Sensor operators: SqlSensor, S3KeySensor, DateTimeSensor, BashSensor. https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/transfer/index.html
  30. WHAT ARE OPERATORS?

    ACTION OPERATOR, SENSOR OPERATOR, TRANSFER OPERATOR. DAG: Fetch weather forecast (Operator 1) → Clean forecast data (Operator 2) → Push data to dashboard (Operator 3)
  31. WHAT ARE OPERATORS?

    ACTION OPERATOR, SENSOR OPERATOR, TRANSFER OPERATOR. DAG: Fetch weather forecast (Operator 1, e.g. MySqlOperator) → Clean forecast data (Operator 2) → Push data to dashboard (Operator 3)
  32. WHAT IS AIRFLOW SCHEDULING?

    • The scheduler runs a job one schedule interval AFTER the start date, at the END of the period. • Backfill: run the DAG for any interval that has not been run or has been cleared.
  33. WHAT ARE DAG CONNECTIONS?

    • Information such as hostnames, ports, logins, and passwords for other systems and services is handled in the 'Connections' section of the UI. • The pipeline code you author references the 'conn_id' of the Connection objects. • The information is saved in the database that Airflow manages; there is an option to encrypt passwords.
  34. WHAT IS AIRFLOW ALERTING?

    On events: retry/failure/success, timeout, SLAs. Using: email, SlackOperator, callbacks.