Slide 1

Slide 1 text

Building a Data Pipeline with Apache Airflow
Speaker: Beat (บีท), Senior Data Engineer at CJ Express

Slide 2

Slide 2 text

Agenda
• What is a data engineer?
• What is a data pipeline?
• What is a traditional pipeline, and why doesn't it work?
• What is Apache Airflow?
• Why Apache Airflow?
• Core components of Airflow
• What is a DAG?
• What are tasks, operators, and dependencies?
• User interface
• Let's get hands-on

Slide 3

Slide 3 text

Books & Learnings

Slide 4

Slide 4 text

Certification

Slide 5

Slide 5 text

Who is today's session for?
Today covers the fundamentals, for anyone who wants to get to know the tools a Data Engineer works with.

Slide 6

Slide 6 text

What is a Data Engineer?

Slide 7

Slide 7 text

What is a Data Pipeline?
A data pipeline is a means of moving data from one place (the source) to a destination (such as a data warehouse). Along the way, data is transformed and optimized, arriving in a state that can be analyzed and used to develop business insights.
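A minimal sketch of this definition in Python: data is read from a source, transformed, and loaded into a destination. The records, field names, and the in-memory `warehouse` list are hypothetical stand-ins for a real source system and data warehouse.

```python
from typing import Dict, List

# Hypothetical raw records, standing in for rows read from a source system.
SOURCE_ROWS = [
    {"product": "water", "qty": "2", "price": "10.0"},
    {"product": "snack", "qty": "1", "price": "25.5"},
]


def extract() -> List[Dict[str, str]]:
    """Source: read raw records (here, from an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Processing: cast the text fields to numbers and derive a sales amount."""
    return [
        {"product": r["product"], "amount": int(r["qty"]) * float(r["price"])}
        for r in rows
    ]


def load(rows: List[Dict[str, object]], warehouse: list) -> int:
    """Destination: append the cleaned rows and report how many were loaded."""
    warehouse.extend(rows)
    return len(rows)


warehouse: list = []  # stand-in for a data warehouse table
loaded = load(transform(extract()), warehouse)
```

Keeping each stage as its own function makes it possible to run, test, or re-run one stage on its own.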

Slide 8

Slide 8 text

Elements of a Data Pipeline
SOURCE → PROCESSING → DESTINATION

Slide 9

Slide 9 text

Essential Elements of a Data Pipeline
SOURCE → PROCESSING → DESTINATION

Slide 10

Slide 10 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain). The Data Engineer runs Python and Bash scripts on a server.]

Slide 11

Slide 11 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]
Requests from consumers: "Could we get a sales report for our products every day?" "Could I see the report at 9 a.m. every day?"

Slide 12

Slide 12 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]

Slide 13

Slide 13 text

Traditional Cron Job
[Diagram: the Data Engineer schedules Python scripts.]

Slide 14

Slide 14 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]
Pipeline scheduling: daily

Slide 15

Slide 15 text

Traditional Data Pipeline
5:00 Extract (SOURCE) → 7:00 Transform (PROCESSING) → 8:00 Load (DESTINATION) → CONSUME (CEO, Marketing Team, Domain)
Pipeline scheduling: daily

Slide 16

Slide 16 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]
A pipeline can fail at any stage!

Slide 17

Slide 17 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]
What if the connection to the source database fails?

Slide 18

Slide 18 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain), with the Data Engineer running the pipeline.]
What if the data transformation fails?

Slide 19

Slide 19 text

Traditional Data Pipeline
[Diagram: SOURCE → PROCESSING → DESTINATION → CONSUME (CEO, Marketing Team, Domain).]
What if data is lost at the destination?

Slide 20

Slide 20 text

What other problems could occur?

Slide 21

Slide 21 text

Traditional Data Pipeline
5:00 Extract (SOURCE) → 7:00 Transform (PROCESSING) → 8:00 Load (DESTINATION) → CONSUME (CEO, Marketing Team, Domain)
Pipeline scheduling: daily
Today's data has not arrived yet.
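The scenario on this slide can be sketched in a few lines of Python. With cron-style fixed start times, each step starts at its clock time whether or not the step before it has finished. The step names and start times come from the slide; the durations, the date, and the `simulate` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Start times from the slide; each step is triggered by the clock alone.
SCHEDULE = {"extract": "05:00", "transform": "07:00", "load": "08:00"}
ORDER = ["extract", "transform", "load"]


def simulate(durations_min, day="2024-01-01"):
    """Return the steps that start before their upstream step has finished."""
    started_too_early = []
    previous_end = None
    for step in ORDER:
        start = datetime.strptime(f"{day} {SCHEDULE[step]}", "%Y-%m-%d %H:%M")
        if previous_end is not None and start < previous_end:
            started_too_early.append(step)
        previous_end = start + timedelta(minutes=durations_min[step])
    return started_too_early


# Normal day: every step finishes before the next one starts.
on_time = simulate({"extract": 60, "transform": 30, "load": 20})

# Slow source: extract takes 150 minutes, so transform starts on incomplete data.
late_source = simulate({"extract": 150, "transform": 30, "load": 20})
```

Here `on_time` is empty, while `late_source` flags `transform`. A dependency-aware scheduler avoids this by starting a step only after its upstream step has succeeded, rather than at a fixed clock time.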

Slide 22

Slide 22 text

Essential Elements of a Data Pipeline
[Diagram: many overlapping SOURCE → PROCESSING → DESTINATION pipelines.]
Maintaining multiple data pipelines

Slide 23

Slide 23 text

Traditional Data Pipelines
Maintaining data pipeline versions

Slide 24

Slide 24 text

The Problems of Traditional Data Pipelines
• Scalability of management: managing a large number of pipelines, both the scripts and the schedules that control how often each pipeline runs.
• Scalability of processing: in terms of performance, what do you do if a task takes a long time to run or fails? If the resources for a run are not enough, how do you scale?
• Connecting to other systems: connections to data stores such as RDBMS, AWS, Hive, HDFS, etc., each with settings such as host address, port, ID, password, schema, etc. How do you manage them?
• Monitoring: how do you track what happened when a run fails?
• Re-running: how do you re-run one specific step?

Slide 25

Slide 25 text

A Look into the Data Source

Slide 26

Slide 26 text

A Look into Data Processing

Slide 27

Slide 27 text

A Look into the Data Source

Slide 28

Slide 28 text

Let's use Apache Airflow

Slide 29

Slide 29 text

What is Apache Airflow?
• Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
• Created by Airbnb.

Slide 30

Slide 30 text

What is Apache Airflow not?
• Airflow was not built for infinitely running event-based workflows. Airflow is not a streaming solution.
• If you prefer clicking over coding, Airflow is probably not the right solution.
Image source: https://k21academy.com/microsoft-azure/data-engineer/batch-processing-vs-stream-processing/

Slide 31

Slide 31 text

What is Apache Airflow?

Slide 32

Slide 32 text

What is Apache Airflow?
Workflow as Code

Slide 33

Slide 33 text

What is Apache Airflow?

Slide 34

Slide 34 text

What is Apache Airflow?

Slide 35

Slide 35 text

What is Apache Airflow?

Slide 36

Slide 36 text

What is Apache Airflow?

Slide 37

Slide 37 text

What are DAGs?
Data pipeline as a graph: Weather API → Fetch and Clean Data → Weather Dashboard

Slide 38

Slide 38 text

What are DAGs?

Slide 39

Slide 39 text

Apache Airflow: Pipeline as a DAG

Slide 40

Slide 40 text

What are the Airflow Core Components?
• DAGs directory
• Web server
• Scheduler
• Executor
• Queue
• Workers
• Meta database

Slide 41

Slide 41 text

What are the Airflow Core Components?
• User: writes the data pipeline as code, in Python, as Airflow DAGs, and monitors DAG run results.
• DAGs directory: holds the DAG files, written in Python.
• Web server: where the user monitors DAG run results.
• Meta database: keeps DAG status and task status (passed/failed).
• Scheduler: runs a heartbeat function to update "Last_updated" and to run kill_zombies().
• Queue, executor, and workers.
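A toy sketch of how these components hand work to one another, using only the Python standard library. It is not Airflow code: the `metadata_db` dict, the `scheduler` and `worker` functions, and the task names are hypothetical stand-ins for the meta database, scheduler, queue, and workers.

```python
from queue import Queue

metadata_db = {}      # stand-in for the meta database: task name -> status
task_queue = Queue()  # stand-in for the queue between scheduler and workers


def scheduler(tasks):
    """Record each task as queued and push it onto the queue."""
    for name, fn in tasks:
        metadata_db[name] = "queued"
        task_queue.put((name, fn))


def worker():
    """Pull tasks off the queue, run them, and record passed/failed."""
    while not task_queue.empty():
        name, fn = task_queue.get()
        try:
            fn()
            metadata_db[name] = "passed"
        except Exception:
            metadata_db[name] = "failed"


def ok():
    return None


def broken():
    raise RuntimeError("source unavailable")


scheduler([("extract", ok), ("transform", broken)])
worker()
```

After the run, `metadata_db` holds the status a monitoring page would read, which is the role the slide assigns to the meta database and web server.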

Slide 42

Slide 42 text

What are the Airflow Key Concepts?
• DAG: directed acyclic graph, the graphical representation of your data pipeline.
• Operator: describes a single task in your data pipeline.
• Task: an instance of an operator.
• Workflow: DAG + Operator + Task.
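The operator/task distinction can be illustrated with plain Python classes. This is a sketch of the idea only: `Operator` and `EchoOperator` here are hypothetical classes, not Airflow's own.

```python
class Operator:
    """Template for one kind of work; subclasses define what execute() does."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError


class EchoOperator(Operator):
    """Hypothetical operator that just reports a message."""

    def __init__(self, task_id, message):
        super().__init__(task_id)
        self.message = message

    def execute(self):
        return f"{self.task_id}: {self.message}"


# Two tasks: both are instances of the same operator class,
# configured with different parameters.
fetch = EchoOperator("fetch", "fetching")
clean = EchoOperator("clean", "cleaning")
results = [task.execute() for task in (fetch, clean)]
```

The operator is the reusable definition; each task is one configured instance of it placed in a DAG.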

Slide 43

Slide 43 text

What is a Task?
Pipeline: Weather API → Fetch and Clean Data → Weather Dashboard
DAG: Fetch weather forecast → Clean forecast data → Push data to dashboard
[Diagram labels: task node (each step), task dependency (each arrow).]
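The task nodes and dependencies on this slide can be written down as a graph and ordered with the standard library's `graphlib`. This is a sketch of the DAG idea, not of Airflow's API; the snake_case task names are adapted from the slide.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
dag = {
    "fetch_weather_forecast": set(),
    "clean_forecast_data": {"fetch_weather_forecast"},
    "push_data_to_dashboard": {"clean_forecast_data"},
}

# A valid execution order: every task comes after its dependencies.
run_order = list(TopologicalSorter(dag).static_order())

# "Acyclic" matters: with a cycle there is no valid order to run the tasks in.
cyclic = dict(dag, fetch_weather_forecast={"push_data_to_dashboard"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
```

Because the graph has no cycles, a scheduler can always work out which task is safe to run next.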

Slide 44

Slide 44 text

What is a Task State?

Slide 45

Slide 45 text

What are Operators?

Slide 46

Slide 46 text

What are Operators?
Action operators: BashOperator, PythonOperator, DockerOperator, SparkSqlOperator, SparkSubmitOperator, HiveOperator, PostgresOperator, MySqlOperator, BigQueryOperator, EmailOperator, SlackOperator
Transfer operators: AzureBlobStorageToGCSOperator, GCSToGCSOperator, OracleToGCSOperator, GCSToSFTPOperator, SalesforceToGcsOperator, PostgresToGCSOperator, MSSQLToGCSOperator
Sensor operators: SqlSensor, S3KeySensor, DateTimeSensor, BashSensor
https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/transfer/index.html
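Of the three operator types, sensors are the least self-explanatory: a sensor waits until some condition becomes true before downstream tasks run. A minimal polling sketch in plain Python follows. The names `poke`, `poke_interval`, and `timeout` echo the terms Airflow uses for sensors, but the `wait_for` helper and the file-arrival condition are hypothetical.

```python
import time


def wait_for(poke, poke_interval=0.01, timeout=1.0):
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical condition: the "file" shows up on the third check.
checks = {"count": 0}


def file_has_arrived():
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(file_has_arrived)
never = wait_for(lambda: False, timeout=0.05)  # gives up after the timeout
```

This is one answer to the "today's data has not arrived yet" problem: the pipeline waits for the data instead of starting at a fixed time.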

Slide 47

Slide 47 text

What are Operators?
Operator types: action, sensor, transfer.
DAG: Fetch weather forecast (Operator 1) → Clean forecast data (Operator 2) → Push data to dashboard (Operator 3)

Slide 48

Slide 48 text

What are Operators?
Operator types: action, sensor, transfer.
DAG: Fetch weather forecast (Operator 1) → Clean forecast data (Operator 2) → Push data to dashboard (Operator 3)
Example for Operator 1: MySqlOperator

Slide 49

Slide 49 text

What is Airflow Scheduling?
• The scheduler runs a job one schedule interval AFTER the start date, at the END of the period.
• Backfill: run the DAG for any interval that has not been run or has been cleared.
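The "one interval after the start date, at the end of the period" rule is easy to get wrong, so here is a small worked sketch using only `datetime`. The `run_times` helper, the dates, and the `already_run` set are hypothetical; the sketch models the rule as stated on the slide, not Airflow's internals.

```python
from datetime import datetime, timedelta


def run_times(start_date, interval, now):
    """Return (period_start, triggered_at) for every period fully ended by `now`."""
    runs = []
    period_start = start_date
    while period_start + interval <= now:
        # The run covering [period_start, period_start + interval) is
        # triggered only once that period is over.
        runs.append((period_start, period_start + interval))
        period_start += interval
    return runs


start = datetime(2024, 1, 1)
daily = timedelta(days=1)

# Midday on the start date: the first period has not ended, so nothing runs yet.
first_day = run_times(start, daily, now=datetime(2024, 1, 1, 12))

# Three full days later: three periods have ended, so three runs are due.
after_three_days = run_times(start, daily, now=datetime(2024, 1, 4))

# Backfill: periods that are due but have no recorded run.
already_run = {datetime(2024, 1, 1)}
to_backfill = [p for p, _ in after_three_days if p not in already_run]
```

With a daily interval and a start date of 1 January, the first run is triggered at the start of 2 January and processes the data for 1 January.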

Slide 50

Slide 50 text

What is a DAG Connection?
• Information such as hostname, port, login, and passwords for other systems and services is handled in the 'Connection' section of the UI.
• The pipeline code you author references the 'conn_id' of the Connection objects.
• The information is saved in the database that Airflow manages, and there is an option to encrypt passwords.
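The point of referencing a `conn_id` is that pipeline code never hard-codes hosts or credentials. A minimal sketch of that indirection in plain Python follows. The `Connection` dataclass, the `CONNECTIONS` registry, and all values in it are hypothetical; in Airflow the store is the managed database behind the UI, not a dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    conn_id: str
    host: str
    port: int
    login: str
    password: str


# Hypothetical central store, standing in for the database Airflow manages.
CONNECTIONS = {
    "sales_db": Connection(
        "sales_db", "db.example.internal", 5432, "etl_user", "not-a-real-password"
    ),
}


def get_connection(conn_id):
    """Look up connection details by id; pipeline code only knows the id."""
    return CONNECTIONS[conn_id]


def pipeline_target(conn_id, schema):
    """Pipeline code: builds its target address from the referenced connection."""
    conn = get_connection(conn_id)
    return f"{conn.host}:{conn.port}/{schema}"


target = pipeline_target("sales_db", "reports")
```

Changing a host or rotating a password then means editing one connection entry rather than every pipeline script.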

Slide 51

Slide 51 text

What is Airflow Alerting?
On event:
- Retry/failure/success
- Timeout
- SLAs
Using:
- Email
- SlackOperator
- Callback
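The retry/failure/success events can be sketched as callbacks around a task run. This is an illustration in plain Python, not Airflow's implementation: `run_with_alerts` and the flaky task are hypothetical, and the callbacks only append to a list where a real setup would send an email or Slack message.

```python
def run_with_alerts(task, retries, on_retry, on_failure, on_success):
    """Run task(), retrying on error, and fire a callback for each event."""
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            result = task()
        except Exception as exc:
            if attempt < attempts:
                on_retry(attempt, exc)    # more attempts left
            else:
                on_failure(attempt, exc)  # retries exhausted
                return None
        else:
            on_success(attempt)
            return result


events = []
calls = {"n": 0}


def flaky():
    """Hypothetical task that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source database unreachable")
    return "loaded"


result = run_with_alerts(
    flaky,
    retries=2,
    on_retry=lambda n, exc: events.append(f"retry after attempt {n}"),
    on_failure=lambda n, exc: events.append("failure"),
    on_success=lambda n: events.append(f"success on attempt {n}"),
)
```

Separating the event from the notification channel is what lets the same failure trigger an email, a Slack message, or any custom callback.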

Slide 53

Slide 53 text

Q & A

Slide 54

Slide 54 text

Let's Demo