Slide 1

Slide 1 text

DATA ENGINEERING BY SURASIT LIANGPORNRATTANA

Slide 2

Slide 2 text

WHAT IS DATA ENGINEERING?

Slide 3

Slide 3 text

WHY?

Slide 4

Slide 4 text

WHERE?

Slide 5

Slide 5 text

WHERE?

Slide 6

Slide 6 text

HOW? PIPELINE

Slide 7

Slide 7 text

WHAT IS DATA ENGINEERING?

Slide 8

Slide 8 text

DATA PIPELINE: INGESTION, STORAGE, PROCESSING

Slide 9

Slide 9 text

PRELUDE: Kubernetes

Slide 10

Slide 10 text

USE CASE: PAGES? WHO? INTEREST? EVENTS? MORE...

Slide 11

Slide 11 text

DATA FLOW: RAW DATA, HDFS, VIEW

Slide 12

Slide 12 text

DATA FLOW: RAW DATA, STORAGE, PROTOBUF → HDFS, VIEW

Slide 13

Slide 13 text

DATA FLOW: RAW DATA, STORAGE, HDFS, ENCODE, VIEW

Slide 14

Slide 14 text

DATA FLOW: JSON, RAW DATA, BUFFER, STORAGE, HDFS, ENCODE, VIEW

Slide 15

Slide 15 text

ARCHITECTURE

Slide 16

Slide 16 text

LOGGING - FOR STREAMING DATA PROCESSING

Slide 17

Slide 17 text

CENTRALIZED LOGGING
FLUENTD
- INPUT
- FILTER

Slide 18

Slide 18 text

FLUENTD
- ACCESS LOG: IP - - TIMESTAMP REQUEST USER-AGENT
- FILTER: TRANSFORM, FILTER, ENRICH
- JSON: IP: ..., TIMESTAMP: ..., REQUEST: ..., USER-AGENT: ...
- BUFFER (CHUNKS): PERFORMANCE, RELIABILITY, THREAD SAFETY
- OUTPUT: WRITE OR SEND LOGS, SYNC OR ASYNC
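The Fluentd slide describes turning an access-log line (IP, timestamp, request, user agent) into a JSON record that is then filtered and enriched. A minimal Python sketch of that idea follows. It is not Fluentd configuration; the log pattern, the `is_bot` field, and the health-check rule are assumptions made for illustration.

```python
import json
import re

# Assumed pattern for an Apache/Nginx "combined" access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Input stage: turn a raw line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def enrich(record):
    """Filter stage: drop health checks (hypothetical rule) and add a derived field."""
    if record["request"].startswith("GET /health"):
        return None
    record["is_bot"] = "bot" in record["user_agent"].lower()
    return record


def to_json(line):
    """Run parse -> filter/enrich and serialize the surviving record as JSON."""
    record = parse_line(line)
    if record is None:
        return None
    record = enrich(record)
    return json.dumps(record, sort_keys=True) if record else None


sample = (
    '203.0.113.7 - - [10/Oct/2018:13:55:36 +0700] '
    '"GET /pages/42 HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)
print(to_json(sample))
```

Lines that do not match the pattern, or that the filter rejects, come back as `None`, so a caller can skip them without raising.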

Slide 19

Slide 19 text

BUFFER: Kafka
- HIGH THROUGHPUT
- REPLAY
- CENTRALIZE
- FAULT TOLERANCE
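The replay property listed on the buffer slide comes from keeping records in an append-only log and letting each consumer choose the offset it reads from. The toy class below sketches that idea in memory. It stands in for a single partition and is not the Kafka client API; `ToyLog` and its methods are hypothetical names.

```python
class ToyLog:
    """In-memory append-only log; a stand-in for one partition, not Kafka itself."""

    def __init__(self):
        self._records = []

    def append(self, value: bytes) -> int:
        """Store a record and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Yield (offset, value) pairs starting at the given offset."""
        for i in range(offset, len(self._records)):
            yield i, self._records[i]


log = ToyLog()
for event in (b'{"page": 1}', b'{"page": 2}', b'{"page": 3}'):
    log.append(event)

# A consumer that stopped after processing offset 0 resumes at offset 1.
resumed = [value for _, value in log.read_from(1)]

# A new consumer, or a reprocessing job, replays everything from offset 0.
replayed = [value for _, value in log.read_from(0)]
print(len(resumed), len(replayed))
```

Because reading never removes records, several consumers can read the same log independently, which is what lets it act as a central buffer.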

Slide 20

Slide 20 text

KAFKA: BYTES OF SERIALIZED JSON, REPLAY

Slide 21

Slide 21 text

PROTOCOLS & SERIALIZATION
LOGSTASH
- INPUT: KAFKA
- FILTER
- OUTPUT: PROTOBUF

Slide 22

Slide 22 text

LOGSTASH
- INPUT: BYTES OF SERIALIZED JSON
- BUFFER (PAGE ← HEAD, PAGE ← TAIL, PAGE ← TAIL): PERFORMANCE, RELIABILITY, THREAD SAFETY
- FILTER: TRANSFORM, FILTER, ENRICH
- OUTPUT: CODEC (PROTOBUF), WRITE OR SEND LOGS

Slide 23

Slide 23 text

PROTOBUF
- SMALL → FAST
- SIMPLE, KEY-VALUE
- STRUCTURED DATA
- SUPPORTS MANY LANGUAGES
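To make the "small → fast" point concrete, the sketch below hand-encodes one record in the Protocol Buffers wire format (varint tags and values, length-delimited strings) and compares its size with the same record as JSON. It is an illustration only: real projects would use classes generated by `protoc`, and the field numbers and the `user_id` / `page` fields are assumptions.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Wire type 0 (varint): tag followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Wire type 2 (length-delimited): tag, byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical event: field 1 = user_id (int), field 2 = page (string).
event = {"user_id": 150, "page": "/home"}
proto_bytes = encode_int_field(1, event["user_id"]) + encode_string_field(2, event["page"])
json_bytes = json.dumps(event).encode("utf-8")
print(len(proto_bytes), len(json_bytes))
```

For this record the binary form takes 10 bytes against 33 for JSON, because field names are replaced by numeric tags that both sides agree on through the schema.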

Slide 24

Slide 24 text

SCHEDULING JOBS - FOR BATCH DATA PROCESSING
Airflow
- WORKFLOW
- SCHEDULER
- MONITORING

Slide 25

Slide 25 text

AIRFLOW

Slide 26

Slide 26 text

TASKS (diagram): TASK 3, TASK 3.2, TASK 3.3, TASK 4; Spark
- IDEMPOTENT
- STATELESS
- PREFER INSERT TO UPDATE
- PARTITION BY TIME
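The task guidelines on this slide (idempotent, stateless, prefer insert to update, partition by time) can be sketched as a plain Python function that a scheduler could call once per run date. This is an assumed illustration, not Airflow operator code; the function name, the `dt=` directory layout, and the sample events are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_daily_task(events, run_date: str, output_root: Path) -> Path:
    """Write the events for one day to that day's partition.

    The task depends only on its arguments (stateless) and rewrites the whole
    partition file instead of updating rows in place, so re-running it for the
    same run_date produces the same output (idempotent).
    """
    partition = output_root / f"dt={run_date}"
    partition.mkdir(parents=True, exist_ok=True)
    rows = [e for e in events if e["date"] == run_date]
    tmp_file = partition / "part-0000.json.tmp"
    tmp_file.write_text("\n".join(json.dumps(r, sort_keys=True) for r in rows))
    final_file = partition / "part-0000.json"
    tmp_file.replace(final_file)  # the rename replaces any earlier output
    return final_file


events = [
    {"date": "2018-10-10", "page": "/home"},
    {"date": "2018-10-11", "page": "/about"},
]
root = Path(tempfile.mkdtemp())
first = run_daily_task(events, "2018-10-10", root).read_text()
second = run_daily_task(events, "2018-10-10", root).read_text()
print(first == second)
```

Running the task twice for the same date leaves one file with identical content, so a retry or a backfill cannot duplicate rows.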

Slide 27

Slide 27 text

LOG SERVER, Kafka, Elasticsearch, Kibana, PROTOBUF, Hadoop

Slide 28

Slide 28 text

KAPPA ARCHITECTURE: INPUT, STREAMING, SERVING, OUTPUT, STORAGE

Slide 29

Slide 29 text

LOG SERVER: INPUT, STREAMING, SERVING, OUTPUT; PROTOBUF, DEFINED PROTOBUF SCHEMA; STORAGE: HDFS + HIVE (SERVING)

Slide 30

Slide 30 text

LAMBDA ARCHITECTURE: INPUT, STREAMING, BATCH, SERVING, OUTPUT, STORAGE
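In a Lambda architecture the serving layer answers queries by combining a batch view, recomputed from stored data, with a streaming view covering events the last batch run has not seen. The sketch below shows that merge for a page-view count. The function names and sample events are assumptions for illustration.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute page counts from all stored events."""
    return Counter(e["page"] for e in events)


def serve(batch_counts, realtime_counts, page):
    """Serving layer: combine the batch view with the streaming increment."""
    return batch_counts.get(page, 0) + realtime_counts.get(page, 0)


stored = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
recent = [{"page": "/home"}]  # events not yet covered by the batch run

batch_counts = batch_view(stored)
realtime_counts = Counter(e["page"] for e in recent)
print(serve(batch_counts, realtime_counts, "/home"))
```

A Kappa architecture drops the separate batch path and derives the same view by replaying the stream, which is why the two diagrams differ only in the batch box.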

Slide 31

Slide 31 text

DATA PIPELINE: ACQUISITION, BATCH PROCESSING & INGESTION, STREAMING PROCESSING & INGESTION, STORAGE, ACCESS

Slide 32

Slide 32 text

DATA LAKE, GATEWAY, USER

Slide 33

Slide 33 text

WE ARE HIRING!!!

Slide 34

Slide 34 text

Q&A