Slide 1

Slide 1 text

ETL Processing with Digdag
データとML周辺エンジニアリングを考える会 #2 (Meetup on Data and ML-Adjacent Engineering #2)
2019.07.19
Shota Nakano (@tosametal)
MicroAd, Inc., Application Engineer

Slide 2

Slide 2 text

Machine learning at MicroAd
CTR prediction, CVR prediction, fraudulent-click detection, and similar tasks in the ad delivery system

Slide 3

Slide 3 text

Log infrastructure architecture
Imp Server, Click Server, RTB Server
Kafka
Hadoop (data warehouse)
Digdag
Hadoop (analytics platform)

Slide 4

Slide 4 text

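Log infrastructure architecture
Imp Server, Click Server, RTB Server
Kafka
Hadoop (data warehouse)
Digdag
Hadoop (analytics platform)
Annotations:
at least once
Deduplication by unique ID
Managed by session
Idempotent processing
Kafka is specified as the secondary
Structured data in JSON format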
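
The combination of at-least-once delivery and deduplication by unique ID noted on this slide can be sketched as follows. This is only a minimal illustration, not the pipeline described in the talk: the field name `id` and the helper `dedupe_by_id` are hypothetical, and the records are made up.

```python
import json


def dedupe_by_id(lines, id_field="id"):
    """Yield each JSON record once, keyed by its unique ID.

    `id_field` is a hypothetical field name; the slides do not say
    what the real ID field is called.
    """
    seen = set()
    for line in lines:
        record = json.loads(line)
        key = record[id_field]
        if key in seen:
            continue  # duplicate caused by at-least-once redelivery
        seen.add(key)
        yield record


# At-least-once delivery may hand the same record over more than once.
raw = [
    '{"id": "a1", "type": "imp"}',
    '{"id": "b2", "type": "click"}',
    '{"id": "a1", "type": "imp"}',
]
unique = list(dedupe_by_id(raw))
```

Because duplicates are dropped by key, feeding the same input twice produces the same output, which is what makes a rerun safe.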

Slide 5

Slide 5 text

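What is Digdag?
Workflows are described declaratively in dig files
Workflow as code
Scheduled execution and recovery
Progress can be checked and workflows rerun from the UI
Custom operators can be written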

Slide 6

Slide 6 text

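Batch execution infrastructure architecture
PostgreSQL: stores execution history and other state
For each task, a container that acts as a Hadoop client is launched
Can scale out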

Slide 7

Slide 7 text

Keeping workflows readable while controlling complex dependencies

Slide 8

Slide 8 text

Split projects by function
What is a project?
"In Digdag, workflows are packaged together with other files used in the workflows. The files can be anything such as SQL scripts, Python/Ruby/Shell scripts, configuration files, etc. This set of the workflow definitions is called project."
Quoted from the official documentation (http://docs.digdag.io/)
At MicroAd, about 60 projects are currently running

Slide 9

Slide 9 text

Dependencies within a project

    schedule:
      daily>: 12:00:00

    +task1:
      _parallel: true

      +subtask1:
        call>: subtask1.dig

      +subtask2:
        call>: subtask2.dig

    +task2:
      echo>: task finished successfully

● The call operator makes it possible to split a workflow across multiple dig files
● Using require, somewhat more complex DAGs can also be expressed
Diagram: subtask1, subtask2, task2
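
The execution order of the workflow on this slide (the children of `+task1` run in parallel, and `+task2` starts only after both finish) can be mimicked in plain Python. This is a rough analogy rather than how Digdag is implemented; the function bodies are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def subtask1():
    return "subtask1 done"


def subtask2():
    return "subtask2 done"


def task2():
    return "task finished successfully"


def run_workflow():
    # +task1 with _parallel: true -> its children start together.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(subtask1), pool.submit(subtask2)]
        # result() blocks and re-raises, so a failed subtask stops the flow.
        group_results = [f.result() for f in futures]
    # +task2 runs only after every child of +task1 has finished.
    return group_results + [task2()]


results = run_workflow()
```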

Slide 10

Slide 10 text

Dependencies between projects
Project A, Project B
A project cannot directly see the results of another project

Slide 11

Slide 11 text

Dependencies between projects

    +touch_task:
      s3_touch>: bucket/flag/fileX

    +wait_task:
      s3_wait>: bucket/flag/fileX

Diagram: Project B, Project A, fileX
Custom operators
Reference: https://github.com/tosametal/digdag-plugins
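
The flag-file handshake behind `s3_touch>` and `s3_wait>` can be sketched without S3 at all. In the example below a temporary local directory stands in for the bucket, and `touch_flag` and `wait_for_flag` are hypothetical helpers written for illustration; they are not the operators from the plugin repository.

```python
import tempfile
import time
from pathlib import Path


def touch_flag(path):
    """Upstream project: create an empty flag file once its output is ready."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


def wait_for_flag(path, poll_seconds=0.1, timeout_seconds=5.0):
    """Downstream project: poll until the flag exists; False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if path.exists():
            return True
        time.sleep(poll_seconds)
    return path.exists()


# A temporary directory stands in for the S3 bucket.
bucket = Path(tempfile.mkdtemp())
flag = bucket / "flag" / "fileX"

missing = wait_for_flag(flag, timeout_seconds=0.3)  # nothing written yet
touch_flag(flag)
found = wait_for_flag(flag)
```

The two projects share only the flag path, so neither needs to know anything about the other's internal tasks.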

Slide 12

Slide 12 text

Other tips
Make the entire workflow idempotent
● Hive queries use INSERT OVERWRITE
● distcp is run with the overwrite and delete options
Configure retries
● exponential interval
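
Retrying with an exponential interval can be illustrated generically as below. This is a sketch of the idea, not Digdag's retry implementation: the helper name, the doubling schedule, and the parameter defaults are assumptions made for the example.

```python
import time


def retry_with_exponential_interval(func, limit=3, interval=1.0, sleep=time.sleep):
    """Call func; on failure retry up to `limit` times, doubling the wait."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= limit:
                raise  # retries exhausted: surface the last error
            sleep(interval * (2 ** attempt))  # interval, 2x, 4x, ...
            attempt += 1


# Demo: a task that fails twice, then succeeds.
# Waits are recorded in a list instead of actually sleeping.
calls = {"n": 0}
waits = []


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = retry_with_exponential_interval(
    flaky_task, limit=3, interval=10, sleep=waits.append
)
```

Retries like this are only safe when the retried step is idempotent, which is why the two points on this slide go together.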

Slide 13

Slide 13 text

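Summary
● Split projects by function so that they do not become bloated
● Resolve dependencies between projects with s3_wait
● Build plugins for frequently used functionality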

Slide 14

Slide 14 text

No content