Doing ETL Processing with Digdag


データとML周辺エンジニアリングを考える会 #2 (a meetup on data and ML-adjacent engineering)

https://data-engineering.connpass.com/event/136756/

#data_ml_engineering


tosametal

July 19, 2019

Transcript

  1. Doing ETL Processing with Digdag. データとML周辺エンジニアリングを考える会 #2, 2019.07.19. Shota Nakano (@tosametal), Application Engineer, MicroAd, Inc.

  2. Machine learning at MicroAd: CTR prediction, CVR prediction, fraudulent click detection, and similar tasks in the ad delivery system

  3. Log infrastructure architecture. Diagram components: Imp Server, Click Server, RTB Server, Kafka, Hadoop (data warehouse), Digdag, Hadoop (analytics platform)

  4. Log infrastructure architecture (same diagram, with annotations)
     • At-least-once delivery; duplicates are removed using a unique ID
     • Managed by session; idempotent processing
     • Kafka: kafka is specified as the secondary
     • Structured data in JSON format

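  5. What is Digdag?
     • Workflows are described declaratively in dig files (workflow as code)
     • Scheduled execution and recovery
     • Progress can be checked and workflows rerun from the UI
     • Custom operators can be written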

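  6. Batch execution infrastructure architecture
     • PostgreSQL stores execution history and related state
     • For each task, a container that acts as a Hadoop client is launched
     • Can be scaled out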

  7. Keeping workflows readable while controlling complex dependencies

  8. Split projects by function
     What is a project?
     "In Digdag, workflows are packaged together with other files used in the workflows. The files can be anything such as SQL scripts, Python/Ruby/Shell scripts, configuration files, etc. This set of the workflow definitions is called project." (quoted from the official documentation, http://docs.digdag.io/)
     At MicroAd, about 60 projects are currently running.

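  9. Dependencies within a project

     schedule:
       daily>: 12:00:00

     +task1:
       _parallel: true
       +subtask1:
         call>: subtask1.dig
       +subtask2:
         call>: subtask2.dig

     +task2:
       echo>: task finished successfully

     • The call operator makes it possible to split a workflow across multiple dig files
     • With require, somewhat more complex DAGs can also be expressed
     Diagram: subtask1 and subtask2 run in parallel, followed by task2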
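
As a rough illustration of the `require>` remark on slide 9, the sketch below makes one workflow depend on another workflow in the same project. The file and workflow names (`main.dig`, `prepare_inputs`, `aggregate.sh`) are hypothetical and do not come from the talk.

```yaml
# main.dig (hypothetical file name)

# require> starts the workflow named prepare_inputs (defined in
# prepare_inputs.dig in the same project) for this session time if it
# has not run yet, then waits for it to complete.
+wait_for_inputs:
  require>: prepare_inputs

# Runs only after prepare_inputs has completed successfully.
+aggregate:
  sh>: ./aggregate.sh
```

Unlike `call>`, which embeds another dig file as a subtask, `require>` depends on a separate workflow run, so several downstream workflows can name the same upstream workflow.
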
  10. Dependencies between projects: Project A and Project B. A project cannot directly see the results of another project.

  11. Dependencies between projects (diagram: Project A, Project B, fileX)

      +touch_task:
        s3_touch>: bucket/flag/fileX

      +wait_task:
        s3_wait>: bucket/flag/fileX

      Custom operator. Reference: https://github.com/tosametal/digdag-plugins

  12. Other practices
      Make the entire workflow idempotent
      • Hive queries use insert overwrite
      • distcp is run with the overwrite and delete options
      Configure retries
      • exponential interval
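
The retry and idempotency points on slide 12 might be combined in a single task roughly as follows. This is a minimal sketch, not the speaker's actual configuration: the file name, cluster URIs, and paths are placeholders, and the retry limit and interval are arbitrary.

```yaml
# copy_logs.dig (hypothetical file name; cluster URIs and paths are placeholders)
+copy_logs:
  # Retry up to 3 times. With interval_type: exponential, the wait
  # grows with each retry, starting from 10 seconds.
  _retry:
    limit: 3
    interval: 10
    interval_type: exponential
  # -overwrite together with -delete makes the destination mirror the
  # source, so rerunning the copy produces the same result.
  sh>: hadoop distcp -overwrite -delete hdfs://src/logs/${session_date} hdfs://dst/logs/${session_date}
```

Retries are only safe here because the copy is idempotent: a retried attempt overwrites whatever a failed attempt left behind instead of appending to it.
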
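  13. Summary
      • Split projects by function so that they do not become bloated
      • Resolve dependencies between projects with s3_wait
      • Build plugins for frequently used functionality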
