ETL Processing with Digdag

Meetup on Data and ML-Adjacent Engineering #2

https://data-engineering.connpass.com/event/136756/

#data_ml_engineering

tosametal

July 19, 2019

Transcript

  1. ETL Processing with Digdag
    Meetup on Data and ML-Adjacent Engineering #2
    2019.07.19
    中野翔太 (@tosametal)

    MicroAd, Inc.
    Application Engineer


  2. Machine learning at MicroAd
    CTR prediction, CVR prediction, fraudulent click detection, and similar tasks in the ad delivery system


  3. Log infrastructure architecture
    Diagram components:
    Imp Server, Click Server, RTB Server
    Kafka
    Hadoop (data warehouse)
    Digdag
    Hadoop (analytics platform)


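  4. Log infrastructure architecture
    Diagram components:
    Imp Server, Click Server, RTB Server
    Kafka
    Hadoop (data warehouse)
    Digdag
    Hadoop (analytics platform)
    Annotations on the diagram:
    at least once
    Deduplication by unique ID
    Managed by session
    Idempotent processing
    Kafka
    kafka is specified as the secondary
    Structured data in JSON format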
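
One way to read the "managed by session" and "idempotent processing" annotations is that each Digdag run is keyed by its session time, so re-running a session rewrites the same slice of data. The sketch below assumes that reading. The workflow name, the script `queries/load_logs.sql`, and the Hive variable `target_date` are hypothetical; `${session_date}` is Digdag's built-in session date variable.

```yaml
# load_logs.dig (hypothetical workflow name)
timezone: Asia/Tokyo

+load_logs:
  # ${session_date} is the date of the session (yyyy-MM-dd), not the
  # wall-clock date, so a re-run of the same session targets the same day.
  # The SQL file is assumed to overwrite that day's partition.
  sh>: hive --hivevar target_date=${session_date} -f queries/load_logs.sql
```

Because the target is derived from the session rather than from the time the task happens to run, a retry or a manual re-run does not shift to a different partition.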


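  5. What is Digdag?
    Workflows are described declaratively in dig files
    Workflow as code
    Scheduled execution and recovery
    Progress can be checked and runs re-executed from the UI
    Custom operators can be written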


  6. Batch execution platform architecture
    PostgreSQL: stores execution history and other state
    A container that acts as a Hadoop client is launched for each task
    Can scale out


  7. Keeping workflows readable
    while controlling complex dependencies


  8. Split projects by function
    What is a project?
    In Digdag, workflows are packaged together with other files used in the workflows. The files can be
    anything such as SQL scripts, Python/Ruby/Shell scripts, configuration files, etc. This set of the workflow
    definitions is called project.
    Quoted from the official documentation (http://docs.digdag.io/)
    About 60 projects are currently running at MicroAd


  9. Dependencies within a project
    schedule:
      daily>: 12:00:00
    +task1:
      _parallel: true
      +subtask1:
        call>: subtask1.dig
      +subtask2:
        call>: subtask2.dig
    +task2:
      echo>: task finished successfully
    ● The call operator lets a workflow be split across multiple dig files
    ● With require, somewhat more complex DAGs can also be expressed
    Diagram: subtask1 and subtask2 run in parallel, followed by task2
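
The `require>` remark on this slide can be illustrated with a minimal sketch. It assumes a second workflow file, `aggregate.dig`, exists in the same project; the workflow and task names are hypothetical and do not come from the talk.

```yaml
# report.dig (hypothetical). Assumes aggregate.dig exists in the same project.
+wait_for_aggregate:
  # Starts the "aggregate" workflow for the same session time unless it is
  # already running or done, then waits until it completes.
  require>: aggregate

+build_report:
  echo>: aggregate has finished for this session
```

Unlike `call>`, which always runs the referenced file as part of the current workflow, `require>` skips starting a workflow that is already running or done for the session. Several workflows can therefore depend on the same upstream workflow without running it more than once.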


  10. Dependencies between projects
    Project A
    Project B
    A project cannot directly see the results of another project


  11. Dependencies between projects
    +touch_task:
      s3_touch>: bucket/flag/fileX
    +wait_task:
      s3_wait>: bucket/flag/fileX
    Diagram labels: Project A, Project B, fileX
    Custom operator
    Reference: https://github.com/tosametal/digdag-plugins


  12. Other tips
    Make the entire workflow idempotent
    ● Hive queries use insert overwrite
    ● distcp is run with the overwrite and delete options
    Configure retries
    ● exponential interval
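
The points on this slide could be combined roughly as follows. This is a sketch under stated assumptions, not the speaker's actual workflow: the task names, the SQL file, and the HDFS paths are hypothetical placeholders, and the retry limit and interval are arbitrary. `_retry` with `interval_type: exponential` is Digdag's task-level retry setting, and `-overwrite` and `-delete` are standard `hadoop distcp` options.

```yaml
# Hypothetical workflow: names, paths, and retry numbers are placeholders.
+aggregate:
  # Retry up to 3 times; with interval_type exponential the wait grows
  # after each failed attempt, starting from 10 seconds.
  _retry:
    limit: 3
    interval: 10
    interval_type: exponential
  # The SQL file is assumed to use INSERT OVERWRITE, so a retried or re-run
  # task replaces its output instead of appending duplicate rows.
  sh>: hive -f queries/aggregate_daily.sql

+copy_to_analytics:
  _retry:
    limit: 3
    interval: 10
    interval_type: exponential
  # -overwrite rewrites existing files and -delete removes destination files
  # that no longer exist in the source, so each run mirrors the source.
  sh>: hadoop distcp -overwrite -delete hdfs://dwh/path/to/src hdfs://analytics/path/to/dst
```

Retries are only safe because each step is idempotent: running a task twice leaves the same result as running it once.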


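  13. Summary
    ● Split projects by function so that they do not grow too large
    ● Resolve dependencies between projects with s3_wait
    ● Build plugins for frequently used functionality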

