
Building a Data Platform with embulk and digdag

Slides on building the data platform at Timee, Inc.

Toshiki Tsuchikawa

June 09, 2020

Transcript

  1. 2020/06/09
    The 4th meetup for thinking "positively" about data architects (data maintainers)
    Building a data platform with embulk and digdag
    Timee, Inc. Toshiki Tsuchikawa

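  2. Self-introduction
    Toshiki Tsuchikawa
    • March 2020: graduated from the School of Computing, Tokyo Institute of Technology
    • DRE, GrowthTeam @ Timee, Inc.
      - Building and operating the data platform
      - Analysis, A/B testing
    • Graduate school, latter-term program @ Tokyo Tech
      - Supercomputers, machine learning
    • Twitter @tvtg_24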

  3. Diagram of the data platform we built

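  4. What are embulk and digdag?
    embulk
    • Open-source bulk data transfer tool
    • Flexible data transfer: input, output, filter, and other stages are specified through a wide range of plugins
    • Parallel processing keeps transfer times short
    digdag
    • Tool for running, scheduling, and monitoring tasks
    • Complex workflows can be defined with features such as grouping
    • Error handling and retries are easy to write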

  5. Database → BigQuery

  6. Database → BigQuery
    Input data from the database into embulk one table at a time
    Apply masking inside embulk and output to BigQuery
    Schedule with digdag and run incrementally

  7. Database → BigQuery
    Input data from the database into embulk one table at a time
    Apply masking inside embulk and output to BigQuery
    Schedule with digdag and run incrementally

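  8. Database → BigQuery
    Diagram labels: all table names, data fetch, per table, masking
    • Tables are processed by embulk one at a time
      ‣ Columns differ per table (column names are defined as the schema names)
      ‣ Masking differs per table (using the ruby_proc plugin)
      ‣ Example: phone number 090-5333-2222 → 080-9446-3523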
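The phone-number example above swaps a real number for a different one in the same format. The deck does this inside embulk with the ruby_proc plugin, so the following Python sketch is only an illustration of the idea, not the author's code. The function name `mask_phone`, the salt value, and the use of SHA-256 are all assumptions.

```python
import hashlib


def mask_phone(phone: str, salt: str = "example-salt") -> str:
    """Replace each digit with a digit derived from a salted hash.

    Separators such as '-' are kept, so the masked value has the same
    layout as the input. The same input always yields the same output.
    (Hypothetical helper; salt and hash choice are illustrative.)
    """
    digest = hashlib.sha256((salt + phone).encode("utf-8")).hexdigest()
    out = []
    used = 0
    for ch in phone:
        if ch.isdigit():
            # Map one hex character of the digest to a decimal digit.
            out.append(str(int(digest[used % len(digest)], 16) % 10))
            used += 1
        else:
            out.append(ch)
    return "".join(out)


masked = mask_phone("090-5333-2222")
```

Because the mapping is deterministic, the same source value is masked to the same output on every run, which keeps joins on the masked column stable. A real setup would keep the salt secret.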

  9. Database → BigQuery
    Input data from the database into embulk one table at a time
    Apply masking inside embulk and output to BigQuery
    Schedule with digdag and run incrementally

  10. Database → BigQuery
    table_A
    - id
    - …
    - …
    - updated_at
    Save the last updated_at
    > updated_at
    SELECT * EXCEPT(rn)
    FROM (SELECT *, row_number() over (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM (SELECT * FROM BQ_DATASET.`{0}`))
    WHERE rn = 1
    ORDER BY id".format(digdag.env.params['UPDATE_TABLE'])
    Partition by id, order by updated_at, and keep only the newest row for each id
    https://tech.mercari.com/entry/2018/06/28/100000
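The incremental step on this slide has two parts: fetch only rows whose updated_at is newer than the saved value, then keep the newest row per id. A minimal in-memory Python sketch of the same logic is shown here to make the SQL easier to follow. The function names and sample rows are hypothetical, and the real pipeline runs this in BigQuery, not in Python.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def newer_than(rows: Iterable[Row], last_updated_at: str) -> List[Row]:
    """Keep rows changed after the saved watermark (the '> updated_at' filter)."""
    return [r for r in rows if r["updated_at"] > last_updated_at]


def latest_per_id(rows: Iterable[Row]) -> List[Row]:
    """Keep the newest row for each id, ordered by id.

    Mirrors ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC)
    followed by WHERE rn = 1 ORDER BY id.
    """
    latest: Dict[Any, Row] = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return [latest[key] for key in sorted(latest)]


# Illustrative data: id 1 was updated after the first sync.
rows = [
    {"id": 1, "name": "a", "updated_at": "2020-03-01T00:00:00"},
    {"id": 2, "name": "x", "updated_at": "2020-03-02T00:00:00"},
    {"id": 1, "name": "b", "updated_at": "2020-03-05T00:00:00"},
]
deduped = latest_per_id(rows)
changed = newer_than(rows, "2020-03-01T00:00:00")
```

ISO 8601 timestamps in one fixed format compare correctly as strings, which is why plain `>` works in this sketch.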

  11. Log data (S3, etc.) → BigQuery

  12. Log data (S3, etc.) → BigQuery
    Extract the logs we want from the logs in storage
    Convert the logs into a format that embulk and BigQuery can handle
    Run incrementally based on the date information in storage
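Running incrementally from the storage date information can be pictured as listing the date-based prefixes that have not been processed yet. The sketch below assumes hourly prefixes of the form YYYY/MM/DD/HH and a saved "last processed hour"; both the layout and the function name `pending_prefixes` are assumptions, since the deck does not show how the workflow tracks progress.

```python
from datetime import datetime, timedelta
from typing import List


def pending_prefixes(last_done: datetime, now: datetime) -> List[str]:
    """List YYYY/MM/DD/HH prefixes after last_done, excluding the current hour.

    The current hour is skipped because its log files may still be written.
    """
    cursor = last_done.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    end = now.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while cursor < end:
        prefixes.append(cursor.strftime("%Y/%m/%d/%H"))
        cursor += timedelta(hours=1)
    return prefixes


todo = pending_prefixes(datetime(2020, 3, 10, 7), datetime(2020, 3, 10, 10, 15))
```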

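  13. Log data (S3, etc.) → BigQuery
    {method:”GET”…
    [cont-init.d…
    {severity: …
    {method:”POST”
    Logs in various formats are mixed together
    cat $file | jq -e -r -R 'fromjson? | .log' | jq . -c | grep '^{"method' | sponge $file
    s3://.../2020/03/10/09
    {method:”GET”…
    {method:”POST”
    s3://.../2020/03/10/09
    Process the files one at a time
    Extract only the columns we want and specify them as the BigQuery schema
    ※ Some characters, such as @, are not allowed in schema names
    Sometimes the log fields we want are not there in the first place...! (timestamp, etc.)
    Publish the minimum log information we need in advance
    When fields are added, the DRE does the coding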
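The shell pipeline on this slide keeps only the request logs (lines whose payload starts with `{"method`) out of a mixed file. A rough Python equivalent is sketched below for readers less familiar with jq. It assumes each usable line is a JSON object whose `log` field holds the application log as a JSON string, which is what `.log` in the jq filter suggests; the function name and sample lines are hypothetical. Unlike the shell version, it silently skips payloads that are not valid JSON.

```python
import json
from typing import Iterable, List


def extract_request_logs(lines: Iterable[str]) -> List[str]:
    """Return compact JSON strings for log payloads that start with {"method."""
    kept = []
    for line in lines:
        try:
            wrapper = json.loads(line)  # like jq -R 'fromjson?'
        except ValueError:
            continue  # not JSON at all, e.g. "[cont-init.d] ..."
        if not isinstance(wrapper, dict):
            continue
        payload = wrapper.get("log")  # like '.log'
        if isinstance(payload, str):
            try:
                payload = json.loads(payload)
            except ValueError:
                continue
        compact = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
        if compact.startswith('{"method'):  # like grep '^{"method'
            kept.append(compact)
    return kept


# Illustrative input mixing several log shapes.
sample = [
    json.dumps({"log": json.dumps({"method": "GET", "path": "/"})}),
    "[cont-init.d] not json",
    json.dumps({"log": json.dumps({"severity": "INFO"})}),
    json.dumps({"log": json.dumps({"method": "POST", "path": "/jobs"})}),
]
filtered = extract_request_logs(sample)
```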

  14. Error handling

  15. Error handling
    The BigQuery sync fails roughly once every few dozen runs
    https://github.com/szyn/digdag-slack
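The deck does not show how these occasional failures are handled beyond linking a Slack notification plugin for digdag. One common pattern for a failure that happens only once in a while is to retry a few times and notify only if every attempt fails. The Python sketch below illustrates that pattern in general terms; `run_with_retry`, its parameters, and the `notify` callback are hypothetical and are not digdag or digdag-slack APIs.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    retries: int = 3,
    wait_seconds: float = 0.0,
    notify: Optional[Callable[[str], None]] = None,
) -> T:
    """Run task up to `retries` times; notify and re-raise if all attempts fail."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < retries:
                time.sleep(wait_seconds)
    if notify is not None:
        notify(f"task failed after {retries} attempts: {last_error}")
    assert last_error is not None
    raise last_error


# Illustrative flaky task: fails once, then succeeds.
calls: List[int] = []
messages: List[str] = []


def flaky_sync() -> str:
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("transient sync failure")
    return "synced"


result = run_with_retry(flaky_sync, retries=3, notify=messages.append)
```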

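  16. Error handling
    Errors occur when a schema is changed or deleted (mainly on the database side)
    table_A
    - id
    - …
    - name
    - updated_at
    table_A
    - id
    - …
    - last_name
    - updated_at
    All data has to be deleted once and reloaded
    (manually run, for that specific table, a workflow that refreshes everything)
    I don't want to read the error message every time...
    Who is going to resolve this error...?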

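  17. Error handling
    Errors occur when a schema is changed or deleted (mainly on the database side)
    table_A
    - id
    - …
    - name
    - updated_at
    table_A
    - id
    - …
    - last_name
    - updated_at
    ❌
    The person responsible is easy to identify!!
    Debugging is easier too!!
    Pull Request
    Monitor the schema information
    Add as an issue
    Assign a person in charge
    Notify the person in charge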
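Monitoring schema information comes down to comparing the column list before and after a change and reporting the difference. A minimal Python sketch using the table_A example from this slide follows. The function names and the issue title format are assumptions; the deck does not show how the monitoring or issue creation is implemented.

```python
from typing import Dict, List, Sequence


def diff_columns(before: Sequence[str], after: Sequence[str]) -> Dict[str, List[str]]:
    """Return the columns that were removed and added between two schemas."""
    before_set, after_set = set(before), set(after)
    return {
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
    }


def issue_title(table: str, diff: Dict[str, List[str]]) -> str:
    """Build a short, human-readable summary for an issue (illustrative format)."""
    return (
        f"Schema change in {table}: "
        f"removed={diff['removed']}, added={diff['added']}"
    )


before = ["id", "name", "updated_at"]
after = ["id", "last_name", "updated_at"]
diff = diff_columns(before, after)
title = issue_title("table_A", diff)
```

A rename such as name → last_name shows up as one removed and one added column, which is enough to flag the table for a full reload.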