
Building a Data Platform with embulk and digdag

Slides on building the data platform at Timee, Inc.

Toshiki Tsuchikawa

June 09, 2020

Transcript

 1. 2020/06/09
  4th meetup for thinking "positively" about data architects (data wranglers)
  Building a Data Platform with embulk and digdag
  Toshiki Tsuchikawa, Timee, Inc.

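 2. About me
  Toshiki Tsuchikawa
  • March 2020: graduated from the School of Computing, Tokyo Institute of Technology
  • DRE, Growth Team @ Timee, Inc.
  - Building and operating the data platform
  - Analytics and A/B testing
  • Doctoral program @ Tokyo Tech
  - Supercomputing and machine learning
  • Twitter @tvtg_24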

 3. Diagram of the data platform we built

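 4. What are embulk and digdag?
  embulk
  • An open-source bulk data transfer tool
  • Plugins let you specify inputs, outputs, filters and so on, which makes flexible data transfer possible
  • Parallel processing keeps transfer times short
  digdag
  • A tool for running, scheduling, and monitoring tasks
  • Complex workflows can be defined with features such as grouping
  • Error handling and retries are easy to write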

 5. Database → BigQuery

 6. Database → BigQuery
  Input data from the database into embulk one table at a time
  Mask the data inside embulk and output it to BigQuery
  Schedule with digdag and run incrementally

 7. Database → BigQuery
  Input data from the database into embulk one table at a time
  Mask the data inside embulk and output it to BigQuery
  Schedule with digdag and run incrementally

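 8. Database → BigQuery
  Flow: all table names → fetch data (one table at a time) → masking
  • embulk processes the tables one at a time
  ‣ Columns differ from table to table (column names are defined as the schema names)
  ‣ Masking logic differs from table to table (uses the ruby_proc plugin)
  ‣ Example: phone number 090-5333-2222 → 080-9446-3523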
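The phone-number example suggests masking that keeps the format but changes the digits. The deck does this inside embulk with the ruby_proc plugin; the sketch below is only a Python illustration of the idea, not the author's code. The function name `mask_phone`, the key `SECRET_KEY`, and the use of HMAC-SHA256 to derive replacement digits are all assumptions.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secret store.
SECRET_KEY = b"example-secret"


def mask_phone(phone: str, key: bytes = SECRET_KEY) -> str:
    """Replace every ASCII digit with a pseudo-random digit, keeping separators.

    The same input always maps to the same output for a given key.
    """
    digest = hmac.new(key, phone.encode("utf-8"), hashlib.sha256).digest()
    masked = []
    digit_index = 0
    for ch in phone:
        if "0" <= ch <= "9":
            # One digest byte per digit; wrap around for very long inputs.
            masked.append(str(digest[digit_index % len(digest)] % 10))
            digit_index += 1
        else:
            masked.append(ch)
    return "".join(masked)


if __name__ == "__main__":
    print(mask_phone("090-5333-2222"))
```

Because the mapping is deterministic, the same phone number is masked to the same value on every run, which may or may not be what a given table needs.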

 9. Database → BigQuery
  Input data from the database into embulk one table at a time
  Mask the data inside embulk and output it to BigQuery
  Schedule with digdag and run incrementally

 10. Database → BigQuery
  table_A
  - id
  - …
  - …
  - updated_at
  Save the last updated_at and fetch only rows newer than it (> updated_at)
  """SELECT * EXCEPT(rn)
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM (SELECT * FROM BQ_DATASET.`{0}`))
  WHERE rn = 1
  ORDER BY id""".format(digdag.env.params['UPDATE_TABLE'])
  Partition by id and order by updated_at to keep only the newest row for each id
  https://tech.mercari.com/entry/2018/06/28/100000
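The query keeps, for each id, only the row with the newest updated_at. To check that rule without BigQuery, here is a minimal in-memory Python sketch of the same logic, together with the "save the last updated_at" step. The row layout (dicts with `id` and `updated_at`) and both function names are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, List, Optional


def latest_per_id(rows: List[dict]) -> List[dict]:
    """Keep the newest row per id (the rn = 1 rows), ordered by id."""
    newest: Dict[int, dict] = {}
    for row in rows:
        current = newest.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            newest[row["id"]] = row
    return [newest[key] for key in sorted(newest)]


def next_watermark(rows: List[dict], previous: Optional[datetime]) -> Optional[datetime]:
    """Return the largest updated_at seen so far, to be saved for the next run."""
    values = [row["updated_at"] for row in rows]
    if previous is not None:
        values.append(previous)
    return max(values) if values else None


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "a", "updated_at": datetime(2020, 3, 1)},
        {"id": 2, "name": "b", "updated_at": datetime(2020, 3, 2)},
        {"id": 1, "name": "a2", "updated_at": datetime(2020, 3, 5)},
    ]
    print(latest_per_id(sample))
    print(next_watermark(sample, None))
```

When two rows for the same id share an updated_at, this sketch keeps the first one it sees; the SQL version makes no promise about which of the tied rows gets rn = 1.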

 11. Logs (S3, etc.) → BigQuery

 12. Logs (S3, etc.) → BigQuery
  Extract the logs we want from the logs in storage
  Convert the logs into a format that embulk and BigQuery can handle
  Run incrementally based on the date information in storage
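Running incrementally from date information could look like the sketch below, which lists the hourly key prefixes that have not been processed yet. The YYYY/MM/DD/HH layout follows the s3://.../2020/03/10/09 path that appears in this deck; the function name, the idea of storing the last processed hour, and skipping the current (possibly incomplete) hour are assumptions.

```python
from datetime import datetime, timedelta
from typing import List


def pending_prefixes(last_done: datetime, now: datetime) -> List[str]:
    """List YYYY/MM/DD/HH prefixes after last_done, up to the last full hour before now."""
    prefixes = []
    hour = last_done.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    while hour < current_hour:
        prefixes.append(hour.strftime("%Y/%m/%d/%H"))
        hour += timedelta(hours=1)
    return prefixes


if __name__ == "__main__":
    for prefix in pending_prefixes(datetime(2020, 3, 10, 7), datetime(2020, 3, 10, 10, 30)):
        print(prefix)
```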

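 13. Logs (S3, etc.) → BigQuery
  {method:”GET”…
  [cont-init.d…
  {severity: …
  {method:”POST”
  Logs in various formats are mixed together
  cat "$file" | jq -e -r -R 'fromjson? | .log' | jq -c . | grep '^{"method' | sponge "$file"
  s3://.../2020/03/10/09
  {method:”GET”…
  {method:”POST”
  s3://.../2020/03/10/09
  Process the files one at a time
  Extract only the columns we want and specify them as the BigQuery schema
  ※ Some characters, such as @, are not allowed in schema names
  Sometimes the logs we want are not there in the first place…! (timestamp, etc.)
  Publish the minimum required log information in advance
  When something has to be added, the DRE does the coding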
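As a rough Python counterpart to the jq/grep pipeline, the sketch below keeps only embedded log entries whose first key starts with "method", and also shows one way to turn keys such as `@timestamp` into schema-safe names. It is an illustration under assumptions, not the author's code: the function names are made up, non-JSON `.log` values are skipped here instead of being passed to a second jq, and the rule "letters, digits and underscores, not starting with a digit" is assumed as the target naming convention.

```python
import json
import re
from typing import Iterable, Iterator


def extract_request_logs(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for embedded logs whose first key starts with 'method'."""
    for line in lines:
        try:
            outer = json.loads(line)
        except json.JSONDecodeError:
            continue  # similar to `fromjson?`: skip lines that are not JSON
        if not isinstance(outer, dict):
            continue
        inner = outer.get("log")
        if isinstance(inner, str):
            try:
                inner = json.loads(inner)
            except json.JSONDecodeError:
                continue  # plain-text entries such as "[cont-init.d ..."
        if not isinstance(inner, dict):
            continue
        compact = json.dumps(inner, separators=(",", ":"), ensure_ascii=False)
        if compact.startswith('{"method'):
            yield compact


def to_column_name(key: str) -> str:
    """Map a log key to a name made only of letters, digits and underscores."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", key)
    if not name or name[0].isdigit():
        name = "_" + name
    return name
```

Note that two different keys can collapse to the same column name (for example `a.b` and `a@b`), so a real pipeline would need to decide how to handle such collisions.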

 14. Error handling

 15. Error handling
  The BigQuery sync fails roughly once every few dozen runs
  https://github.com/szyn/digdag-slack
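For an occasional failure like this, a common pattern is to retry a few times and only notify someone when every attempt has failed. digdag has its own `_retry` and `_error` directives for this, and the linked plugin is for Slack notifications; the Python sketch below merely illustrates the behaviour and is not taken from the deck. `run_with_retry`, its parameters, and the plain `notify` callback (standing in for a real Slack call) are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    notify: Callable[[str], None],
    retries: int = 3,
    wait_seconds: float = 0.0,
) -> T:
    """Run task, retrying on any exception; notify and re-raise if all attempts fail."""
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                notify(f"task failed after {attempts} attempts: {exc}")
                raise
            time.sleep(wait_seconds)
    raise AssertionError("unreachable")
```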

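 16. Error handling
  Errors occur when a schema is changed or something is deleted (mainly on the database side)
  table_A (before)
  - id
  - …
  - name
  - updated_at
  table_A (after)
  - id
  - …
  - last_name
  - updated_at
  All the data has to be deleted once and reloaded
  (manually run, for the specific table, a workflow that refreshes everything)
  I don't want to read the error message every time…
  Who is going to fix this error…?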

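 17. Error handling
  Errors occur when a schema is changed or something is deleted (mainly on the database side)
  table_A (before)
  - id
  - …
  - name
  - updated_at
  table_A (after)
  - id
  - …
  - last_name
  - updated_at
  It is easy to see who is responsible!!
  Debugging is easier too!!
  Pull Request
  Monitor schema information
  Add as an issue
  Assign an owner
  Notify the owner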
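The monitoring flow (detect a schema change, open an issue, assign and notify an owner) can be sketched as follows. This is a minimal illustration under assumptions: the function names, the payload fields, and the idea of comparing two column lists are hypothetical, and no issue tracker API is actually called.

```python
from typing import Dict, List


def diff_columns(before: List[str], after: List[str]) -> Dict[str, List[str]]:
    """Return columns that disappeared and columns that appeared, keeping order."""
    return {
        "removed": [c for c in before if c not in after],
        "added": [c for c in after if c not in before],
    }


def build_issue(table: str, diff: Dict[str, List[str]], assignee: str) -> Dict[str, object]:
    """Assemble a hypothetical issue payload; no API call is made here."""
    body = "\n".join([
        f"Schema change detected in `{table}`.",
        f"Removed columns: {', '.join(diff['removed']) or '(none)'}",
        f"Added columns: {', '.join(diff['added']) or '(none)'}",
    ])
    return {"title": f"Schema change: {table}", "body": body, "assignees": [assignee]}


if __name__ == "__main__":
    change = diff_columns(["id", "name", "updated_at"], ["id", "last_name", "updated_at"])
    print(build_issue("table_A", change, "pr-author"))
```

Assigning the issue to whoever made the schema change is one way to get the "easy to see who is responsible" effect described on this slide.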