Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Cloud Composer & Dataflow によるバッチETLの再構築 #data_ml_engineering / 20190719

Cloud Composer & Dataflow によるバッチETLの再構築 #data_ml_engineering / 20190719

データとML周辺エンジニアリングを考える会#2の発表資料です。
https://data-engineering.connpass.com/event/136756/

yuzutas0

July 19, 2019
Tweet

More Decks by yuzutas0

Other Decks in Technology

Transcript

  1. Cloud Composer & Dataflow ʹΑΔ
 όονETLͷ࠶ߏங 2019-07-19
 #data_ml_engineering
 presented by

    @yuzutas0 
 https://www.pexels.com/photo/architecture-blur-building-colourful-392031/   https://www.pexels.com/photo/architecture-blur-building-colourful-392031/
  2.   ɹ੒௕௧ ϓϩμΫτˢ σʔλˢ ෛՙˢ ར༻ऀˢ Good Good Bad

    Bad ❌ γεςϜɺବ໨Ͱ͢ʂ ߋ৽͞Ε͍ͯͳ͍΍Μʂ
  3.   ɹྺ࢙తܦҢ ETL System ETL
 for US ETL
 for

    JP ࡞ͬͨʂ ϝϯςʂ US Team ຊۀͷ๣Β ળҙͰαϙʔτ
 ʢਖ਼௚ݶք͕͋Δʣ JP SRE JP BI JP΋ཉ͍͠ʂ
 ૬৐Γͤͯ͞ʂ ґཔ USΞϓϦΛ ྑ͘͢Δͧʂ JPΞϓϦຊ൪؀ڥ
 ͕࠷༏ઌͩʂ ෼ੳۀ຿ʹ ઐ೦͢Δͧʂ ETL
 for UK
  4.   ɹ͜ͷҊ݅ͷείʔϓᶄ Other Product
 DB .POPMJUI "11#& Other Other

    BigQuery ॱ࣍Ҡ؅༧ఆ Read Only Replica ػີ৘ใ ϚεΩϯά DB .JDSP
 TFSWJDFT DB .JDSP
 TFSWJDFT DB .JDSP
 TFSWJDFT ੴङDC GCP
  5. ຖ࣮࣌ߦ dataset.__TABLES__ ΛSELECT
 ϝλ৘ใΛεφοϓγϣοτอଘ pandas.read_csv() Ͱऔಘ
 νΣοΫ࣌ؒɺର৅ςʔϒϧ
 ௨஌ઌνϟϯωϧ pandas.read_gbq() Ͱ

    ςʔϒϧ໊ͱ
 ࠷ऴߋ৽೔࣌Λऔಘ ߋ৽༗ແΛ൑ఆ slackweb.Slack(). notify() Ͱ ࢦఆνϟϯωϧʹ௨஌   ɹBQ update checker / implementation IUUQTXXXqBUJDPODPNGSFFJDPODTW@ ύωϧσʔλΛ෼ੳͰ͖ΔΑ͏ʹ஝ੵ
  6.   ɹՄࢹԽ → ߹ҙܗ੒ ՝୊ ղܾ ΞφϦετ ʮࢭ·ͬͯΔʂʯ ʮࠓ΄͍͠ʂ࢑ఆରԠΛʂʯ

    σϕϩού ʮݴ͏΄Ͳ͔ʁʯ ʮ࠶ߏஙͨ͠΄͏͕͍͍ʂʯ ༏ઌॱΛ্͛ͯରԠʂ
  7. ̋ ̋ ˕ ˕   ɹൺֱݕ౼ γεςϜ αϙʔτ 64


    &5-4ZTUFN "JSqPXPO(,& 4QBSLFBSMZ
 
 νϡʔχϯά͢Ε͹ػೳཁ݅ΛຬͨͤΔ ͸ͣ ஍ཧɾ͕࣌ࠩ͋Δ
 
 ඇಉظͰ૬ஊ͸Մೳ .FSQBZ
 #BUDI1JQFMJOF $MPVE$PNQPTFS %BUBqPXMBUFMZ
 
 ػೳཁ݅ΛຬͨͤΔ
 GVMMNBOBHFEͰ૬ରతʹ࢖͍΍͍͢ ͸ͣ ෺ཧతʹΦϑΟε͕͍ۙ
 
 ૬ஊ͠΍͍͢
  8. ໌Β͔ʹ “ETLγεςϜઃܭ” ͷ໰୊Ͱ͸ͳ͘
 ”JPઐ೚ϝϯςφͷ௕ظෆࡏ” ͱ “ͦ͏ͳΔʹࢸͬͨ૊৫తྗֶ” ͕
 ਅʹղ͘΂͖Πγϡʔ 
 


    “σʔλૄ௨͕ࢭ·͍ͬͯΔ” ͸ණࢁͷҰ֯
 ͳΔ΂͘ϚΠϯυγΣΞΛׂ͔ͣʹࡁΉΑ͏ʹ
 “͍͔ʹٕज़໘ͰϥΫͯ͠ରԠ͢Δ͔” ͕ҙࢥܾఆͷ࣠ͱͳΔ   ɹҙࢥܾఆͷϙΠϯτ IUUQTXXXJSBTVUPZBDPNCMPHQPTU@IUNM
  9.   ɹՄࢹԽ → ߹ҙܗ੒ ՝୊ ղܾ ΞφϦετ ʮࢭ·ͬͯΔʂʯ ʮࠓ΄͍͠ʂ࢑ఆରԠΛʂʯ

    σϕϩού ʮݴ͏΄Ͳ͔ʁʯ ʮ࠶ߏஙͨ͠΄͏͕͍͍ʂʯ ϑΥʔΧε͢Δ ࢑ఆରԠʹ࣌ؒɾ࿑ྗΛׂ͔ͳ͍
  10.   ɹWhy Dataflow? ɾmysqldumpͷTSVϑΥʔϚοτͰ͸BigQueryʹLoadͰ͖ͳ͍ → ཁ੔ܗ
 ɹɹɾdouble-quotation-marks escaped by

    double-quotation-marks in double-quotation-marks
 ɹɹɾnew-line escaped by double backslashes 
 ɾσʔλྔ͕ଟ͍ͷͰDBෛՙˍύϑΥʔϚϯε؍఺͔Β
 ɹεέʔϥϏϦςΟͷߴ͍DataflowʹॲཧΛدͤͨ
 
 ɾDataflow͸ม׵૷ஔͱͯ͠੹຿ΛׂΓ੾͍ͬͯΔͷͰ ɹDataflow → BigQuery ʹ௚LoadͤͣɺGCSʹม׵ޙϑΝΠϧΛஔ͍͍ͯΔ
 
 ɾ࣮ߦ؀ڥ͸Python3.5 (supported at Apache Beam 2.11.0 / Mar 5, 2019)
  11.   ɹComposer → BQ: ࠩ෼ߋ৽ ݩςʔϒϧ + tmpςʔϒϧ
 ˠ

    Union ALL → ॏෳআڈ → ্ॻ͖ tmpςʔϒϧΛ࡟আ ࠩ෼σʔλΛtmpςʔϒϧʹload ৄ͘͠͸ҎԼͷهࣄΛࢀর͍ͩ͘͞ʂ ਺ඦGBͷσʔλΛMySQL͔ΒBigQuery΁ಉظ͢Δ
 https://tech.mercari.com/entry/2018/06/28/100000
  12.   ɹRebuilt BQ / docs for user (3) ʢ൒݄์ஔ͞Ε͍ͯΔʣݱঢ়ΑΓ͸


    ʮϚγʹͳΔʯͰσʔλར༻ऀͱѲΔ ɹɾա৒඼࣭ʹ͠ͳ͍ ɹɾܭଌʢ஗Ԇ؂ࢹʣͱαϙʔτ͸໌ه ɹɾᐆດͳ΋ͷ͸ᐆດͰ͋Δ͜ͱΛ໌ه
  13. Sprint + Increment: ܧଓతվળͷϦζϜΛ࡞Δ   ɹִिස౓Ͱஈ֊ϦϦʔε W W W

    0QT Ұ෦ͷνʔϜ͔Βఏڙ ࣍ͷνʔϜʹ΋ఏڙ ʜʜ ར༻Ҋ಺W
 2"ɾϑΟʔυόοΫ ར༻Ҋ಺W 2"ɾϑΟʔυόοΫ ʜʜ %BUB શ݅ߋ৽ͰࡁΉςʔϒϧ ࠩ෼ߋ৽͠ͳ͍ͱਏ͍ςʔϒϧ ʜʜ NZTRMEVNQͰ$47ϑΝΠϧ͕
 (#ҎԼʹ෼ׂ͞ΕΔςʔϒϧ %BUBqPXͰ$47Λ෼ׂ͠ͳ͍ͱ
 #2-PBE͕ࣦഊ͢Δςʔϒϧ ʜʜ վળ վળ վળ վળ վળ վળ վળ վળ
  14. [BI / PM] @mattsun, @shoei, @hase-ryo, @hikaru, @nakatomo,
 ɹɹɹɹ @natsume,

    @igachan-san, @tsudar, @anboo, @hiza
 [JP Dev] @siroken3, @shoe116, @ichirin2501, @bokko, @catatsuy, @shinpei 
 [Merpay Dev] @laughingman7743, @syucream, @cocoiti, @kazegusuri, @sfujjiwara 
 [US Dev/ML] @hatone, @yu 
 [JP ML / Search] @furusawa, @tairosan   ɹSpecial Thanks account-name in team Slack
  15.   ɹࠓޙͷ՝୊ of Batch ETL in Mercari JP ୹ظ

    l࢖ΘΕΔzج൫ͷຏ͖ࠐΈ ϓϩμΫτϚωδϝϯτ γεςϜ։ൃ XJUI#*43&%BUB1MBUGPSN தظ lഁյͱ૑଄z͔Βlܭଌͱվળz΁ͷγϑτ αʔϏεϚωδϝϯτʢ*5*-ʣ σʔλϚωδϝϯτʢ%.#0,ʣ XJUIIBTFSZPTBO ௕ظ lہॴ࠷దz͔Βͷ୤٫ શࣾσʔλઓུࡦఆʢ%BUB0QTʣ XJUIUBJSPTBO