Save 37% off PRO during our Black Friday Sale! »

こんな感じでデータパイプライン作ってます
2019年春

 こんな感じでデータパイプライン作ってます
2019年春

819238a23c270f774fa7d54c9d9aa244?s=128

Yuku TAKAHASHI

April 16, 2019
Tweet

Transcript

  1. ͜Μͳײ͡ͰσʔλύΠϓϥΠϯ࡞ͬͯ·͢
 ೥य़ !ZVLV@U  %BUB1JQFMJOF$BTVBM5BML

  2. ∁ڮါٱ UXJUUFSDPNZVLV@U
 HJUIVCDPNZVLV
 RJJUBDPNZVLV@U 4PGUXBSF&OHJOFFS!'-:8)&&- '-:8)&&-ͰϨίϝϯυΤϯδϯͱ
 ͦͷͨΊͷύΠϓϥΠϯΛ։ൃ͍ͯ͠·͢ɻ
 ೥݄ΑΓݱ৬ɻ

  3. ࠓ೔ͷ࿩ w σʔλύΠϓϥΠϯͬͯͦ΋ͦ΋ͳΜ͚ͩͬ w ϲ݄ؒύΠϓϥΠϯΛ࡞͖ͬͯͯͷॴײ w ෇࿥υΩϡϝϯτʹࡌͬͯͳ͍$MPVE$PNQPTFS5JQT

  4. SOLUTION DATA DATA PIPELINE

  5. w ༷ʑͳྺ࢙తܦҢͰग़དྷ্͕ͬͨ
 ෳࡶͰஅยԽͨ͠σʔλ w ϩά͕͋ΔΑ͏ͳͳ͍Α͏ͳ w ෆ҆ఆͳετϨʔδ DATA DATA PIPELINE

  6. w ͍ͬ͢͝ਓ޻஌ೳ͕৭ʑͳ໰୊Λ ͍͍ײ͡ʹղܾͯ͘͠ΕΔ΍ͭ SOLUTION DATA PIPELINE

  7. $0--&$5 .07&4503& &91-03&53"/4'03. "((3&("5&-"#&- -&"3/015*.*;& "*%&&1-&"3/*/( IUUQTIBDLFSOPPODPNUIFBJIJFSBSDIZPGOFFETGGDD 5)&%"5"4$*&/$& )*&3"3$):0'/&&%4 ਏ͍

    ָ͍͠
  8. CLIENT SOLUTION w ϨίϝϯυγεςϜ w ෺ྲྀ࠷దԽ w ݕࡧΤϯδϯ w ޿ࠂ഑৴γεςϜ

    w ͳͲ w ಺෦ʹ຾ΔେྔͷσʔλΛ׆༻ ͍ͨ͠ w ΞΠσΞΛ࣮ݱͰ͖Δਓࡐ͕͍ ͳ͍
  9. '-:8)&&-㱠σʔλ෼ੳ4BB4 w σʔλΛूΊΔͱ͜Ζ͔ΒιϦϡʔγϣϯΛ࡞Δͱ͜Ζ·Ͱ w ΫϥΠΞϯτͷ"84ΞΧ΢ϯτ΍($1ϓϩδΣΫτ಺ʹιϑτ΢ΣΞΛ௚઀ల։͍ͯ͠ Δɻ w ٻΊΔػೳ͕ͦ͜ʹ͋Δͱ෼͔͍ͬͯͯ΋ɺ֎෦ʹσʔλΛग़͢͜ͱ͕Ͱ͖ͳ͍اۀ ΍ۀछ͕ଘࡏ͢ΔɻʢओʹͰ͔͍ͬاۀʣ w

    ϚϧνΫϥ΢υ؀ڥԼͰ͍͔ʹιϑτ΢ΣΞࢿ࢈Λங͖ɺύΠϓϥΠϯͰܨ͍Ͱ͍͔͕͘ ࠓޙͷେ͖ͳٕज़తͳνϟϨϯδͷҰͭɻ
  10. ϚϧνΫϥ΢υύΠϓϥΠϯʂʂʂ

  11. ϚϧνΫϥ΢υύΠϓϥΠϯʂʂʂ γΣϧεΫϦϓτ͔Βͷ୤٫

  12. γΣϧεΫϦϓτʹΑΔύΠϓϥΠϯ w ΈΜͳੲ͸γΣϧεΫϦϓτͩͬͨ w DSPOͰఆظ࣮ߦ͞ΕΔγΣϧεΫϦϓτ͔Β࣮ߦ͞ΕΔ1ZUIPOεΫϦϓτ܈ w ਏ͞ w ෳࡶԽ͍ͯ͘͠ύΠϓϥΠϯʹରԠ͖͠Εͳ͍ w

    ͍ͭͲͷλεΫ͕ࣦഊͨ͠ͷ͔෼͔Βͳ͍ w ࣦഊͨ͠λεΫ͸TTIͯ͠࠶࣮ߦ
  13. CLIENT
 DATA ϨίϝϯυύΠϓϥΠϯWʢγΣϧεΫϦϓτʣ JOINED DATA RECOMMENDER SYSTEM

  14. CLIENT
 DATA DATA LAKE DATA WAREHOUSE DATA
 MART DATA SCIENTISTS

    ϨίϝϯυύΠϓϥΠϯWʢ$MPVE$PNQPTFSʣ RECOMMENDER SYSTEM
  15. جຊํ਑ w σʔλج൫ͷ෼ྨͱਐԽతσʔλϞσϦϯά w %"5"-",&ˠ%"5"8"3&)064&ˠ%"5"."35 w σʔλ͸#JH2VFSZʹ஝ੵ͠ɺۃྗ42-Λ࢖͏ɻ42-Ͱ͸೉͍͠ͱ͜Ζ͚ͩ$MPVE %BUBQSPDʢ4QBSLʣΛ࢖࣮ͬͯ૷͢Δɻ w ϫʔΫϑϩʔ͸$MPVE$PNQPTFSʢ"JSqPXʣͰ؅ཧ͢Δɻ

  16. #JH2VFSZͷ̏֊૚ w %"5"-",& w ΫϥΠΞϯτ͔Βఏڙ͞ΕΔੜσʔλΛ஝ੵ͢Δ w %"5"8"3&)064& w அยԽͨ͠σʔλΛ෮ݩɾඇਖ਼نԽɺ໋໊نଇʹҰ؏ੑΛ΋ͨͤΔɺ/6--Λഉআͯ͠ར༻͠ ΍͘͢͢ΔɺͳͲ

    w ΫϥΠΞϯτͷσʔλ෼ੳνʔϜʹఏڙ͢Δ͜ͱ΋ w %"5"."35 w $MPVE%BUBMBCͳͲͷ#*πʔϧ͔Βࢀর͢Δ w ఏڙ͍ͯ͠ΔϨίϝϯυγεςϜͷޮՌଌఆͱ͔
  17. γΣϧεΫϦϓτஔ͖׵͑ਐΊͯΈͯ w "JSqPX͕ͲΜͳʹਏͯ͘΋ੲΛࢥ͍ग़ͤ͹ؤுΕΔ w 8FC6*͕͍͖ͭͯͯخ͍͠ w ؆୯ʹ࠶࣮ߦͰ͖ΔΑ͏ʹͳͬͯλεΫͷႈ౳ੑΛҙࣝ͢ΔΑ͏ʹͳͬͨ w ʢσʔλΛอଘ͢Δͱ͜Ζ͔Βίϯαϧ͠ͳ͍ͱμϝͳͷͰ͸ʜʁʣ

  18. ෇࿥υΩϡϝϯτʹࡌͬͯͳ͍
 $MPVE$PNQPTFSͷ஌ݟ

  19. ͚ͬ͜͏(,&LTྗΛٻΊΒΕΔ w $MPVE$PNQPTFS͸(,&ͷ্ʹσϓϩΠ͞ΕΔϑϧϚωʔδυ"JSqPXαʔϏεɻ
 ࠔͬͨ࣌ʹ͸(,&ܦ༝Ͱ"JSqPXʹ઀ଓͯ͠σόοάͨ͠Γ͢Δඞཁ͕͋ΔͷͰɺ
 (,&ͱLTʹ͍ͭͯͷجૅ஌ࣝ͘Β͍͸͍࣋ͬͯͳ͍ͱͭΒ͍ɻ w ͱΓ͋͑ͣLVCFDUMΛηοτΞοϓ͓ͯ͘͠ɻ GKE_CLUSTER="$(gcloud composer environments

    describe $COMPOSER_NAME \ --format='get(config.gkeCluster)')" GKE_LOCATION="$(gcloud composer environments describe $COMPOSER_NAME \ --format='get(config.nodeConfig.location)')" gcloud container clusters get-credentials $GKE_CLUSTER \ --zone $GKE_LOCATION
  20. ϝϞϦ͕଍Γͳ͍ͱ໧ͬͯࢮ͵ w ϩάΛు͔ͣʹλεΫ͕ࣦഊ͢Δͱ͖͸ɺϝϞϦෆ଍ͰBJSqPXXPSLFS͝ͱLTʹࡴ͞Ε ͍ͯΔՄೳੑ͕͋Δɻ w ,VCFSOFUFT&OHJOF8PSLMPBETBJSqPXXPSLFSͰ&WJDUFEͳ1PEΛબ୒͢Δ͜ͱͰ ࢮҼΛ֬ೝՄೳɻ w OTUBOEBSEͩͱ"1*Λݺͼग़͚ͩ͢ͷ1ZUIPO0QFSBUPSͰ΋كʹࡴ͞ΕΔɻ w

    &WJDUFEͳ1PEΛҰ૟͢ΔίϚϯυͰఆظతʹ($͢Δɻ kubectl get pods -l run=airflow-worker \ | grep Evicted \ | awk '{print $1}' \ | xargs kubectl delete pod
  21. 1ZUIPOύοέʔδ͕ඍົʹݹ͍ w $MPVE$PNQPTFSͷBJSqPXXPSLFSʹ͸͍͔ͭ͘ͷ1ZUIPOύοέʔδ͕Πϯετʔϧ͞ Ε͍ͯΔ͕ɺυΩϡϝϯτʹόʔδϣϯ͕ॻ͔Ε͍ͯͳ͍ɻ͔͠΋ඍົʹݹ͍ɻ w 1PEͷதͰ1ZUIPOΛىಈͯ͠௚઀ௐ΂Δͷ͕खͬऔΓૣ͍ɻ

  22. HTVUJMͰσϓϩΠͰ͖Δ w "JSqPX͸ϩʔΧϧϑΝΠϧΛϙʔϦϯά͢Δ͜ͱͰ%"(ఆٛϑΝΠϧΛݕग़͢Δɻ $MPVE$PNQPTFSͰ͸($4όέοτΛ($4'64&ͰϚ΢ϯτ͢Δ͜ͱͰ࣮ݱ͍ͯ͠ Δɻ w ($4ʹϑΝΠϧΛஔ͖͑͢͞Ε͹͍͍ͷͰɺHDMPVE4%,ͷTUPSBHFEBHTJNQPSUίϚϯ υ͸࢖Θͳͯ͘΋͍͍ɻ w $JSDMF$*͔ΒHTVUJMSTZODͰ؆୯ࣗಈσϓϩΠɻ

  23. "JSqPX8FCϖʔδ΁ͷඈͼํ w ϒοΫϚʔΫͰ΋͍͍͕$PNQPTFS&OWJSPONFOUΛ࡞Γ௚͢ͱ63*͕มΘͬͯ͠·͏ɻ ͔ͱ͍ͬͯ$MPVE$POTPMFʹ౎౓ΞΫηε͢Δͷ͸໘౗ɻ w HDMPVE4%,Ͱ63*ΛऔಘͰ͖ΔͷͰɺͦΕΛPQFOίϚϯυʹ౉ͤ͹͍͍ɻ gcloud composer environments describe

    $COMPOSER_NAME \ --format='get(config.airflowUri)’
  24. "JSqPXͷόʔδϣϯΞοϓ͕ਏ͍ w ʹϕʔλػೳͱͯ͠ఏڙ։࢝͞Ε͕ͨɺυΩϡϝϯτͷ௨Γಈ͔ͳ͍ɻ w Ҏલ͸HJUIVCDPN(PPHMF$MPVE1MBUGPSNQZUIPOEPDTTBNQMFTʹೖ͍ͬͯΔ DPQZ@FOWJSPONFOUQZͱ͍͏εΫϦϓτΛ࢖͏Α͏ʹͳ͍ͬͯͨɻ w DPQZ@FOWJSPONFOUQZΛಡΊ͹ҰԠԿΛͲ͜ʹίϐʔ͢Ε͹͍͍ͷ͔෼͔Δ͕ɺ
 "JSqPX͚ͩͰͳ͘LTྗ΋ͳ͍ͱ͔ͳΓݫ͍͠ɻ

  25. $MPVE.FNPSZTUPSFʹ઀ଓͰ͖ͳ͍ w (,&͔Β.FNPSZTUPSFʢ3FEJTʣʹ઀ଓ͢Δʹ͸Ϋϥελ࡞੒࣌ʹ*1ΤΠϦΞεΛ༗ޮ ʹ͠ͳ͚Ε͹͍͚ͳ͍͕ɺ$MPVE$PNQPTFS͸ແޮʹͯ͠࡞ͬͯ͠·͏ɻ
 ແޮͷ৔߹ΫϥελʹJQUBCMFTͷϧʔϧΛ௥Ճ͢Δඞཁ͕͋Δɻ w $MPVE$PNQPTFS͕؅ཧ͍ͯ͠ΔLTΫϥελΛ͋·Γ৮Γͨ͘ͳ͔ͬͨͷͰɺ౿Έ୆ͱ ͳΔ($&ΠϯελϯεΛཱͯͯղܾͨ͠ɻ from redis

    import StrictRedis from sshtunnel import SSHTunnelForwarder with SSHTunnelForwarder((bastion_host, bastion_port), ssh_username="airflow", remote_bind_address=(redis_host, redis_port), local_bind_address=("127.0.0.1", local_port), allow_agent=False): client = StrictRedis(host="127.0.0.1", port=local_port) client.ping()
  26. UFNQMBUF@FYU w ࢦఆͨ͠஋ͰऴΘΔจࣈྻΛϑΝΠϧύεͱͯ͠ղऍ͠ɺ࣮ମΛࢦఆ͞ΕͨϑΝΠϧͷத ਎Λ+JOKBͰϨϯμϦϯάͨ݁͠ՌͰஔ͖׵͑ΔͱΜͰ΋ͳ͍ศརͳػೳɻ w υΩϡϝϯτ͸ແ͍͕͠Εͬͱ#BTF0QFSBUPSʹ࣮૷͞Ε͍ͯΔɻ with open("foo/bar.sql") as f:

    sql = f.read() PythonOperator( template_dict={"sql": sql} # ... ) class SQLTemplateOperator(PythonOperator): template_ext = (".sql",) SQLTemplateOperator( template_dict={"sql": "foo/bar.sql"}, # ... ) IUUQTTUBDLPWFSqPXDPNB
  27. IUUQTXXXqZXIFFMKQDBSFFST We are hiring ͨͷ͍͠Α