
Challenges of a Multi-Tenant Environment That Runs Customers' Application Code, and the Road to EKS


shogomuranushi

March 20, 2020

Transcript

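  1. Challenges of a Multi-Tenant Environment That Runs Customers' Application Code, and the Road to EKS
    Spring AWS Container Festival with Amazon EKS (春のAWS コンテナ祭り with Amazon EKS)
    ABEJA, Inc. Shogo Muranushi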
  2. Shogo Muranushi ABEJA, Inc. - Site Reliability Engineer Tech Lead

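  3. Agenda
    • An introduction to the business that forms the background of this talk
    • How the architecture evolved, and the techniques we used for multi-tenancy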

  4. None
  5. None
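  6. Pipeline stages: data acquisition, data storage, data verification, training-data creation, model design, training, evaluation, deployment, inference, retraining
    Challenges along the pipeline:
    • Preparing and managing a data warehouse
    • Verifying that the data is valid (accurate)
    • Designing models from scratch
    • Preparing GPU environments and advanced distributed processing
    • Version management of data, models, and results
    • Accounting for the fact that, statistically, accuracy starts to drop the moment a model is deployed to production
    • Preparing the APIs and load-balancing mechanisms needed to acquire large volumes of data, and ensuring security
    • Preparing the tools and people needed to create training data
    • Handing over from the development environment to production
    • Ensuring redundancy and GPU resources, and building the integration process with the edge side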
  7. Same pipeline stages and challenges as slide 6, with the takeaway: many challenges stand between you and actually putting AI to use
  8. Ref: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
    “As the machine learning (ML) community continues to accumulate years of experience with live systems”
    “Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive”
  9. None
  10. Architecture

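  11. Architecture: today the platform is EKS-based, but early in development it was ECS-based, and we reached the current configuration through a lot of trial and refinement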

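  12. Architecture: this talk covers what was hard and what we devised while the architecture evolved to that point, and what was hard and what we devised when building a platform on Kubernetes in a multi-tenant (?) environment where "the customer's code runs"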

  13. [Premise] What is multi-tenancy?

  14. None
  15. "Multi-tenancy" can mean many things; this is what our environment looks like

  16. Architecture

  17. Datalake

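  18. First-generation architecture of Datalake
    • A service for accumulating raw data
    • Storage is S3
    • The initial version was provided as a REST API built on API Gateway and Lambda
    • At first it supported only operations such as saving, retrieving, and listing files
    • Built with the Serverless Framework
    • Uploads worked by issuing a Signed URL for the client to upload to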
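  19. First-generation architecture of Datalake
    Good
    • Maintenance-free
    Bad
    • API Gateway was painful
    • Hard to reproduce locally, so development efficiency was poor
    • Payload size limits
    • The Serverless Framework was fairly painful
    • Would we really develop other services in the same style from here on?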
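  20. Second-generation architecture of Datalake
    • AWS API Gateway was a struggle, so we built our own API gateway in-house
    • Development efficiency was poor, and because this is a critical endpoint we wanted to be in control during incidents
    • With the API gateway separated out, authentication and routing became shared
    • Goodbye to having many API Gateways
    • Datalake itself became just Lambda and S3
    • We stayed on this configuration for the time being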
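To make the "shared authentication and routing" idea on slide 20 concrete, here is a minimal sketch of a gateway that checks a token once and then forwards by longest path prefix. The slides do not show the in-house gateway's code, so the `Gateway` class, the token set, and the route names are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

Handler = Callable[[str], str]


class Gateway:
    """Hypothetical gateway: one shared auth check, then prefix routing."""

    def __init__(self, valid_tokens: Iterable[str]) -> None:
        self._valid_tokens = set(valid_tokens)
        self._routes: Dict[str, Handler] = {}

    def register(self, prefix: str, handler: Handler) -> None:
        # Each backend service registers the path prefix it owns.
        self._routes[prefix] = handler

    def handle(self, path: str, token: str) -> Tuple[int, str]:
        # Authentication happens here once, not in every backend.
        if token not in self._valid_tokens:
            return 401, "unauthorized"
        # Longest prefix wins, so /datalake/v2 can override /datalake.
        for prefix in sorted(self._routes, key=len, reverse=True):
            if path.startswith(prefix):
                return 200, self._routes[prefix](path)
        return 404, "no route"


gateway = Gateway(valid_tokens={"secret-token"})
gateway.register("/datalake", lambda path: "datalake backend")
```

With this shape, a backend such as Datalake only has to implement its own handlers, which matches the slide's point that its responsibility shrank to Lambda and S3.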
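  21. Second-generation architecture of Datalake
    Good
    • Authentication and routing were separated out and shared
    • Datalake's scope of responsibility was separated
    Bad
    • We now have to look after our own API gateway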
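  22. Third-generation architecture of Datalake
    • "We want to search and count"
    • Built a database for metadata search and count features
    • Search was implemented with JSON + GIN Index on PostgreSQL (Aurora)
    • We made attaching and searching metadata so flexible that the index exploded
    • Considering Cassandra...? Or reducing the flexibility
    • S3 Event + SQS + Lambda call the Datalake API
    • The backend could not withstand the load and CW Logs became expensive, so the backend moved from Lambda to ECS
    • "S3 Signed URLs are a hassle; we want to upload to Datalake in a single call"
    • Changed the spec so that the API gateway handles the Put to S3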
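  23. Third-generation architecture of Datalake
    Good
    • Signed URLs are no longer needed, which improved the UX
    • Metadata search became possible
    Bad
    • It is no longer serverless, so maintenance costs went up
    • The metadata feature is too free-form, so the index bloats painfully
    Moved to ECS because of backend load and rising CWLogs costs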
  24. Serving

  25. Zeroth-generation architecture of Serving
    • A service that hosts inference APIs
    • Containers
    • The initial version was a prototype that created an ELB + ECS Service with CloudFormation
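  26. Zeroth-generation architecture of Serving
    Good
    • Simple
    • Related resources can be created and deleted in one go
    Bad
    • One ELB per Service is wasteful
    • Creation through CFn takes several minutes, which is slow
    • CFn is asynchronous, so detecting errors is hard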
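  27. First-generation architecture of Serving
    • Changed to 1 load balancer = multiple Services
    • ALB routing rules have a hard limit of 100
    • Customer APIs keep being added, so it was obvious this would break down quickly
    • Once again we developed our own gateway (ECS + Lambda + DynamoDB) orz
    • Customer applications were erroring constantly; containers restarted over and over and ate up the EBS burst credits
    • ECS Services are designed to always keep tasks running, with no circuit breaker
    • A circuit breaker for system-caused errors was implemented later, but there is still none for application-caused errors
    • So the restarts repeat and consume all the IO (an overwhelming noisy neighbor)
    • We run our own health checks (Lambda) and, after a set number of failures, record in our gateway's DynamoDB that the service should not be routed to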
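The health-check logic at the end of slide 27 can be sketched roughly as follows. This is an illustration only: an in-memory `RoutingTable` stands in for the gateway's DynamoDB table, `FAIL_THRESHOLD` is a hypothetical value (the slide says only "a set number of failures"), and closing the circuit again after one healthy probe is an assumption the slide does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

FAIL_THRESHOLD = 3  # hypothetical; the slide does not give the number


@dataclass
class RoutingTable:
    """In-memory stand-in for the gateway's DynamoDB table."""

    failures: Dict[str, int] = field(default_factory=dict)
    open_circuits: Set[str] = field(default_factory=set)

    def record(self, service: str, healthy: bool) -> None:
        if healthy:
            # Assumption: a single healthy probe re-enables routing.
            self.failures[service] = 0
            self.open_circuits.discard(service)
            return
        self.failures[service] = self.failures.get(service, 0) + 1
        if self.failures[service] >= FAIL_THRESHOLD:
            self.open_circuits.add(service)

    def is_routable(self, service: str) -> bool:
        return service not in self.open_circuits


def run_health_checks(table: RoutingTable,
                      probes: Dict[str, Callable[[], bool]]) -> None:
    """One pass of the periodic checker (a Lambda in the slides)."""
    for service, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        table.record(service, healthy)
```

The gateway would consult `is_routable` before forwarding, so a customer container stuck in a crash loop stops receiving traffic instead of being retried indefinitely.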
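  28. First-generation architecture of Serving
    • Implemented a Blue/Green feature
    • Lets you easily switch which model an API endpoint points to
    • Problem: ECS tasks could reach the host's IAM role
    • Cut off access to the metadata endpoint with iptables
    • These days, with the awsvpc network mode, it is as simple as the following
    • ECS_AWSVPC_BLOCK_IMDS: true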
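As a rough picture of the Blue/Green feature on slide 28, the sketch below keeps several deployed model versions per endpoint and flips a pointer to choose the live one. The `EndpointRouter` class and its method names are hypothetical; the slides do not describe how the real switch is stored or applied.

```python
from typing import Callable, Dict, Tuple

ModelHandler = Callable[[str], str]


class EndpointRouter:
    """Hypothetical endpoint-to-model pointer used for Blue/Green switching."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, str], ModelHandler] = {}
        self._live: Dict[str, str] = {}  # endpoint -> live version

    def deploy(self, endpoint: str, version: str, handler: ModelHandler) -> None:
        # Deploying does not change traffic; it only makes a version available.
        self._models[(endpoint, version)] = handler

    def switch(self, endpoint: str, version: str) -> None:
        # Refuse to point traffic at a version that was never deployed.
        if (endpoint, version) not in self._models:
            raise KeyError(f"{version} is not deployed for {endpoint}")
        self._live[endpoint] = version

    def call(self, endpoint: str, payload: str) -> str:
        version = self._live[endpoint]
        return self._models[(endpoint, version)](payload)
```

Because both versions stay deployed, switching back is the same single pointer update.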
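  29. First-generation architecture of Serving
    Good
    • Costs went down because it is no longer 1 inference API = 1 ELB
    • Blue/Green became possible
    • The EBS burst problem was solved
    • EC2 metadata can no longer be accessed
    Bad
    • In-house implementations kept growing, and so did the maintenance cost. That said, problems that AWS features do not cover can only be solved by building it ourselves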
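  30. Second-generation architecture of Serving
    • ECS and EC2 do not get along well
    • When AutoScaling kicks in, it terminates EC2 instances without regard for the containers. Please drain first (perhaps this has been fixed by now?)
    • AutoScaling triggers are basically based on reserved CPU, so scaling by container size takes a lot of work and the calculation logic is tedious
    • Containers are not consolidated even when there is free capacity
    • Blue/Green Deployment when replacing instances had to be implemented in-house
    • Hosting many customer inference APIs makes costs balloon
    • Looked into whether Spot Instances could be used effectively
    • You need to spread across instance types and zones, and switch to On-Demand when Spot Instances are sold out
    • After evaluation, we decided to adopt a service called Spotinst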
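The Spot requirements listed on slide 30 (spread across instance types and zones, fall back to On-Demand when Spot is sold out) amount to a selection rule like the one below. This is only a sketch of that rule, not how Spotinst works internally; `Pool`, `choose_pool`, and the `has_capacity` callback are hypothetical names, and the zone labels are placeholders.

```python
from typing import Callable, NamedTuple, Sequence


class Pool(NamedTuple):
    instance_type: str
    zone: str
    market: str  # "spot" or "on-demand"


def choose_pool(spot_pools: Sequence[Pool],
                on_demand_pool: Pool,
                has_capacity: Callable[[Pool], bool]) -> Pool:
    """Return the first Spot pool with capacity, else the On-Demand pool."""
    for pool in spot_pools:
        if has_capacity(pool):
            return pool
    # Every Spot pool is sold out: pay On-Demand rather than fail to scale.
    return on_demand_pool


spot_pools = [
    Pool("m5.large", "zone-a", "spot"),
    Pool("c5.large", "zone-b", "spot"),
]
fallback = Pool("m5.large", "zone-a", "on-demand")
```

Listing pools of different instance types and zones is what gives the loop somewhere else to go when one Spot market dries up.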
  31. For more about Spotinst, see here

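  32. Second-generation architecture of Serving
    Good
    • Using Spotinst covers the areas where ECS falls short
    • Blue/Green Deployment is enforced
    • Consolidation efficiency improved, cutting costs by 60-70%
    Bad
    • Applying Spot Instances to the wrong workloads causes unintended shutdowns (Jupyter Notebook, for example)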
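  33. Third-generation architecture of Serving
    • A power user appeared who runs 300 containers in a single Service
    • DynamoDB partitioning became skewed, causing heavy throttling
    • Momentum for Kubernetes was building. By using Kubernetes:
    • The API-gateway-equivalent functionality can be replaced with Ambassador (Envoy)
    • However, with a large number of Services/Pods, updating Envoy's routing rules can take tens of seconds
    • Service Discovery and health checks can be left to standard features
    • Moving the Service environment to Kubernetes as well, with the aim of reducing the in-house implementation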
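  34. Third-generation architecture of Serving
    Good
    • Health checks and service discovery moved from in-house implementations to standard Kubernetes features
    • Gateway functionality moved to a Kubernetes extension (Ambassador)
    Bad
    • Requires understanding Ambassador (Envoy)
    • Compared with the in-house implementation it is better thought out in many respects and lowers maintenance cost, but it is less customizable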
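  35. Third-generation architecture of Serving
    • Current challenges
    • We do not know how to measure outbound transfer volume for each customer's Pods
    • GKE apparently has this. Time for Istio...?
    • To visualize cost per customer, we have to hook a huge number of Kubernetes events
    • GKE seems to be able to do this with something called usage metering, so we would love to see it on EKS too. We think it is also needed for multi-tenancy across internal business units
    • The difficulty of building our own system on top of Kubernetes' eventually consistent behavior
    • For example:
    • A Pod's STATUS becomes Running. You expect it to be reachable, but Ready (readinessProbe) is 0/1
    • Ready becomes 1/1. You expect it to be reachable, but while some Ambassador (Envoy) Pods can reach it, others are still waiting for an update and cannot. Because this is eventual consistency in a distributed system, it is hard to tell the user from "when" the service became usable
    • Each feature guarantees eventual consistency on its own, so we have to fill the gaps where we want them to act together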
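One way to bridge the gap described on slide 35 is to report "usable" only once every independent condition holds at the same time. The helper below is a generic polling sketch under that assumption, not the platform's actual logic: `wait_until_serving` and the individual checks (for example "the Pod is Ready" and "every gateway Pod has picked up the route") are hypothetical callables that the caller supplies.

```python
import time
from typing import Callable, Sequence


def wait_until_serving(checks: Sequence[Callable[[], bool]],
                       timeout_s: float = 60.0,
                       interval_s: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until all checks pass together, or give up at the timeout."""
    deadline = clock() + timeout_s
    while True:
        # Each component converges on its own schedule, so re-evaluate
        # every check on every pass rather than caching earlier results.
        if all(check() for check in checks):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Only after this returns `True` would the platform tell the user the endpoint is available; a `False` result is the point at which to surface a "still converging" or error state.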
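  36. Third-generation architecture of Serving
    • Current challenges
    • IP address exhaustion
    • Per instance, Serving uses 20 IPs and Training uses 30. When 400 Services were created, 20 * 400 = 8,000 IPs were in use. To our surprise, the /16 subnet ran out
    • Hard to make declarative
    • We deploy through an SDK, so we are not in a (declarative) state where "re-applying the yaml reproduces everything". Sad
    • Hard to monitor
    • It is difficult to separate user-caused problems from platform problems
    • Some pods are dead for customer-caused reasons. Not everything is running normally
    • We cannot monitor 5xx/4xx at our custom gateway. Because the backends depend on the customer, 5xx responses are entirely plausible
    • A familiar microservices story
    • LBs/proxies are layered several deep, so investigation is hard. It is hard to decide how much of the logs and metrics to expose to customers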
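The IP arithmetic on slide 36 can be checked with the standard `ipaddress` module. The slide does not say how the /16 range was divided into subnets, so the /19 below is a purely hypothetical subnet size chosen for illustration; the five reserved addresses per subnet reflect how AWS VPC subnets behave.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every VPC subnet


def ips_needed(ips_per_service: int, services: int) -> int:
    """Total IPs consumed when every Service takes the same number."""
    return ips_per_service * services


def usable_ips(cidr: str) -> int:
    """Addresses left for workloads in one AWS subnet of the given CIDR."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


serving_ips = ips_needed(20, 400)        # the slide's 20 * 400 Services
hypothetical_subnet = "10.0.0.0/19"      # assumption, not from the slides
headroom = usable_ips(hypothetical_subnet) - serving_ips
```

Running the same calculation with the Training figure of 30 IPs added on top shows how quickly the remaining headroom disappears.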
  37. Training

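  38. First-generation architecture of Training
    • The platform that runs training jobs
    • Implemented on Kubernetes from the very first version
    • Because re-implementing things like Job and Pod (with a container for saving artifacts) ourselves on ECS would have been inefficient
    • But EKS did not exist yet, so we ran Kubernetes on EC2
    • Customer containers and management containers were put in separate Pods
    • The container that automatically saves the trained model to s3 and the container that collects logs should not live alongside user code for IAM permission reasons, so they run as separate agent-style Pods
    • Used kube2iam to attach an IAM role per Pod (perhaps there is an official mechanism now?)
    • privileged stays off no matter what
    • GPU driver handling is a struggle, but we push through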
  39. First-generation architecture of Training
    Good
    • No longer need to reinvent the wheel on ECS
    Bad
    • Operating Kubernetes on EC2 is quite painful
  40. Second-generation architecture of Training
    • We want Jupyter Notebook
    • We want to see TensorBoard
    • We want to share datasets and results between Jobs and between Notebooks

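  41. Second-generation architecture of Training
    Good
    • Provides Jupyter Notebook and TensorBoard
    • Provides shared storage
    Bad
    • Operating Kubernetes on EC2 is quite painful
    • The running cost of Jupyter while it sits unused is wasted
    • The shared storage is expensive and, being NFS, slow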
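  42. Third-generation architecture of Training
    • EKS was released, so we want to migrate onto it
    • We want Jupyter to be shut down automatically when it is not in use
    • EFS costs balloon, so we want to move data back to S3 automatically and automatically delete the contents of EFS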
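The "shut Jupyter down when unused" requirement on slide 42 boils down to comparing each notebook's last activity with an idle limit. The sketch below shows only that decision; how last-activity times are collected and how a notebook is actually stopped are not described in the slides, and `IDLE_LIMIT` and `notebooks_to_stop` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

IDLE_LIMIT = timedelta(hours=1)  # hypothetical threshold


def notebooks_to_stop(last_activity: Dict[str, datetime],
                      now: datetime,
                      idle_limit: timedelta = IDLE_LIMIT) -> List[str]:
    """Return the names of notebooks idle for at least idle_limit."""
    return sorted(
        name
        for name, seen in last_activity.items()
        if now - seen >= idle_limit
    )
```

A periodic job would call this with the current time and stop whatever it returns, so an unused notebook only costs money for at most one idle window.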

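  43. Third-generation architecture of Training
    Good
    • Operational load went down thanks to the EKS migration
    • Costs went down by writing out from EFS to S3 and restoring from it
    • Costs went down by stopping Jupyter automatically
    Bad
    • EFS is NFS, so it is slow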
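  44. Third-generation architecture of Training
    • Current challenges
    • Revenue and cost of goods in the annual plan
    • These need to be made more precise, but since usage depends on the customer, we honestly have no idea how to forecast cost. We wonder how AWS manages this
    • Bugs
    • There was a time when a p3.16xlarge did not stop and ran up a bill on the order of "P" × 10,000 yen
    • OS
    • The OS must be one that nvidia-driver supports, which means Ubuntu or Amazon Linux2. We are hoping for nvidia-driver support in "Bottlerocket"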
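  45. Third-generation architecture of Training
    • Current challenges
    • Dependencies between containers
    • To collect logs without gaps, we use Affinity to start Pods in the priority order log collector pod -> platform agent pod -> training job. As a result, the Event signalling insufficient resources fires late, the Autoscaler is notified late, and instance startup is delayed
    • Compatibility and support for Docker images and packages
    • Maintaining compatibility of the Docker images we provide is hard
    • What has to be supported
    • DL library type x version x Python version x CUDA version ...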
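The support matrix at the end of slide 45 grows multiplicatively, which a few lines make visible. The library names and version strings below are placeholders for illustration only; they are not the platform's actual support list, and not every combination shown would be valid in practice.

```python
from typing import Dict, List, Sequence, Tuple

Combination = Tuple[str, str, str, str]


def support_matrix(libraries: Dict[str, Sequence[str]],
                   pythons: Sequence[str],
                   cudas: Sequence[str]) -> List[Combination]:
    """Enumerate every (library, version, Python, CUDA) combination."""
    return [
        (library, version, python, cuda)
        for library, versions in libraries.items()
        for version in versions
        for python in pythons
        for cuda in cudas
    ]


# Placeholder inputs: two libraries with two versions each.
example = support_matrix(
    libraries={"tensorflow": ["1.15", "2.1"], "pytorch": ["1.3", "1.4"]},
    pythons=["3.6", "3.7"],
    cudas=["10.0", "10.1"],
)
image_count = len(example)
```

Even this small example yields sixteen images to build and test, and adding one more Python or CUDA version multiplies the total rather than adding to it.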
  46. Trigger

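  47. First-generation architecture of Trigger
    • The platform that runs inference jobs
    • Fires when data is put into Datalake
    • Messages are queued first via S3->SNS->SQS
    • A Subscriber pulls from the SQS queue and submits tasks to AWS Batch
    • When a job finishes, the instance stops automatically
    • In practice, hundreds of jobs were being submitted per minute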
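The queue-then-submit flow on slide 47 can be pictured with a local queue standing in for SQS and a callback standing in for the AWS Batch submission. This is a shape-only sketch: `drain` and `submit_job` are hypothetical names, and no AWS API is called.

```python
import queue
from typing import Callable


def drain(messages: "queue.Queue[str]",
          submit_job: Callable[[str], None],
          max_messages: int = 10) -> int:
    """Take up to max_messages object keys off the queue and submit a job for each."""
    submitted = 0
    while submitted < max_messages:
        try:
            object_key = messages.get_nowait()
        except queue.Empty:
            break  # queue is drained; the subscriber can idle or scale in
        submit_job(object_key)
        submitted += 1
    return submitted
```

Capping each pass with `max_messages` is what lets the queue absorb a burst: the backlog waits in the queue while more subscribers are started, instead of every upload triggering a job immediately.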
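  48. First-generation architecture of Trigger
    Good
    • SQS absorbs sudden spikes in load
    • Subscribers scale with the number of messages in the queue
    • With AWS Batch, you only have to submit the job
    Bad
    • Slow startup, no metrics, and logs are aggregated so they cannot be viewed per Job
    • Our workload did not fit the AWS Batch use case
    • The volume of network transfer between AZs was alarming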
  49. Logging for Customer

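  50. First-generation architecture of Logging for Customer
    • The logging platform we provide to customers
    • Provides logs from Serving, Training, and Trigger
    • Logging is not our core competence, so we want to avoid operating it ourselves as far as possible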
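  51. First-generation architecture of Logging for Customer
    Good
    • Serverless components (CWLogs / Kinesis / Lambda / DynamoDB), so no operational management needed
    Bad
    • When log volume spikes, throttling occurs in Lambda or DynamoDB
    • CWLogs is fairly expensive considering we barely use it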
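  52. Second-generation architecture of Logging for Customer
    • Changed the logging backend to Datadog Logs
    • Service and Training logs are stored in Datadog Logs
    • Logs are fetched through the Datadog Logs API and provided to customers
    • We considered Elasticsearch, but:
    • At the time of evaluation, log volume was 500 million records/month
    • Given business growth and plans to log more sources, volume doubles every 3 months
    • We did not want to operate an Elasticsearch that keeps doubling, and since logging is not our core competence we did not want to spend operational cost on it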
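The growth assumption on slide 52 (500 million records per month, doubling every 3 months) compounds quickly, as a short projection shows. The function name `projected_records` is hypothetical; the two input figures come from the slide.

```python
def projected_records(initial_per_month: float,
                      months: float,
                      doubling_period_months: float = 3.0) -> float:
    """Monthly record volume after `months`, doubling every doubling period."""
    return initial_per_month * 2 ** (months / doubling_period_months)


START = 500_000_000  # 500 million records/month, from the slide
after_one_year = projected_records(START, 12)   # four doublings
after_two_years = projected_records(START, 24)  # eight doublings
```

Four doublings in a year is a 16x increase, which is the kind of capacity planning the team preferred to hand to a managed service.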
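  53. Second-generation architecture of Logging for Customer
    Good
    • Fully managed, so operations-free
    Bad
    • It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves
    • We are constrained by Datadog's specifications
    • Eventual consistency, no ordering guarantee
    • The behavior and specifications of datadog-agent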
  54. Logging for System & Application

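  55. First-generation architecture of Logging for System & Application
    • The collection platform for system logs and application logs used internally
    • Logging is not our core competence, so we want to avoid operating it ourselves as far as possible
    • We mainly used ECS and Lambda, so logs were already being stored in CloudWatch Logs
    • For the time being, we simply used CW Logs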
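  56. First-generation architecture of Logging for System & Application
    Good
    • No need to think about infrastructure
    Bad
    • Hard to tell where anything is stored
    • Essentially no searchability. Your skill at grepping by eye improves
    • Investigating logs across microservices is nothing but pain
    • As a result, logs only get looked at when there is a problem
    • CWLogs is surprisingly expensive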
  57. Second-generation architecture of Logging for System & Application
    • Stopped forwarding to CWLogs and consolidated everything in Datadog Logs

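  58. Second-generation architecture of Logging for System & Application
    Good
    • No need to think about infrastructure
    • Everything can be searched in one place
    • Tags and Attributes can be indexed and made searchable
    • You can see at a glance how many entries each field has
    • Because it is a microservice architecture, we embed things like a Tracing ID to make requests easier to follow
    • Usage expanded to analyzing API trends and investigating the scope of impact
    Bad
    • It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves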
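Embedding a tracing ID in each log line, as slide 58 mentions, can be done with the standard `logging` module by emitting JSON that a log backend can index by attribute. This is a generic sketch rather than the team's setup: the `trace_id` field name, the `JsonFormatter` class, and the `make_logger` helper are assumptions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so attributes stay searchable."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Supplied per call through `extra`; None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
        })


def make_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


stream = io.StringIO()
log = make_logger("serving-example", stream)
log.info("inference finished", extra={"trace_id": "req-123"})
```

If every service in a request path logs the same `trace_id`, a single attribute query in the log backend returns the whole cross-service trail.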
  59. And that brings us to the present

  60. Conclusion
    • AWS, which builds platforms that run customers' application code, is amazing

  61. That's all