Challenges in a Multi-Tenant Environment That Runs Customers' Application Code, and the Road to EKS

shogomuranushi

March 20, 2020

Transcript

  1. Challenges in a Multi-Tenant Environment That Runs Customers' Application Code,
    and the Road to EKS
    Spring AWS Container Festival with Amazon EKS
    ABEJA, Inc.
    Shogo Muranushi

  2. Shogo Muranushi
    ABEJA, Inc.
    - Site Reliability Engineer Tech Lead

  3. Agenda
    • An introduction to our business, as background for the talk
    • How the architecture evolved, and what we devised to make multi-tenancy work

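  6. [Diagram: the ML pipeline and the challenges along it]
    Pipeline stages: data acquisition → data storage → data verification → training-data creation → model design → training → evaluation → deployment → inference → retraining
    Challenges noted on the diagram:
    • Preparing and managing a data warehouse
    • Checking data validity (accuracy)
    • Designing models from scratch
    • Preparing GPU environments and advanced distributed processing
    • Version management for data, models, and results
    • Coping with the fact that, statistically, accuracy starts to drop the moment a model is deployed to production
    • The APIs and load-balancing mechanisms needed to acquire large volumes of data, their preparation, and ensuring security
    • Preparing the tools and people needed to create training data
    • Handing over from the development environment to the production environment
    • Ensuring redundancy and GPU resources, and building the integration process with the edge side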

  7. [Same pipeline diagram and challenges as slide 6]
    Many challenges stand in the way before AI can actually be put to use

  8. Ref: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
    “As the machine learning (ML) community continues to accumulate years of experience with live systems”
    “Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive”

  10. Architecture

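  11. Architecture
    Today the platform is EKS-based, but in the early days of development it was based on ECS.
    We arrived at the current configuration through a lot of trial and improvement.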

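  12. Architecture
    This talk covers what we struggled with and what we devised
    as the architecture evolved to reach that point,
    and what we struggled with and what we devised
    when building a platform on Kubernetes
    in a multi-tenant(?) environment where "the customer's code runs".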

  13. [Premise] What is multi-tenancy?

  15. "Multi-tenant" can mean many different things;
    this is what our environment looks like

  16. Architecture

  17. Datalake

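  18. First-generation architecture of Datalake
    • A service for accumulating raw data
    • Storage is S3
    • The initial version was exposed as a REST API built on API Gateway and Lambda
    • At first it supported only operations such as saving, fetching, and listing files
    • Built with the Serverless Framework
    • Clients were issued a signed URL and uploaded through it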

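  19. First-generation architecture of Datalake
    Good
    • Maintenance-free
    Bad
    • API Gateway was painful
    • Poor local reproducibility, so development efficiency was bad
    • Payload size
    • The Serverless Framework was fairly painful
    • Would every other service also be developed in this style from now on?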

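  20. Second-generation architecture of Datalake
    • AWS API Gateway was too painful, so we built our own API gateway in-house
    • Development efficiency was poor, and because it is a critical endpoint we wanted control during incidents
    • Separating out the API gateway made authentication and routing common
    • Goodbye to a pile of API Gateways
    • Datalake itself is now just Lambda and S3
    • This configuration stays for the time being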

  21. Second-generation architecture of Datalake
    Good
    • Authentication and routing separated out and made common
    • Datalake's scope of responsibility separated
    Bad
    • We have to look after our own API gateway

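  22. Third-generation architecture of Datalake
    • "We want to search and count"
    • Built a DB for metadata search and counting
    • Search is implemented with JSON + GIN indexes on PostgreSQL (Aurora)
    • We made attaching and searching metadata so flexible that the index exploded
    • Cassandra...? Or reducing the flexibility; under consideration
    • S3 Event + SQS + Lambda call the Datalake API
    • The backend could not withstand the load and CW Logs became expensive, so the backend moved from Lambda to ECS
    • "S3 signed URLs are a hassle; we want to upload to Datalake in a single call"
    • Changed the spec so that the API gateway handles the Put to S3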

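  23. Third-generation architecture of Datalake
    Good
    • Signed URLs are no longer needed, improving UX
    • Metadata can now be searched
    Bad
    • It is no longer serverless, so maintenance cost increased
    • The metadata feature is too free-form, so index bloat is painful
    Note: changed to ECS because of backend load and rising CWLogs cost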

  24. Serving

  25. Zeroth-generation architecture of Serving
    • A service that hosts inference APIs
    • Containers
    • The initial version was a prototype that created an ELB + ECS Service with CloudFormation

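  26. Zeroth-generation architecture of Serving
    Good
    • Simple
    • Related resources can be created and deleted in one go
    Bad
    • One ELB per Service is wasteful
    • Creation through CFn takes several minutes, which is slow
    • CFn is asynchronous, so errors are hard to detect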

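  27. First-generation architecture of Serving
    • Changed to 1 load balancer = multiple Services
    • ALB has a hard limit of 100 routing rules
    • Customer APIs keep piling on, so it was obvious this would break down quickly
    • Once again we developed our own gateway (ECS + Lambda + DynamoDB) orz
    • A customer's application errors constantly. Its container restarts over and over and eats up the EBS burst credits
    • An ECS Service is designed to always keep tasks running and has no circuit breaker
    • A circuit breaker for system-caused errors was implemented later, but there is still none for application-caused errors
    • So the restart loop eats up the IO (an overwhelmingly noisy neighbor)
    • We run our own health check (Lambda) and, after a certain number of failures, record it in our gateway's DynamoDB so that traffic is no longer routed there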
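
The health-check rule in the last bullet of slide 27 can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the `RouteTable` class is an in-memory stand-in for the gateway's DynamoDB table, `FAIL_THRESHOLD` is a hypothetical value (the slide only says "a certain number"), and re-enabling an endpoint after a successful check is an assumption.

```python
from dataclasses import dataclass, field

FAIL_THRESHOLD = 3  # hypothetical; the slide does not give the number


@dataclass
class RouteTable:
    """In-memory stand-in for the gateway's routing table (DynamoDB in the talk)."""
    failures: dict = field(default_factory=dict)  # endpoint -> consecutive failures
    disabled: set = field(default_factory=set)    # endpoints excluded from routing

    def record(self, endpoint: str, healthy: bool) -> None:
        """Store one health-check result for an endpoint."""
        if healthy:
            # Assumption: one success resets the counter and restores routing.
            self.failures[endpoint] = 0
            self.disabled.discard(endpoint)
            return
        self.failures[endpoint] = self.failures.get(endpoint, 0) + 1
        if self.failures[endpoint] >= FAIL_THRESHOLD:
            self.disabled.add(endpoint)

    def is_routable(self, endpoint: str) -> bool:
        """The gateway would consult this before forwarding a request."""
        return endpoint not in self.disabled
```

Keeping the decision in the routing table rather than in the scheduler means a crash-looping customer container stops receiving traffic even though ECS keeps trying to restart it.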

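  28. First-generation architecture of Serving
    • Implemented a Blue/Green feature
    • A feature that makes it easy to switch the model an API endpoint points to
    • Problem: ECS tasks could reach the host's IAM role
    • Cut off access to the instance metadata with iptables
    • These days, with the awsvpc network mode, it is as simple as the following
    • ECS_AWSVPC_BLOCK_IMDS: true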

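  29. First-generation architecture of Serving
    Good
    • It is no longer 1 inference API = 1 ELB, so costs went down
    • Blue/Green is now possible
    • EBS burst problem solved
    • EC2 metadata can no longer be accessed
    Bad
    • More and more is self-implemented, so maintenance cost keeps climbing. That said, problems that AWS features do not cover can only be solved by implementing it ourselves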

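  30. Second-generation architecture of Serving
    • ECS and EC2 do not get along well
    • When Auto Scaling kicks in, it terminates EC2 instances without regard for their containers. Please drain them first (has this been fixed by now?)
    • Auto Scaling triggers are basically based on reserved CPU, so scaling by container size takes a lot of work and the calculation logic is tedious
    • Containers are not consolidated even when there is spare capacity
    • Blue/Green deployment for swapping out instances has to be self-implemented
    • Hosting many customer inference APIs makes costs balloon
    • Investigated whether we could make good use of Spot Instances
    • You need to spread across instance types and zones, and switch to On-Demand when Spot Instances are sold out
    • After evaluation, we decided to adopt a service called Spotinst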

  31. For more on Spotinst, see here

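  32. Second-generation architecture of Serving
    Good
    • Using Spotinst covers what ECS lacks
    • Blue/Green deployment is enforced
    • Consolidation efficiency improved, cutting costs by 60-70%
    Bad
    • Applying Spot Instances to the wrong workloads causes unintended shutdowns (Jupyter Notebook, for example)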

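  33. Third-generation architecture of Serving
    • A power user appeared who runs 300 containers in a single Service
    • DynamoDB partitioning became skewed, causing heavy throttling
    • Momentum for Kubernetes was building. By using Kubernetes:
    • API Gateway-equivalent functionality can be replaced with Ambassador (Envoy)
    • However, when there are many Services/Pods, updating Envoy's routing rules can take tens of seconds
    • Service discovery and health checks can be left to standard features
    • The Service environment also moves to Kubernetes, with the aim of reducing the self-implemented parts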

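  34. Third-generation architecture of Serving
    Good
    • Health checks and service discovery moved from our own implementation to standard Kubernetes features
    • Gateway functionality moved to a Kubernetes extension (Ambassador)
    Bad
    • You now need to understand Ambassador (Envoy)
    • It is better thought out in many respects than our own implementation, so maintenance cost falls, but customizability drops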

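  35. Third-generation architecture of Serving
    • Current challenges
    • We do not know how to measure outbound transfer volume per customer Pod
    • GKE apparently has this. Time for Istio...?
    • To visualize cost per customer, we have to hook a huge number of Kubernetes events
    • GKE seems to do this with something called usage metering, so we would love EKS to offer it too. I think it is also needed for multi-tenancy across business units
    • The difficulty of building our own system on top of Kubernetes' eventually consistent behavior
    • For example:
    • A Pod's STATUS becomes Running. You expect it to be reachable, but Ready (readinessProbe) is 0/1
    • Ready becomes 1/1. You expect it to be reachable, but some Ambassador (Envoy) Pods can reach it while others cannot because they are still waiting for an update. Because this is eventual consistency in a distributed system, it is hard to tell users "when" it became usable
    • Each feature guarantees only eventual consistency, so we have to fill in the gaps where we want them to work in lockstep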
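
One way to picture the "when is it actually usable?" problem on slide 35 is to treat every gateway replica as a separate probe and only report success once all of them can route. The sketch below is an illustration under stated assumptions, not the platform's code: `rounds_until_all_routable` and the probe callables are hypothetical names, and a real poller would sleep between rounds.

```python
from typing import Callable, Iterable, Optional


def rounds_until_all_routable(
    probes: Iterable[Callable[[], bool]],
    max_rounds: int = 30,
) -> Optional[int]:
    """Poll every probe each round; return the first round in which all succeed.

    Each probe stands for "can this gateway replica reach the new endpoint?".
    Returns None if the replicas never agree within max_rounds.
    """
    probes = list(probes)
    for round_no in range(1, max_rounds + 1):
        # Build the full list first so that every replica is probed each round.
        results = [probe() for probe in probes]
        if all(results):
            return round_no
        # A real implementation would wait here (e.g. time.sleep) before retrying.
    return None
```

The point of the sketch is that "Ready 1/1" on the Pod is only one of several independent signals; the user-facing "ready" has to be derived from all of them.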

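  36. Third-generation architecture of Serving
    • Current challenges
    • IP address exhaustion
    • Each instance uses 20 IPs for Serving and 30 for Training. When 400 Services are created, 20 * 400 = 8,000 IPs are in use. Unbelievably, a /16 subnet was exhausted
    • Hard to be declarative
    • We deploy via an SDK, so we are not in a (declarative) state where "re-applying the yaml reproduces it". Sad
    • Hard to monitor
    • It is hard to separate user-caused problems from platform problems
    • Some pods are dead for customer-caused reasons. Not everything is running normally
    • We cannot monitor 5xx/4xx at our own gateway. The backends depend on the customer, so 5xx is entirely plausible
    • Classic microservices problems
    • LBs/proxies are stacked in multiple tiers, so investigation is hard. It is hard to decide how much log and metric data to expose to customers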
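
The IP arithmetic on slide 36 can be reproduced in a few lines. The 20 IPs per instance, the 400 Services, and the /16 size come from the slide; the concrete CIDR `10.0.0.0/16` is a hypothetical placeholder, and the calculation follows the slide in counting one instance's worth of IPs per Service.

```python
import ipaddress

IPS_PER_SERVING_INSTANCE = 20  # from the slide
SERVICES = 400                 # from the slide

serving_ips = IPS_PER_SERVING_INSTANCE * SERVICES

# Hypothetical CIDR; only the /16 prefix length is taken from the slide.
subnet = ipaddress.ip_network("10.0.0.0/16")

print(serving_ips)                                   # 8000
print(subnet.num_addresses)                          # 65536
print(f"{serving_ips / subnet.num_addresses:.1%}")   # 12.2%
```

The 8,000 Serving addresses are only part of the picture: the slide does not itemize the remaining consumption (Training at 30 IPs per instance, and anything else in the subnet), so this sketch shows the scale of one contributor rather than the full exhaustion.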

  37. Training

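  38. First-generation architecture of Training
    • The platform that runs training jobs
    • Implemented on Kubernetes from the initial version
    • Because implementing things like Jobs and Pods (a container for saving artifacts) ourselves on ECS would have been inefficient
    • But EKS did not exist yet, so it ran on EC2
    • Customer containers and management containers are placed in separate Pods
    • The container that automatically saves the trained model to S3 and the container that collects logs should not live alongside user code from an IAM-permissions standpoint, so they run agent-style as separate Pods
    • IAM roles are attached per Pod with kube2iam (is there an official mechanism now?)
    • privileged stays off no matter what
    • GPU drivers are a struggle, but we push through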

  39. First-generation architecture of Training
    Good
    • No more reinventing the wheel on ECS
    Bad
    • Kubernetes on EC2 is quite painful to operate

  40. Second-generation architecture of Training
    • We want Jupyter Notebook
    • We want to see TensorBoard
    • We want to share datasets and results between Jobs and between Notebooks

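  41. Second-generation architecture of Training
    Good
    • Jupyter Notebook and TensorBoard provided
    • Shared storage provided
    Bad
    • Kubernetes on EC2 is quite painful to operate
    • The cost of keeping Jupyter running while it is unused is wasteful
    • Shared storage is expensive and, being NFS, slow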

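  42. Third-generation architecture of Training
    • EKS was released, so we want to migrate to it
    • We want Jupyter to be shut down automatically when it is not being used
    • EFS costs balloon, so we want to move data back to S3 automatically and delete the EFS contents automatically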
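
The "shut Jupyter down when it is not being used" requirement on slide 42 boils down to an idle check. A minimal sketch, assuming the platform can obtain a last-activity timestamp per notebook; `notebooks_to_stop`, the notebook names, and the one-hour `IDLE_LIMIT` are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta
from typing import Dict, List

IDLE_LIMIT = timedelta(hours=1)  # hypothetical threshold, not from the talk


def notebooks_to_stop(last_activity: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the notebooks whose last activity is at least IDLE_LIMIT old."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen >= IDLE_LIMIT
    )
```

A periodic job could call this and stop the returned notebooks; how activity is detected and how a notebook is stopped are left out because the slides do not describe them.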

  43. Third-generation architecture of Training
    Good
    • Migrating to EKS reduced the operational load
    • Writing from EFS to S3 and restoring from it cut costs
    • Automatically stopping Jupyter cut costs
    Bad
    • EFS is NFS, so it is slow

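  44. Third-generation architecture of Training
    • Current challenges
    • Revenue and cost of goods in the annual plan
    • These need to be made more precise, but cost depends on the customers, so we honestly have no idea how to forecast it. How does AWS manage this?
    • Bugs
    • There were times when a p3.16xlarge failed to stop and cost around "P" × 10,000 yen (the figure appears only as "P" in the transcript)
    • OS
    • The OS must be one that nvidia-driver supports, which means Ubuntu or Amazon Linux 2. Hoping for nvidia-driver support in "Bottlerocket"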

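  45. Third-generation architecture of Training
    • Current challenges
    • Dependencies between containers
    • To collect logs without gaps, we use Affinity to start Pods in the priority order log collector pod -> platform agent pod -> training job. This delays the firing of the insufficient-resources Event, which delays the notification to the Autoscaler, so instances start up late
    • Compatibility and support for Docker images and packages
    • Keeping the Docker images we provide compatible is hard
    • Support targets
    • Kinds of DL library x version x Python version x CUDA version ...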
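
The support matrix in the last bullet of slide 45 grows multiplicatively, which a short enumeration makes concrete. Every value below is a hypothetical placeholder (the slides do not list the supported libraries or versions); only the shape of the product comes from the slide.

```python
from itertools import product

# All names and versions are hypothetical placeholders, not ABEJA's support list.
frameworks = {"framework-a": ["1.0", "2.0"], "framework-b": ["1.4"]}
python_versions = ["3.6", "3.7"]
cuda_versions = ["10.0", "10.1"]

# One image per (library, library version, Python version, CUDA version).
images = [
    (name, version, py, cuda)
    for name, versions in frameworks.items()
    for version, py, cuda in product(versions, python_versions, cuda_versions)
]

print(len(images))  # 12
```

Even this toy matrix yields 12 images to build and keep compatible; adding one more Python or CUDA version multiplies the count rather than adding to it.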

  46. Trigger

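  47. First-generation architecture of Trigger
    • The platform that runs inference jobs
    • Fires when data is put into Datalake
    • Events are queued first via S3->SNS->SQS
    • A Subscriber pulls from the SQS queue and submits tasks to AWS Batch
    • Instances stop automatically when the job finishes
    • In practice, hundreds of jobs turned out to be submitted per minute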

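  48. First-generation architecture of Trigger
    Good
    • SQS absorbs sudden load spikes
    • Subscribers scale with the queue
    • With AWS Batch you just submit the job
    Bad
    • Slow startup, no metrics, and logs are aggregated so they cannot be viewed per Job
    • Our workload did not fit the AWS Batch use case
    • Network transfer volume between AZs was alarming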

  49. Logging for Customer

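  50. First-generation architecture of Logging for Customer
    • The logging platform we provide to customers
    • Provides logs from Serving, Training, and Trigger
    • Logging is not our core competence, so we want to avoid operating it ourselves as far as possible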

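  51. First-generation architecture of Logging for Customer
    Good
    • Serverless (CWLogs / Kinesis / Lambda / DynamoDB), so no operational management is needed
    Bad
    • When log volume spikes, throttling occurs in Lambda or DynamoDB
    • CWLogs is quite expensive considering we barely use it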

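  52. Second-generation architecture of Logging for Customer
    • Changed the logging backend to Datadog Logs
    • Service and Training logs are stored in Datadog Logs
    • Logs are fetched from the Datadog Logs API and provided to customers
    • We considered Elasticsearch, but:
    • At the time of evaluation, log volume was 500 million records/month
    • Given business growth and plans to log more targets, volume doubles every three months
    • We did not want to operate an Elasticsearch cluster that keeps doubling, and since logging is not our core competence we did not want to spend operational cost on it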
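
The growth figures on slide 52 can be turned into a quick projection. The 500 million records/month starting point and the three-month doubling period are from the slide; `projected_records` is a hypothetical helper, and treating the doubling as a step at the end of each three-month period is a simplifying assumption.

```python
RECORDS_PER_MONTH = 500_000_000  # 500 million records/month, from the slide
DOUBLING_PERIOD_MONTHS = 3       # from the slide


def projected_records(months_ahead: int) -> int:
    """Monthly record volume after months_ahead, doubling once per full period."""
    doublings = months_ahead // DOUBLING_PERIOD_MONTHS
    return RECORDS_PER_MONTH * 2 ** doublings


print(projected_records(12))  # 8000000000
print(projected_records(24))  # 128000000000
```

Under these assumptions the volume reaches 8 billion records/month after one year and 128 billion after two, which illustrates why a self-operated search cluster looked unattractive.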

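  53. Second-generation architecture of Logging for Customer
    Good
    • Fully managed, so operations-free
    Bad
    • It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves
    • We are constrained by Datadog's specifications
    • Eventual consistency, no ordering guarantee
    • datadog-agent's behavior and specifications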

  54. Logging for System & Application

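  55. First-generation architecture of Logging for System & Application
    • The platform for collecting the system and application logs we use internally
    • Logging is not our core competence, so we want to avoid operating it ourselves as far as possible
    • We mainly used ECS and Lambda, so logs were already stored in CloudWatch Logs
    • So we simply used CW Logs for the time being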

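  56. First-generation architecture of Logging for System & Application
    Good
    • No need to think about infrastructure
    Bad
    • You cannot tell where anything is stored
    • Zero searchability. Your eyeball-grep skills improve
    • Investigating logs across microservices is nothing but pain
    • As a result, nobody looks at logs unless there is a problem
    • CWLogs is surprisingly expensive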

  57. Second-generation architecture of Logging for System & Application
    • Stopped forwarding to CWLogs and consolidated everything in Datadog Logs

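  58. Second-generation architecture of Logging for System & Application
    Good
    • No need to think about infrastructure
    • Everything can be searched in one place
    • Tags and Attributes can be indexed and made searchable
    • You can see at a glance how many entries each field has
    • Because it is microservices, we embed things like tracing IDs to make requests easier to follow
    • Usage broadened: analyzing API trends, checking the scope of impact, and so on
    Bad
    • It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves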

  59. And that brings us to the present

  60. Conclusion
    • AWS, which builds platforms that run customers' application code, is amazing

  61. That's all