Slide 1

Slide 1 text

Challenges in a multi-tenant environment that runs customers' application code, and the road to EKS
Spring AWS Container Festival with Amazon EKS
ABEJA, Inc. Shogo Muranushi

Slide 2

Slide 2 text

Shogo Muranushi ABEJA, Inc. - Site Reliability Engineer Tech Lead

Slide 3

Slide 3 text

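Agenda
• Introduction to the business that forms the background of this talk
• How the architecture evolved and what we devised for multi-tenancy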

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

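Workflow stages: data acquisition -> data storage -> data validation -> training-data creation -> model design -> training -> evaluation -> deployment -> inference -> retraining
Challenges along the way:
• Building the APIs and load-balancing mechanisms needed to ingest large volumes of data, and ensuring security
• Preparing and managing a data warehouse
• Validating the data (checking its accuracy)
• Preparing the tools and people needed to create training data
• Designing models from scratch
• Preparing GPU environments and advanced distributed processing
• Versioning data, models, and results
• Handing over from the development environment to production
• Ensuring redundancy and GPU resources, and building the process for working with the edge side
• Guarding against the fact that, statistically, accuracy starts to drop the moment a model is deployed to production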

Slide 7

Slide 7 text

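The same workflow diagram, from data acquisition through retraining, with its per-stage challenges (data ingestion APIs and security, data warehouse management, data validation, training-data tooling and staffing, model design from scratch, GPU environments and distributed processing, versioning, hand-over to production, redundancy and edge integration, accuracy decay after deployment).
Takeaway: many challenges stand between you and actually putting AI to use.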

Slide 8

Slide 8 text

Ref: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
"As the machine learning (ML) community continues to accumulate years of experience with live systems"
"Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive."

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Architecture

Slide 11

Slide 11 text

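Architecture
Today the platform is based on EKS, but in the early days of development it was based on ECS.
We arrived at the current configuration through a lot of trial and error.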

Slide 12

Slide 12 text

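Architecture
This talk covers two things:
• what we struggled with and what we devised as the architecture evolved to that point, and
• what we struggled with and what we devised when building a platform on Kubernetes in a multi-tenant (?) environment where "customer code runs".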

Slide 13

Slide 13 text

[Premise] What is multi-tenancy?

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

"Multi-tenancy" can mean many things; this is what our environment looks like.

Slide 16

Slide 16 text

Architecture

Slide 17

Slide 17 text

Datalake

Slide 18

Slide 18 text

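First-generation architecture of Datalake
• A service for accumulating raw data
• Storage is S3
• The initial version was offered as a REST API built on API Gateway and Lambda
• At first it supported only basic operations such as saving, retrieving, and listing files
• Built with the Serverless Framework
• By design, clients uploaded through a signed URL that we issued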

Slide 19

Slide 19 text

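First-generation architecture of Datalake
Good
• Maintenance-free
Bad
• API Gateway was painful
• Local reproducibility was poor, so development was inefficient
• Payload size limits
• The Serverless Framework was fairly painful too
• Would we really develop every future service in the same style?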

Slide 20

Slide 20 text

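Second-generation architecture of Datalake
• AWS's API Gateway was a strain, so we built our own API gateway in-house
  • Development was inefficient, and because this is a critical endpoint we wanted to be in control during incidents
• With the API gateway split out, authentication and routing became shared
  • Goodbye to a pile of API Gateways
• Datalake itself was reduced to Lambda and S3
• We stayed on this configuration for the time being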

Slide 21

Slide 21 text

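Second-generation architecture of Datalake
Good
• Authentication and routing were split out and shared
• Datalake's scope of responsibility was separated
Bad
• We now have to look after our own API gateway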

Slide 22

Slide 22 text

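Third-generation architecture of Datalake
• "We want to search and count"
  • Built a DB for metadata search and counting
  • Search was implemented with JSON + GIN index on PostgreSQL (Aurora)
  • We made metadata tagging and search so flexible that the index exploded
  • Cassandra...? Or reduce the flexibility; under consideration
• The Datalake API is invoked from S3 Event + SQS + Lambda
  • The backend could not withstand the load and CW Logs became expensive, so we moved the backend from Lambda to ECS
• "S3 signed URLs are a hassle; we want to upload to Datalake in a single call"
  • Changed the design so that the API gateway handles the Put to S3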
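To make the "index explosion" point concrete, here is a minimal sketch of an inverted index over free-form JSON metadata. It is only an analogy for why an index over arbitrary keys and values keeps growing, not PostgreSQL's actual GIN implementation; the function names and the sample metadata are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(files):
    """Map every (key, value) pair found in the metadata to the file IDs that carry it."""
    index = defaultdict(set)
    for file_id, metadata in files.items():
        for key, value in metadata.items():
            index[(key, value)].add(file_id)
    return index


def search(index, **conditions):
    """Return the IDs of files whose metadata matches every key=value condition."""
    matches = [index.get(item, set()) for item in conditions.items()]
    return set.intersection(*matches) if matches else set()


# Hypothetical metadata: customers may attach any keys they like.
files = {
    "f1": {"camera": "A", "label": "cat"},
    "f2": {"camera": "A", "label": "dog"},
    "f3": {"camera": "B", "label": "cat", "shop": "tokyo-1"},
}
index = build_inverted_index(files)

print(search(index, camera="A", label="cat"))  # files matching both conditions
print(len(search(index, label="cat")))         # a count query
print(len(index))                              # one entry per distinct (key, value) pair
```

Every new key or value a customer invents adds another entry, which is the trade-off between flexibility and index size that the slide describes.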

Slide 23

Slide 23 text

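Third-generation architecture of Datalake
Good
• Signed URLs are no longer needed, which improved the UX
• Metadata can now be searched
Bad
• It is no longer serverless, so maintenance costs went up
• The metadata feature is too free-form, and the bloated index is painful
Note: the backend was changed to ECS because of backend load and rising CWLogs costs.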

Slide 24

Slide 24 text

Serving

Slide 25

Slide 25 text

Zeroth-generation architecture of Serving
• A service for hosting inference APIs
• Containers
• The initial version was a prototype that created ELB + ECS Service with CloudFormation

Slide 26

Slide 26 text

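Zeroth-generation architecture of Serving
Good
• Simple
• Related resources can be created and deleted in one go
Bad
• One ELB per Service is wasteful
• Creation through CFn takes several minutes, which is slow
• CFn is asynchronous, so errors are hard to detect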

Slide 27

Slide 27 text

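First-generation architecture of Serving
• Changed to 1 load balancer = multiple Services
  • ALB routing rules have a hard limit of 100
  • Customer APIs keep piling on, so we could see this would break down quickly
  • So once again we developed our own gateway (ECS + Lambda + DynamoDB) orz
• Customer applications that error out constantly: containers restart over and over and eat up the EBS burst credits
  • An ECS Service is designed to always keep tasks running, and there is no circuit breaker
  • A circuit breaker for system-caused errors was implemented later, but there is still none for application-caused errors
  • So the restarts repeat and eat up the I/O (an overwhelmingly noisy neighbor)
  • We run our own health checks (Lambda) and, after a certain number of failures, record in our gateway's DynamoDB that the service should not be routed to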
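The home-grown health check in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: a plain dict stands in for the gateway's DynamoDB table, the threshold of 3 is made up (the slide only says "a certain number"), and re-enabling routing on a passing check is also an assumption.

```python
FAIL_THRESHOLD = 3  # assumed value; the slide only says "a certain number of failures"

# Stand-in for the gateway's DynamoDB table: service name -> routing state.
routing_table = {}


def record_health_check(service, healthy):
    """Update the failure counter for a service and stop routing at the threshold."""
    state = routing_table.setdefault(service, {"failures": 0, "routable": True})
    if healthy:
        # Assumption: a passing check resets the counter and re-enables routing.
        state["failures"] = 0
        state["routable"] = True
    else:
        state["failures"] += 1
        if state["failures"] >= FAIL_THRESHOLD:
            state["routable"] = False
    return state


def should_route(service):
    """The gateway consults this before forwarding a request; unknown services are not routed."""
    return routing_table.get(service, {"routable": False})["routable"]


for result in [False, False, False]:
    record_health_check("customer-api", result)
print(should_route("customer-api"))  # routing is cut after three consecutive failures

record_health_check("customer-api", True)
print(should_route("customer-api"))  # a passing check re-enables routing
```

Keeping the state in a table that the gateway reads on each request means the health checker and the gateway do not need to talk to each other directly.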

Slide 28

Slide 28 text

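First-generation architecture of Serving
• Implemented a Blue/Green feature
  • Lets you easily switch which model an API endpoint points to
• The problem of ECS tasks being able to reach the host's IAM role
  • Cut off access to the instance metadata with iptables
  • These days, when using the awsvpc network mode, it is as simple as the following:
  • ECS_AWSVPC_BLOCK_IMDS: true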
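The Blue/Green feature above boils down to keeping a stable endpoint and swapping the model behind it. A minimal sketch, assuming an in-memory mapping; the `EndpointRouter` class, its method names, and the rollback behavior are hypothetical and not taken from the slides.

```python
class EndpointRouter:
    """Maps a stable API endpoint to whichever model version is currently live."""

    def __init__(self):
        self._live = {}      # endpoint -> model version receiving traffic
        self._previous = {}  # endpoint -> version to fall back to

    def deploy(self, endpoint, model_version):
        """Point the endpoint at a new version, remembering the old one."""
        if endpoint in self._live:
            self._previous[endpoint] = self._live[endpoint]
        self._live[endpoint] = model_version

    def rollback(self, endpoint):
        """Swap back to the previously live version."""
        self._live[endpoint], self._previous[endpoint] = (
            self._previous[endpoint],
            self._live[endpoint],
        )

    def resolve(self, endpoint):
        """Return the model version that should serve this endpoint."""
        return self._live[endpoint]


router = EndpointRouter()
router.deploy("/predict", "model-v1")  # blue
router.deploy("/predict", "model-v2")  # green goes live
print(router.resolve("/predict"))
router.rollback("/predict")
print(router.resolve("/predict"))
```

Callers only ever see `/predict`, so switching models never requires them to change anything.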

Slide 29

Slide 29 text

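First-generation architecture of Serving
Good
• No longer 1 inference API = 1 ELB, so costs went down
• Blue/Green became possible
• The EBS burst problem was solved
• EC2 metadata can no longer be accessed
Bad
• More and more home-grown implementation, and the maintenance cost is getting out of hand. That said, problems that AWS features do not cover can only be solved by building it yourself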

Slide 30

Slide 30 text

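Second-generation architecture of Serving
• ECS and EC2 did not get along
  • When Auto Scaling kicks in, it terminates EC2 instances without regard for the containers on them. Please drain first (is this fixed now?)
  • Auto Scaling triggers are basically based on CPU reservation, so scaling by container size takes real effort and the calculation logic is tedious
  • Free resources are not consolidated even when they exist
  • Blue/Green deployment for replacing instances had to be implemented ourselves
• Hosting many customer inference APIs makes costs balloon
  • We looked into whether we could make good use of Spot Instances
  • You need to spread across instance types and zones, and switch to On-Demand when Spot Instances are sold out
  • After evaluation, we decided to adopt a service called Spotinst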
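Why scaling on CPU reservation gets awkward for large containers can be seen in a small calculation. This is a minimal sketch with a hypothetical two-instance cluster; it assumes reservation is computed as reserved CPU units divided by registered CPU units across the cluster.

```python
def cpu_reservation(instances):
    """Cluster-wide CPU reservation in percent: reserved units / registered units."""
    reserved = sum(sum(tasks) for _, tasks in instances)
    registered = sum(capacity for capacity, _ in instances)
    return 100.0 * reserved / registered


def fits_somewhere(instances, task_cpu):
    """True if at least one instance has enough unreserved CPU for the new task."""
    return any(capacity - sum(tasks) >= task_cpu for capacity, tasks in instances)


# Hypothetical cluster: each entry is (CPU units registered, CPU units reserved per task).
cluster = [
    (4096, [1024, 1024, 1024]),
    (4096, [1024, 1024, 1024]),
]

print(cpu_reservation(cluster))       # 75.0, which may sit below a scale-out threshold
print(fits_somewhere(cluster, 2048))  # False: 2048 units are free in total, but split across hosts
```

A cluster-wide percentage hides fragmentation, so a threshold has to be tuned against the largest container you expect to place, which is the tedious calculation the slide complains about.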

Slide 31

Slide 31 text

For more on Spotinst, see here

Slide 32

Slide 32 text

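Second-generation architecture of Serving
Good
• Spotinst covers the areas where ECS falls short
• Blue/Green deployment is enforced
• Consolidation efficiency improved, cutting costs by 60-70%
Bad
• If you get the scope of Spot Instance usage wrong, unintended stops occur (Jupyter Notebook, for example)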

Slide 33

Slide 33 text

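Third-generation architecture of Serving
• A heavy hitter appeared who runs 300 containers in a single Service
  • DynamoDB partitioning became skewed and throttling occurred in large volumes
• Momentum for Kubernetes was building. By using Kubernetes:
  • The API gateway equivalent can be replaced with Ambassador (Envoy)
  • However, when there are a large number of Services/Pods, updating Envoy's routing rules can take tens of seconds
  • Service discovery and health checks can be left to standard features
• As the Serving environment moves to Kubernetes, we are also reducing the home-grown parts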

Slide 34

Slide 34 text

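Third-generation architecture of Serving
Good
• Health checks and service discovery were replaced: home-grown implementation out, standard Kubernetes features in
• Gateway functionality was replaced with a Kubernetes extension (Ambassador)
Bad
• You now need to understand Ambassador (Envoy)
• Compared with a home-grown implementation it is better thought out in many respects and maintenance cost drops, but customizability suffers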

Slide 35

Slide 35 text

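Third-generation architecture of Serving
• Current challenges
  • We do not know how to measure outbound transfer volume of Pods per customer
    • Apparently GKE has this. Time for Istio...?
  • To visualize cost per customer we have to hook Kubernetes events all over the place
    • On GKE this seems possible with something called usage metering, so we would love to see it on EKS too. I think it is needed even for multi-tenancy across internal business units
  • The difficulty of building our own system on top of Kubernetes' eventually consistent behavior
    • For example:
    • A Pod's STATUS becomes Running. You expect it to be reachable, but Ready (readinessProbe) is 0/1
    • Ready becomes 1/1. You expect it to be reachable, but some Ambassador (Envoy) Pods can reach it while others cannot because they are still waiting for an update. Because this is eventual consistency in a distributed system, it is hard to tell users from "when" the service became usable
    • Precisely because each component only guarantees eventual consistency, we have to fill in the gaps where we want them to act in lockstep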
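One way to bridge the "Running, then Ready, then actually reachable" gap is to poll until every layer has converged before telling the user the endpoint is usable. The sketch below simulates this with plain callables; it makes no Kubernetes or Envoy API calls, and both check functions are hypothetical stand-ins that a real system would have to implement.

```python
import time


def wait_until_reachable(pod_is_ready, gateway_replicas_have_route, timeout=60.0, interval=1.0):
    """Poll until the Pod is Ready and every gateway replica has picked up the route.

    Both arguments are caller-supplied callables (hypothetical here); the second
    returns one boolean per gateway replica.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if pod_is_ready() and all(gateway_replicas_have_route()):
            return True
        time.sleep(interval)
    return False


# Simulated cluster state that converges over successive polls.
polls = {"count": 0}


def pod_is_ready():
    polls["count"] += 1
    return polls["count"] >= 2  # Running first, Ready only from the second poll


def gateway_replicas_have_route():
    # One replica has the route early; the other catches up on the fourth poll.
    return [True, polls["count"] >= 4]


print(wait_until_reachable(pod_is_ready, gateway_replicas_have_route, timeout=5.0, interval=0.01))
print(polls["count"])
```

Requiring all gateway replicas, rather than any one of them, is what avoids the case where a request succeeds or fails depending on which replica it lands on.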

Slide 36

Slide 36 text

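Third-generation architecture of Serving
• Current challenges
  • IP address exhaustion
    • Each instance uses 20 IPs for Serving and 30 for Training. When 20 × 400 Services are created, 8,000 IPs are in use. Unbelievably, a /16 subnet ran dry
  • Hard to make declarative
    • We deploy through an SDK, so we are not in a "re-apply the yaml and you can reproduce it" (declarative) state. Sad
  • Hard to monitor
    • It is difficult to separate user-caused problems from platform problems
    • There are Pods that are dead for customer-caused reasons. Not everything is running normally
    • We cannot monitor 5xx/4xx at our own gateway. Because the backends depend on customers, 5xx is entirely plausible
  • Typical microservice troubles
    • LBs/proxies are layered several deep, which makes investigation hard. It is difficult to decide how much logging and metrics to expose to customers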

Slide 37

Slide 37 text

Training

Slide 38

Slide 38 text

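First-generation architecture of Training
• The platform that runs training jobs
• Implemented on Kubernetes from the initial version
  • Implementing things like Job and Pod (a container for saving artifacts) ourselves on ECS would have been inefficient
  • But EKS did not exist yet, so it ran on EC2
• Customer containers and management containers were put in separate Pods
  • The container that automatically saves the trained model to S3 and the container that collects logs should not live alongside user code from an IAM-permissions standpoint, so they are started as separate, agent-like Pods
  • Used kube2iam to attach an IAM role per Pod (is there an official solution now?)
  • privileged is off, no matter what
• GPU driver matters are a struggle, but we push through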

Slide 39

Slide 39 text

First-generation architecture of Training
Good
• No more reinventing the wheel on ECS
Bad
• Operating Kubernetes on EC2 is quite painful

Slide 40

Slide 40 text

Second-generation architecture of Training
• We want Jupyter Notebook
• We want to view TensorBoard
• We want to share datasets and results between Jobs and between Notebooks

Slide 41

Slide 41 text

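Second-generation architecture of Training
Good
• Provides Jupyter Notebook and TensorBoard
• Provides shared storage
Bad
• Operating Kubernetes on EC2 is quite painful
• The running cost of Jupyter while it is unused is wasted
• The shared storage is expensive and, being NFS, slow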

Slide 42

Slide 42 text

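Third-generation architecture of Training
• EKS came out, so we want to move onto it
• We want Jupyter to be shut down automatically when it is not in use
• EFS costs balloon, so we want to move data back to S3 automatically and delete the EFS contents automatically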
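The automatic Jupyter shutdown comes down to comparing each notebook's last activity against an idle limit. A minimal sketch follows; the one-hour limit, the notebook names, and the way last-activity timestamps are obtained are all assumptions, since the slides do not describe the mechanism.

```python
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(hours=1)  # assumed threshold; the slides do not state one


def notebooks_to_stop(last_activity, now):
    """Return the names of notebooks whose last activity is older than IDLE_LIMIT."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen > IDLE_LIMIT
    )


now = datetime(2020, 1, 1, 12, 0)
last_activity = {
    "alice-notebook": datetime(2020, 1, 1, 11, 45),  # active 15 minutes ago
    "bob-notebook": datetime(2020, 1, 1, 9, 30),     # idle for 2.5 hours
}
print(notebooks_to_stop(last_activity, now))
```

A periodic job could run this check and stop whatever it returns.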

Slide 43

Slide 43 text

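Third-generation architecture of Training
Good
• Operational load went down with the move to EKS
• Costs went down by writing data out from EFS to S3 and restoring it when needed
• Costs went down thanks to automatic Jupyter shutdown
Bad
• EFS is NFS, so it is slow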

Slide 44

Slide 44 text

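Third-generation architecture of Training
• Current challenges
  • Revenue and cost of goods in the annual plan
    • We need to make these more precise, but everything depends on the customers, so forecasting cost is frankly beyond us. How does AWS manage this?
  • Bugs
    • There was a time when a p3.16xlarge failed to stop and ran up a bill on the order of P × 10,000 yen
  • OS
    • It has to be an OS that nvidia-driver supports, which means Ubuntu or Amazon Linux 2. We hope "Bottlerocket" gets nvidia-driver support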

Slide 45

Slide 45 text

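Third-generation architecture of Training
• Current challenges
  • Dependencies between containers
    • To collect logs without gaps, we use Affinity to start Pods in priority order: log collector pod -> platform agent pod -> training job. As a result, the Event signalling that required resources are insufficient fires late, the notification to the Autoscaler is delayed, and instance start-up becomes slow
  • Compatibility and support for Docker images and packages
    • It is hard to maintain compatibility of the Docker images we provide
    • Support targets
      • DL library type x version x Python version x CUDA version ...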
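The size of the support matrix in the last bullet grows multiplicatively, which a short enumeration makes visible. All library names and version strings below are placeholders; the slide lists only the dimensions, not the values.

```python
from itertools import product

# Placeholder support targets; the slide names the dimensions but not the values.
frameworks = {
    "framework-a": ["1.x", "2.x"],
    "framework-b": ["1.x"],
    "framework-c": ["7.x"],
}
python_versions = ["3.6", "3.7"]
cuda_versions = ["cuda-x", "cuda-y", "cuda-z"]

library_versions = [(name, v) for name, versions in frameworks.items() for v in versions]
matrix = list(product(library_versions, python_versions, cuda_versions))

print(len(matrix))  # number of images if every combination were supported
print(matrix[0])
```

Adding one more version along any axis multiplies the number of images to build and test, which is why keeping them compatible is hard.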

Slide 46

Slide 46 text

Trigger

Slide 47

Slide 47 text

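First-generation architecture of Trigger
• The platform that runs inference jobs
• Fires when data is put into Datalake
• S3 -> SNS -> SQS, so messages are queued first
• A subscriber pulls from the SQS queue and submits tasks to AWS Batch
• Instances stop automatically when the job finishes
• In practice, it turned out that hundreds of jobs are submitted per minute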
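The queue-then-subscribe flow above can be sketched with an in-process queue. This is only a simulation under stated assumptions: `queue.Queue` stands in for SNS and SQS, `submit_job` is a hypothetical stand-in for submitting to AWS Batch, and no AWS APIs are called.

```python
import queue


def on_file_uploaded(job_queue, file_key):
    """Stand-in for S3 -> SNS -> SQS: every upload becomes a queued message."""
    job_queue.put({"file": file_key})


def drain(job_queue, submit_job, max_messages=10):
    """Subscriber loop body: take up to max_messages and hand each one to the job runner."""
    submitted = []
    for _ in range(max_messages):
        try:
            message = job_queue.get_nowait()
        except queue.Empty:
            break
        submitted.append(submit_job(message))
    return submitted


def submit_job(message):
    """Hypothetical stand-in for submitting an inference job to AWS Batch."""
    return "job-for-" + message["file"]


jobs = queue.Queue()
for key in ["a.jpg", "b.jpg", "c.jpg"]:
    on_file_uploaded(jobs, key)

print(drain(jobs, submit_job, max_messages=2))  # the subscriber works at its own pace
print(jobs.qsize())                             # the rest of the burst stays buffered
```

Because uploads and job submission are decoupled by the queue, a burst of hundreds of uploads per minute only lengthens the queue instead of overloading the subscriber.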

Slide 48

Slide 48 text

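First-generation architecture of Trigger
Good
• SQS absorbs sudden load spikes
• Subscribers scale with the number of messages in the queue
• With AWS Batch you just submit and you are done
Bad
• Slow start-up, no metrics, and logs are aggregated so they cannot be viewed per Job
• It did not fit the intended use case of AWS Batch
• Cross-AZ network transfer volume is alarming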

Slide 49

Slide 49 text

Logging for Customer

Slide 50

Slide 50 text

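First-generation architecture of Logging for Customer
• The logging platform provided to customers
• Provides logs for Serving, Training, and Trigger
• Logging is not our core competence, so we want to avoid operating it ourselves as far as possible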

Slide 51

Slide 51 text

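First-generation architecture of Logging for Customer
Good
• All serverless (CWLogs / Kinesis / Lambda / DynamoDB), so no operational management is needed
Bad
• When log volume spikes, throttling occurs in Lambda or DynamoDB
• CWLogs is fairly expensive considering we barely use it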

Slide 52

Slide 52 text

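Second-generation architecture of Logging for Customer
• Changed the log backend to Datadog Logs
  • Serving and Training logs are stored in Datadog Logs
  • Logs are fetched through the Datadog Logs API and provided to customers
• We considered Elasticsearch, but:
  • At the time of evaluation, log volume was 500 million records per month
  • Considering business growth and the plan to log more things, volume doubles every three months
  • We did not want to operate an Elasticsearch cluster that keeps doubling, and since logging is not our core competence we did not want to spend operational cost on it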
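The growth figures above compound quickly. A short projection, using only the numbers on the slide (500 million records per month, doubling every three months); the function name and the chosen horizons are illustrative.

```python
def projected_records(start_per_month, months, doubling_period=3):
    """Monthly log records after `months`, doubling once every `doubling_period` months."""
    return start_per_month * 2 ** (months / doubling_period)


START = 500_000_000  # 500 million records per month at the time of evaluation

for months in (3, 6, 12, 24):
    print(months, int(projected_records(START, months)))
```

After one year this gives 16 times the starting volume, that is, 8 billion records per month, which is the kind of growth the team did not want to absorb on a self-managed cluster.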

Slide 53

Slide 53 text

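Second-generation architecture of Logging for Customer
Good
• Fully managed, so no operations work
Bad
• It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves
• We are tied to Datadog's specifications
  • Eventual consistency, no ordering guarantee
  • The behavior and specifications of datadog-agent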

Slide 54

Slide 54 text

Logging for System & Application

Slide 55

Slide 55 text

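First-generation architecture of Logging for System & Application
• The platform for collecting system logs and application logs used internally
• Logging is not our core competence, so we want to avoid operating it ourselves as far as possible
• We mainly used ECS and Lambda, so logs were already stored in CloudWatch Logs
• So we simply used CW Logs to begin with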

Slide 56

Slide 56 text

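First-generation architecture of Logging for System & Application
Good
• No need to think about infrastructure
Bad
• You cannot tell where anything is stored
• Searchability is nil. Your skill at grepping by eye improves
• Investigating logs across microservices is nothing but pain
• As a result, nobody looks at logs unless there is a problem
• CWLogs is surprisingly expensive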

Slide 57

Slide 57 text

Second-generation architecture of Logging for System & Application
• Stopped forwarding to CWLogs and consolidated everything in Datadog Logs

Slide 58

Slide 58 text

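Second-generation architecture of Logging for System & Application
Good
• No need to think about infrastructure
• Everything can be searched in one place
• Tags and Attributes can be indexed and made searchable
• You can see at a glance how many entries there are for each field
• Because we run microservices, we embed things like a Tracing ID to make requests easier to follow
• Usage broadened: analyzing API trends, investigating the scope of impact, and so on
Bad
• It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves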

Slide 59

Slide 59 text

And that brings us to the present

Slide 60

Slide 60 text

Conclusion
• AWS, which builds platforms that run customers' application code, is amazing

Slide 61

Slide 61 text

That's all