Mercari's ML Platform, MLCT vol.5, hnakagawa
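Self-Introduction
• Hirofumi Nakagawa (hnakagawa)
• Joined the company in July 2017
• Team: SRE
• Jack of all trades, from device driver development to frontend development
• NOT a data scientist
• https://github.com/hnakagawa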
Work
• ML Platform development
• Bridging the skill gap between data scientists and SREs
• ML Reliability, SysML?, MLOps?
• Automating ML systems from the SRE side
ML Platform
• An in-house ML Platform
• Based on Kubernetes
• Provides an environment for easily running training and serving with existing ML frameworks
We plan to release it as OSS eventually (probably)
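ML Use Cases at Mercari
• Kandō Shuppin (image-assisted listing)
• Prohibited listing detection
• Price suggestion
• Weight suggestion
…and more, making on the order of 10 million predictions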
ML Platform Architecture
(Diagram: Kubernetes Controller, CLI, Cluster, Workflow, Dashboard, Storage, Gateway, Metrics, Runner. Legend: Component, Mercari ML Component, External Middleware.)
Model Training & Serving Workflow
Workflow for Production
(Diagram: CI; ML Platform training cluster (Job, Job, …); Model Registry; ML Platform serving cluster for test (REST API, Streaming, TF Serving, …).)
Training Workflow
(Diagram: CI; ML Platform training cluster (Job, Job, …); Model Registry.)
1. A push to GitHub triggers training.
2. The trained model is uploaded to the Model Registry.
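The two training steps can be sketched as a push handler that trains and then registers a versioned artifact. This is a minimal illustration, not the platform's implementation. `ModelRegistry`, `train`, and `on_github_push` are hypothetical names, and "training" is a placeholder. The only real-world detail is that a GitHub push webhook payload carries the new commit SHA in `after` and the repository in `repository.full_name`.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory stand-in for the Model Registry."""
    versions: dict = field(default_factory=dict)  # model name -> list of records

    def register(self, name: str, artifact: bytes, commit: str) -> dict:
        # Content-address the artifact so identical models get identical digests.
        digest = hashlib.sha256(artifact).hexdigest()[:12]
        history = self.versions.setdefault(name, [])
        record = {"version": len(history) + 1, "digest": digest, "commit": commit}
        history.append(record)
        return record

    def latest(self, name: str) -> dict:
        return self.versions[name][-1]


def train(commit: str) -> bytes:
    # Placeholder "training": serialize fake parameters tied to the commit.
    params = {"commit": commit, "weights": [0.1, 0.2, 0.3]}
    return json.dumps(params, sort_keys=True).encode()


def on_github_push(event: dict, registry: ModelRegistry) -> dict:
    # Step 1: the push event triggers a training job.
    artifact = train(event["after"])
    # Step 2: the trained model is uploaded to the Model Registry.
    return registry.register(event["repository"]["full_name"], artifact, event["after"])
```

Each push produces a new version record. Downstream serving can then key off `registry.latest(name)` instead of off the Git history.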
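Serving Workflow
(Diagram: Model Registry; ML Platform serving cluster for test (REST API, Streaming, TF Serving, …).)
1. The platform watches the Model Registry and automatically serves new models.
2. Once serving and tests succeed, a production k8s manifest is generated.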
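Step 2 can be sketched as a gate: run a smoke test against the test deployment, and emit a production manifest only if it passes. This is a hypothetical sketch. `smoke_test`, `promote`, the label names, the image, and the replica count are all assumptions. The manifest is a minimal `apps/v1` Deployment, and it is emitted as JSON because `kubectl apply` accepts JSON as well as YAML.

```python
import json


def smoke_test(predict, samples) -> bool:
    # Hypothetical check: every sample must yield a prediction without raising.
    try:
        return all(predict(x) is not None for x in samples)
    except Exception:
        return False


def production_manifest(model: str, version: int, image: str, replicas: int = 3) -> dict:
    # Minimal apps/v1 Deployment; labels, image and replicas are assumptions.
    labels = {"app": model, "model-version": str(version)}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model}-v{version}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": model, "image": image}]},
            },
        },
    }


def promote(model: str, version: int, image: str, predict, samples):
    # Emit the production manifest only after serving and tests succeed.
    if not smoke_test(predict, samples):
        return None
    return json.dumps(production_manifest(model, version, image), indent=2)
```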
Container Workflow
(Diagram: Data Source (Image), Text Preprocessing (Image), Picture Preprocessing (Image), and Estimator (Image) steps, connected through PVs; labeled "Its own implementation".)
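The diagram shows each step as its own image, with data handed between steps through PVs (PersistentVolumes). This sketch imitates that pattern in a single process. Each hypothetical step function reads its inputs from one directory and writes its outputs to another, and temporary directories stand in for the PVs. The step names come from the diagram. The record format and the toy "estimate" are assumptions.

```python
import json
import tempfile
from pathlib import Path


def data_source(out: Path) -> None:
    # Hypothetical source step: writes raw records to its output volume.
    records = [{"id": 1, "text": "Red Sneakers ", "picture": [0, 128, 255]}]
    (out / "records.json").write_text(json.dumps(records))


def text_preprocessing(inp: Path, out: Path) -> None:
    records = json.loads((inp / "records.json").read_text())
    feats = {r["id"]: r["text"].strip().lower().split() for r in records}
    (out / "text.json").write_text(json.dumps(feats))


def picture_preprocessing(inp: Path, out: Path) -> None:
    records = json.loads((inp / "records.json").read_text())
    feats = {r["id"]: [p / 255 for p in r["picture"]] for r in records}
    (out / "picture.json").write_text(json.dumps(feats))


def estimator(text_vol: Path, pic_vol: Path) -> dict:
    text = json.loads((text_vol / "text.json").read_text())
    pics = json.loads((pic_vol / "picture.json").read_text())
    # Placeholder "estimate": token count plus mean pixel intensity.
    return {k: len(text[k]) + sum(pics[k]) / len(pics[k]) for k in text}


def run_workflow() -> dict:
    with tempfile.TemporaryDirectory() as root:
        # One directory per volume, as separate PVs would be mounted per step.
        raw, txt, pic = (Path(root) / d for d in ("raw", "text", "picture"))
        for d in (raw, txt, pic):
            d.mkdir()
        data_source(raw)
        text_preprocessing(raw, txt)
        picture_preprocessing(raw, pic)
        return estimator(txt, pic)
```

The steps share nothing except files on a volume. That is what allows each step to be packaged and scaled as a separate container image.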
Example Model Serving API Configuration
(Diagram: Mercari API, REST, Flask, SK Model ×3, gRPC, TensorFlow Serving, TF Model ×2.)
Flask performs the preprocessing and forwards requests to the TensorFlow Serving instance behind it.
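The Flask layer's job can be sketched without Flask: preprocess each incoming item, then forward the feature vectors to TensorFlow Serving. For brevity, this sketch swaps the diagram's gRPC hop for TensorFlow Serving's REST predict endpoint (`POST /v1/models/<name>:predict` with an `{"instances": [...]}` body, answered with `{"predictions": [...]}`). The `preprocess` features, the model name, and the host/port are hypothetical. In the real setup, scikit-learn models do the preprocessing.

```python
import json
import urllib.request


def preprocess(item: dict) -> list:
    # Hypothetical stand-in for the scikit-learn preprocessing done in Flask.
    title = item.get("title", "").lower()
    return [float(len(title)), float(item.get("price", 0)) / 1000.0]


def build_predict_request(model: str, items: list, host: str = "localhost:8501"):
    # TensorFlow Serving REST predict API (the original setup uses gRPC instead).
    body = json.dumps({"instances": [preprocess(i) for i in items]}).encode()
    return urllib.request.Request(
        f"http://{host}/v1/models/{model}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(model: str, items: list, host: str = "localhost:8501") -> list:
    # Requires a running TensorFlow Serving instance at `host`.
    req = build_predict_request(model, items, host)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())["predictions"]
```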
Example Model Serving API Configuration (Streaming Version)
(Diagram: PubSub, ML Platform Framework or Apache Beam, SK Model ×3, gRPC, TensorFlow Serving, TF Model ×2.)
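Models and Container Images
• Should a huge ML model be baked into the container image or not?
• If not, where should it be stored?
• It is a trade-off between portability and load time
• If you have a good idea, please let me know…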
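Different Characteristics from Ordinary APIs
• The resources handled, including model size, are often large (hundreds of MB to GB)
• CPU and memory consumption is heavy
• GPUs are sometimes used as well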
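Memory Consumption
• The Python part of the violation detection system consumes 2 GB of memory at runtime, and this is expected to grow further
• The preprocessing part written with scikit-learn tends to get large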
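Python and Parallelism
• Of course, threads cannot give CPU parallelism (because of the GIL)
• Loading the model in every process inflates the required memory → this gets in the way of Blue-Green deployment, where the old and new versions run side by side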
Honestly, serving with Python is painful in many ways from an infrastructure standpoint…
Using Memory Wisely
• Load the model before forking so that Copy-on-Write takes effect
• We deliberately break the k8s "one process per container" rule
Copy-on-Write Refresher
1. The parent process allocates memory (Page A).
2. fork: the parent and child processes reference the same region.
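When the parent process rewrites the memory contents…
The OS allocates a separate region (Page B) and copies the original data into it, so the processes now reference different regions.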
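The pre-fork pattern from "Using Memory Wisely" can be sketched as follows: load the model once in the parent, then `os.fork()` workers that inherit its pages through Copy-on-Write instead of each loading their own copy. This is a POSIX-only sketch with a hypothetical dict standing in for the model. `gc.freeze()` (Python 3.7+) is called before forking, as the `gc` module documentation recommends for fork-without-exec. Sharing stays best-effort, because CPython's reference counting writes to object headers even on reads, and that dirties some pages.

```python
import gc
import os


def load_model() -> dict:
    # Hypothetical "model": a large lookup table built once in the parent.
    return {i: i * 0.5 for i in range(200_000)}


def serve_with_prefork(num_workers: int = 2) -> list:
    model = load_model()  # 1. load the model BEFORE forking
    # Move all currently tracked objects into the permanent generation so
    # later GC passes in the children do not write to their headers.
    gc.freeze()
    results = []
    for worker_id in range(num_workers):
        r, w = os.pipe()
        pid = os.fork()  # 2. the child starts out sharing the parent's pages
        if pid == 0:
            # Child: read-only use of the inherited model, no reload needed.
            try:
                os.close(r)
                os.write(w, str(model[worker_id * 10]).encode())
            finally:
                os._exit(0)  # never fall back into the parent's loop
        os.close(w)
        with os.fdopen(r) as f:
            results.append(float(f.read()))
        os.waitpid(pid, 0)
    gc.unfreeze()
    return results
```

For determinism, the workers here run one at a time. A real pre-fork server keeps them alive and hands requests to them, but the memory argument is the same.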
Current Issues
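Sophisticated, Continuous Maintenance Is Required
• ML features must keep being adapted as data trends shift and unexpected problems arise
ML features keep incurring large costs even after release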
Extensive automation is essential
In Progress
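Advanced Automation
• Turn implementations that extract features from in-house data into components
• Automate, to some extent, the building of models that solve specific problems
• Automate re-training, hyperparameter optimization, and deployment after release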
AutoFlow
(Diagram: Feature Extraction Components, Classification Components, Concatenation Components, Model Builder, Components, Registry.)
Automatically builds models and tunes hyperparameters on the cluster.
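The slide does not say which tuning method AutoFlow uses. As a generic illustration of automatic hyperparameter tuning, here is a plain random search, which is a substitute technique and not AutoFlow's own. `validation_loss` is a hypothetical objective. In practice, it would train a model on the cluster and return its validation loss. The search space, including the log-uniform learning rate, is also an assumption.

```python
import random


def validation_loss(params: dict) -> float:
    # Hypothetical objective; a real one would train and evaluate a model.
    return (params["learning_rate"] - 0.01) ** 2 * 1e4 + (params["depth"] - 6) ** 2


def random_search(space: dict, trials: int = 50, seed: int = 0):
    rng = random.Random(seed)  # seeded so a search can be reproduced
    best_params, best_loss = None, float("inf")
    for _ in range(trials):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),  # log-uniform
    "depth": lambda rng: rng.randint(2, 12),
}
```

Because each trial is independent, trials parallelize naturally as separate cluster jobs. This is one reason random search is a common baseline for this kind of automation.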
AutoServing
(Diagram: Deploy, Monitoring, Evaluation, Hyperparameter optimization, Re-Training.)
Automatically handles accuracy monitoring, re-training, and re-deployment after release.
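The monitor → re-train → re-deploy cycle can be sketched as one control-loop step. This is a hypothetical sketch, not AutoServing's design. `AutoServingLoop`, the accuracy threshold, and the injected `retrain`/`evaluate` callables are all assumptions. The key choice is that a re-trained candidate replaces the deployed model only if it clears the same bar on evaluation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AutoServingLoop:
    """Hypothetical post-release loop: monitor, re-train, re-deploy."""
    threshold: float                  # minimum acceptable accuracy (assumed)
    retrain: Callable[[], str]        # trains and returns a new version id
    evaluate: Callable[[str], float]  # offline accuracy of a candidate version
    deployed: str = "v1"

    def step(self, live_accuracy: float) -> str:
        # Monitoring: nothing to do while live accuracy stays above the bar.
        if live_accuracy >= self.threshold:
            return "ok"
        candidate = self.retrain()
        # Re-deploy only if the re-trained candidate passes evaluation.
        if self.evaluate(candidate) >= self.threshold:
            self.deployed = candidate
            return "redeployed"
        return "retrain-failed"
```

If evaluation fails, the current model is left in place. A "retrain-failed" result is the point where a human would be alerted.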
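Summary
• ML needs infrastructure that differs somewhat from the usual → best practices are not established yet
• If you seriously run reasonably ML-heavy features in production, it will not go well without extensive automation and systematization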
Thank you for listening!!