Slide 1

Slide 1 text

Mercari's ML Platform MLCT vol.5
 
 hnakagawa


Slide 2

Slide 2 text

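About me
• Hirofumi Nakagawa (hnakagawa)
• Joined the company in July 2017
• Member of the SRE team
• A jack-of-all-trades, covering everything from device driver development to front-end development
• NOT a data scientist
• https://github.com/hnakagawa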

Slide 3

Slide 3 text

My work
• ML Platform development
• Bridging the skill gap between data scientists and SREs
• ML Reliability, SysML?, MLOps?
• Automating ML systems from an SRE standpoint

Slide 4

Slide 4 text

ML Platform
• An in-house ML Platform
• Kubernetes-based
• Provides an environment for easy Training/Serving using existing ML frameworks

Slide 5

Slide 5 text

Planned for release as OSS at some point (probably)

Slide 6

Slide 6 text

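ML use cases at Mercari
• Kando Listing (感動出品)
• Detection of listings that violate the rules
• Price suggestion
• Weight suggestion
and so on… Tens of millions of predictions are made per day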

Slide 7

Slide 7 text

ML Platform Architecture
Diagram labels: Kubernetes, Controller, CLI, Cluster Workflow, Dashboard, Storage Gateway, Metrics, Runner, Component, MercariML Component, External Middleware

Slide 8

Slide 8 text

Model Training & Serving
 Workflow

Slide 9

Slide 9 text

Workflow for Production
Diagram labels: CI; ML Platform training cluster (Job, Job, ...); Model Registry; ML Platform serving cluster for test (REST API, Streaming, TF Serving, ...)

Slide 10

Slide 10 text

Training Workflow
Diagram labels: CI; ML Platform training cluster (Job, Job, ...); Model Registry
1. A push to GitHub triggers training
2. The trained model is uploaded to the Model Registry

Slide 11

Slide 11 text

Serving Workflow
Diagram labels: Model Registry; ML Platform serving cluster for test (REST API, Streaming, TF Serving, ...)
1. The Model Registry is watched and new models are served automatically
2. When serving and the tests succeed, a k8s manifest for production is output

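The two steps of the serving workflow can be sketched roughly as follows. This is only an illustration under stated assumptions, not the platform's actual implementation: the function names `render_manifest` and `promote_if_healthy`, the image name, the label keys, and the replica count are all hypothetical, and the serve-and-test step is passed in as a plain callable.

```python
import json
from typing import Callable, Optional


def render_manifest(model_name: str, version: str, image: str, replicas: int = 2) -> dict:
    """Build a Kubernetes Deployment manifest (as a dict) for one model version."""
    # Label keys are hypothetical; the selector must match the pod template labels.
    labels = {"app": model_name, "model-version": version}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-{version}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "serving", "image": image}]},
            },
        },
    }


def promote_if_healthy(
    model_name: str,
    version: str,
    image: str,
    serve_and_test: Callable[[str, str], bool],
) -> Optional[str]:
    """Emit a production manifest only if serving and testing succeeded."""
    if not serve_and_test(model_name, version):
        return None  # keep the current production version untouched
    return json.dumps(render_manifest(model_name, version, image), indent=2)
```

Emitting JSON keeps the sketch within the standard library; a JSON manifest can be applied to a cluster in the same way as a YAML one.
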
Slide 12

Slide 12 text

Container Workflow
Diagram labels: DataSource Image, Text Preprocessing Image, Picture Preprocessing Image, Estimator Image, PV (three times)
Our own implementation

Slide 13

Slide 13 text

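Example configuration of a Model Serving API
Diagram labels: Mercari API, REST, Flask, SK Model (three times), gRPC, TensorFlow Serving, TF Model (twice)
Flask performs the preprocessing and passes requests on to the TensorFlow Serving instance behind it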
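
The "preprocess in front, TensorFlow Serving behind" arrangement can be sketched as below. Note the substitutions: the slide uses Flask and gRPC, whereas this sketch uses only the Python standard library and TensorFlow Serving's REST predict endpoint so that it stays self-contained. The `preprocess` function, the base URL, and the model name are hypothetical placeholders.

```python
import json
import urllib.request


def preprocess(text: str) -> list:
    """Toy preprocessing step (assumption): lowercase and split into tokens."""
    return text.lower().split()


def build_predict_request(base_url: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(base_url: str, model_name: str, text: str, timeout: float = 5.0) -> list:
    """Preprocess the input, forward it to the serving backend, return predictions."""
    request = build_predict_request(base_url, model_name, [preprocess(text)])
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

In a real front end the body of `predict` would sit inside a web handler, and the model would have to accept whatever tensor shape the preprocessing produces.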

Slide 14

Slide 14 text

Model Serving API
Example configuration of the streaming version
Diagram labels: PubSub, ML Platform Framework or Apache Beam, SK Model (three times), gRPC, TensorFlow Serving, TF Model (twice)

Slide 15

Slide 15 text

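Models and container images
• Whether or not to include a huge ML model in the container image
• If it is not included, where should it be placed?
• A trade-off between portability and load time
• If you have a good idea, please let me know…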

Slide 16

Slide 16 text

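Characteristics differ from an ordinary API
• The resources handled, and the model size in particular, are often large (hundreds of MB to several GB)
• CPU and memory consumption is heavy
• In some cases GPUs are used as well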

Slide 17

Slide 17 text

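The memory consumption problem
• The Python part of the violation detection system consumes about 2 GB of memory at runtime → and this is expected to grow further
• Preprocessing written in scikit-learn tends to become large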

Slide 18

Slide 18 text

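Python and parallelism
• Naturally, threads are of no use for parallelism (because of the GIL)
• Loading the model in every process increases the memory required → this gets in the way of Blue-Green Deploy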
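
To make the memory argument concrete, here is a back-of-the-envelope calculation. The roughly 2 GB per process comes from the slides; the worker count of 4 is a hypothetical figure chosen for illustration, and the function assumes that no memory is shared between processes.

```python
def peak_memory_gb(model_gb: float, workers: int, blue_green: bool = True) -> float:
    """Peak memory when every worker process holds its own copy of the model."""
    per_environment = model_gb * workers
    # During a blue-green switch the old and new environments run side by side.
    return per_environment * (2 if blue_green else 1)


steady_state = peak_memory_gb(2.0, workers=4, blue_green=False)  # one environment
during_switch = peak_memory_gb(2.0, workers=4)                   # both environments
```

With these assumed numbers, one environment needs 8 GB and the switch-over briefly needs 16 GB, which is why per-process model copies make blue-green deployment expensive.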

Slide 19

Slide 19 text

Honestly, serving in Python is often
 painful from an infrastructure point of view…

Slide 20

Slide 20 text

Using memory wisely
• Load the model before forking so that Copy on Write takes effect
• The k8s "one process per container" convention is deliberately broken

Slide 21

Slide 21 text

Copy On Write refresher
Diagram: the parent process (1. allocation) and the child process (2. fork) both reference the same region, Page A, in memory

Slide 22

Slide 22 text

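When a process modifies the memory contents…
Diagram: the OS allocates a separate region (Page B) and copies the original data from Page A; the process that wrote then references the separate region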
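
The load-then-fork idea can be sketched as below. It is a minimal illustration assuming a POSIX system (it relies on `os.fork`), not the talk's actual server. `load_model` is a hypothetical stand-in that allocates a 16 MiB buffer, and each worker simply reports the model size back through a pipe instead of serving requests.

```python
import gc
import os

MODEL_BYTES = 16 * 1024 * 1024  # hypothetical model size for the sketch


def load_model() -> bytes:
    """Stand-in for an expensive model load (assumption: a plain byte buffer)."""
    return bytes(MODEL_BYTES)


def serve_with_forked_workers(num_workers: int) -> list:
    model = load_model()  # 1. allocate once, in the parent
    # Move existing objects to a permanent generation so later garbage
    # collections in the children do not touch (and thereby copy) their pages.
    gc.freeze()
    children = []
    for _ in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()  # 2. fork: the child shares the parent's pages
        if pid == 0:
            try:
                os.close(read_fd)
                # The child only reads the model, so its data pages stay shared.
                os.write(write_fd, str(len(model)).encode())
            finally:
                os._exit(0)  # never fall through into the parent's code path
        os.close(write_fd)
        children.append((pid, read_fd))
    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as pipe:
            results.append(int(pipe.read()))
        os.waitpid(pid, 0)
    return results
```

One caveat specific to CPython: reference counts live inside each object, so merely touching an object writes to its header and can copy that page. Large contiguous buffers share well, while models made of many small Python objects share less than one might hope.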

Slide 23

Slide 23 text

Current Issues

Slide 24

Slide 24 text

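Advanced, continuous maintenance is required
• With ML features, trends in the data change and unexpected problems arise, and these have to be dealt with continually
ML features keep incurring substantial costs even after release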

Slide 25

Slide 25 text

Extensive automation is essential

Slide 26

Slide 26 text

In Progress

Slide 27

Slide 27 text

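Advanced automation
• Turn the implementations that do Feature Extraction from in-house data into components
• Automate, to some degree, the building of models that solve specific problems
• Automate post-release Re-Training, Hyper parameter optimization, Deploy, and so on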

Slide 28

Slide 28 text

AutoFlow
Diagram labels: Feature Extraction Components, Classification Components, Concatenation Components, Model Builder, Components Registry
Semi-automatic model building and automatic hyperparameter tuning run on the cluster
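
As a small illustration of automatic hyperparameter tuning, the sketch below runs an exhaustive grid search with the standard library. The slides do not say which search strategy AutoFlow uses, so grid search is an assumption made purely for illustration, and `toy_objective` is a hypothetical stand-in for training a model and returning its validation score.

```python
import itertools
from typing import Callable, Dict, Sequence, Tuple


def grid_search(
    objective: Callable[[Dict[str, float]], float],
    grid: Dict[str, Sequence[float]],
) -> Tuple[Dict[str, float], float]:
    """Return the best-scoring parameter combination (higher score is better)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    # Try every combination of the candidate values.
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Dict[str, float]) -> float:
    """Hypothetical validation score that peaks at depth=4, learning_rate=0.1."""
    return -abs(params["depth"] - 4) - abs(params["learning_rate"] - 0.1)


best, score = grid_search(
    toy_objective, {"depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 1.0]}
)
```

On a cluster, each call to the objective would typically become its own training job, so the combinations can be evaluated in parallel.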

Slide 29

Slide 29 text

AutoServing
Accuracy monitoring, Re-Training, Re-Deploy, and so on are carried out automatically after release
Diagram labels: Deploy, Monitoring, Evaluation, Hyper parameter optimization, Re-Training
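
The monitoring step that decides when to re-train could look roughly like this. It is a sketch under assumptions: the class name `AccuracyMonitor`, the sliding-window approach, and the threshold are all hypothetical, and the slides do not describe how AutoServing actually measures accuracy.

```python
from collections import deque


class AccuracyMonitor:
    """Track prediction correctness over a sliding window and flag degradation."""

    def __init__(self, window: int, threshold: float):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.threshold = threshold

    def record(self, predicted, actual) -> None:
        """Store whether one prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Fraction of correct predictions in the window (1.0 when empty)."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # Decide only once the window is full, to avoid reacting to a few samples.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold
```

A scheduler could poll `needs_retraining()` and, when it returns true, start a training job followed by evaluation and re-deployment.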

Slide 30

Slide 30 text

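Summary
• ML needs infrastructure that differs somewhat from the usual
 → best practices are not yet known
• Running ML features in production in earnest does not go well unless extensive automation and systematization are pushed forward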

Slide 31

Slide 31 text

Thank you for your attention!!