Slide 1

Slide 1 text

The ML Platform Supporting Mercari's Marketplace Health Measures
Mercari ML Ops Night Vol.1

hnakagawa


Slide 2

Slide 2 text

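About me • Hirofumi Nakagawa (hnakagawa) • Joined the company in July 2017 • Member of the SRE team • A jack-of-all-trades who does everything from device driver development to front-end development • NOT an ML engineer • https://github.com/hnakagawa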

Slide 3

Slide 3 text

My work • ML Platform development • Bridging the skill gap between ML engineers and SREs • ML Reliability, SysML?, MLOps? • Automating ML systems from an SRE standpoint

Slide 4

Slide 4 text

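ML Platform • An in-house ML platform • Kubernetes-based • Abstracts away the differences between the local environment and the cluster environment • A set of convenience APIs • Provides an environment for running training and serving easily with existing ML frameworks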

Slide 5

Slide 5 text

Planned for release as OSS at some point (probably)

Slide 6

Slide 6 text

Case study: real-time item monitoring system • Known as Lovemachine • Implemented on top of the ML Platform
Diagram labels: ML Platform training cluster, Lovemachine, GCS, GKE, PubSub, ML Platform serving cluster, Lovemachine

Slide 7

Slide 7 text

Model Training & Serving
 Workflow

Slide 8

Slide 8 text

Workflow for Production
Diagram labels: ML Platform training cluster, CI, ML Platform serving cluster for test, Model Registry, Job, Job, ..., REST API, Streaming, TF Serving, ...

Slide 9

Slide 9 text

Training Workflow
Diagram labels: ML Platform training cluster, CI, Model Registry, Job, Job, ...
1. A push to GitHub triggers training 2. The trained model is uploaded to the Model Registry

Slide 10

Slide 10 text

Serving Workflow
Diagram labels: ML Platform serving cluster for test, Model Registry, ..., REST API, Streaming, TF Serving, ...
1. The Model Registry is watched and new models are served automatically 2. When serving and testing succeed, a k8s manifest for production is emitted
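
Step 2 of the serving workflow can be pictured with a small sketch. This is only an illustration under assumptions, not the platform's actual code: the function names, the label keys, the image reference, and the idea of passing the test as a callable are all hypothetical. It emits a Kubernetes Deployment manifest as JSON only when the supplied smoke test passes.

```python
import json
from typing import Callable, Optional


def build_manifest(model_name: str, version: str, image: str, replicas: int = 2) -> dict:
    """Build a Kubernetes Deployment manifest as a plain dict (hypothetical layout)."""
    labels = {"app": model_name, "model-version": version}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-{version}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "serving", "image": image}]},
            },
        },
    }


def promote(model_name: str, version: str, image: str,
            smoke_test: Callable[[], bool]) -> Optional[str]:
    """Return the production manifest as JSON, or None if the smoke test fails."""
    if not smoke_test():
        return None
    return json.dumps(build_manifest(model_name, version, image), indent=2)
```

Gating the manifest on the test result means a model that fails on the test cluster never produces an artifact that could be applied to production.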

Slide 11

Slide 11 text

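Example Model Serving API architecture
Diagram labels: TensorFlow Serving, TF Model, TF Model, Flask, SK Model, SK Model, SK Model, gRPC, Mercari API, REST
Flask performs the preprocessing and forwards requests to the TensorFlow Serving instance behind it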
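
The split between a preprocessing front end and a model backend can be sketched without any framework. Everything here is an assumption for illustration: the hashed token counts stand in for the real preprocessing, the `description` field is a made-up request key, and `backend` is a hypothetical callable standing in for the gRPC call to TensorFlow Serving.

```python
import re
import zlib
from typing import Callable, Dict, List

TOKEN_RE = re.compile(r"\w+")


def preprocess(text: str, dim: int = 8) -> List[float]:
    """Hash tokens into a fixed-size count vector (a stand-in for real preprocessing)."""
    vec = [0.0] * dim
    for token in TOKEN_RE.findall(text.lower()):
        # crc32 is stable across processes, unlike the built-in hash() for str
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def handle_predict(body: Dict[str, str],
                   backend: Callable[[List[float]], float]) -> Dict[str, float]:
    """Front-end handler: preprocess the request, then delegate to the model backend."""
    features = preprocess(body["description"])
    return {"score": backend(features)}
```

Keeping the backend behind a callable makes the handler testable with a fake model, and the same handler could sit behind a Flask route or a streaming consumer.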

Slide 12

Slide 12 text

Example Model Serving API architecture (streaming version)
Diagram labels: TensorFlow Serving, TF Model, TF Model, ML Platform Framework or Apache Beam, SK Model, SK Model, SK Model, gRPC, PubSub

Slide 13

Slide 13 text

TensorFlow Serving • The serving environment provided by the TensorFlow project • Can serve TF models without going through the Python runtime • The standard implementation exposes its API over gRPC

Slide 14

Slide 14 text

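Models and container images • Whether or not to include a huge ML model in the container image • If it is not included, where should it be placed? • A trade-off between portability and load time • If you have a good idea, please let me know…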

Slide 15

Slide 15 text

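Different from ordinary APIs • The resources handled, and the model size in particular, are often large (hundreds of MB to several GB) • CPU and memory consumption is heavy • In some cases GPUs are used as well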

Slide 16

Slide 16 text

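The memory consumption problem • The Python portion of Lovemachine consumes about 2 GB of memory at runtime → and this is expected to grow further • Preprocessing written with scikit-learn, such as TF-IDF, often becomes large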

Slide 17

Slide 17 text

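Python and parallelism • Threads are naturally of no use for parallelism (because of the GIL) • Loading the model in every process increases the memory required → this becomes an obstacle to blue-green deployment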
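
A back-of-the-envelope calculation shows why per-process model loading gets in the way of blue-green deployment. The 2 GB figure is the approximate Lovemachine footprint quoted in this deck; the worker count of 4 is purely an assumed value, and the helper name is hypothetical.

```python
def peak_memory_gb(per_process_gb: float, workers: int, blue_green: bool = True) -> float:
    """Estimate peak memory when every worker process loads its own model copy.

    During a blue-green switch both the old and the new set of workers
    are resident at the same time, so the footprint doubles.
    """
    one_color = per_process_gb * workers
    return one_color * 2 if blue_green else one_color


# Assumed example: 4 workers at about 2 GB each
steady_state = peak_memory_gb(2.0, 4, blue_green=False)  # 8.0
during_switch = peak_memory_gb(2.0, 4)                   # 16.0
```

Under these assumptions a node needs headroom for 16 GB during the switch, even though steady-state serving uses only 8 GB.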

Slide 18

Slide 18 text

Honestly, serving with Python
is often painful from an infrastructure standpoint…

Slide 19

Slide 19 text

Using memory wisely • Load the model before forking so that copy-on-write takes effect • We deliberately break the k8s "one process per container" rule of thumb
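
The "load before fork" idea can be sketched as follows, assuming a Unix-like OS and Python 3.7 or later. The `load_model` stand-in and the pipe-based reporting are hypothetical and exist only to make the example observable. The `gc.freeze()` call is an addition not mentioned in the deck: it keeps the garbage collector from touching the preloaded objects, although CPython reference-count updates can still dirty some shared pages.

```python
import gc
import os
from typing import List


def load_model() -> dict:
    """Stand-in for an expensive model load (hypothetical)."""
    return {"weights": list(range(100_000))}


def serve_with_prefork(num_workers: int) -> List[str]:
    model = load_model()  # loaded once in the parent, before any fork
    gc.freeze()           # exempt existing objects from future GC passes
    children = []
    for worker_id in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: reads the model through pages shared with the parent
            status = 1
            try:
                os.close(read_fd)
                total = sum(model["weights"][:10])
                os.write(write_fd, f"{worker_id}:{total}".encode())
                status = 0
            finally:
                os._exit(status)  # never fall through into the parent's code
        os.close(write_fd)
        children.append((pid, read_fd))
    results = []
    for pid, read_fd in children:
        results.append(os.read(read_fd, 1024).decode())
        os.close(read_fd)
        os.waitpid(pid, 0)
    return results
```

Each worker answers from the model loaded by the parent without loading it again, which is what keeps the per-container footprint close to one model copy.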

Slide 20

Slide 20 text

Copy-on-write refresher
Diagram labels: Memory, Parent process, Child process, 1. allocation, 2. fork, Page A, both reference the same region

Slide 21

Slide 21 text

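When a process rewrites the contents of memory…
Diagram labels: Memory, Parent process, Child process, Page A, Page B
The OS allocates a separate region and copies the original data into it; the writing process then references the separate region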
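
The visible effect of this behavior can be checked with a small sketch, assuming a Unix-like OS. The function name and the 4096-byte buffer are illustrative choices. The page copy itself happens inside the kernel and cannot be observed directly from Python; the code only shows that a write in the child leaves the parent's data untouched.

```python
import os
from typing import Tuple


def child_write_is_private() -> Tuple[bytes, bytes]:
    page = bytearray(b"A" * 4096)  # plays the role of "Page A"
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        status = 1
        try:
            os.close(read_fd)
            page[0:1] = b"B"  # the write makes the OS give the child its own copy
            os.write(write_fd, bytes(page[:1]))
            status = 0
        finally:
            os._exit(status)
    os.close(write_fd)
    child_view = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    # The parent still sees the original contents; only the child's copy changed
    return bytes(page[:1]), child_view
```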

Slide 22

Slide 22 text

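Current Issues • Because we are dealing with human behavior, data trends shift easily and unexpected problems arise, so we have to keep responding continuously
 → The burden on the people who build the ML models never lets up
 → As SREs, we want to solve this with mechanisms that include automation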

Slide 23

Slide 23 text

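In Progress • Turning the implementation that builds embeddings from in-house data into components • Automating, to some extent, the construction of models that solve specific problems
 → Something AutoML-like, dedicated to solving in-house problems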

Slide 24

Slide 24 text

AutoFlow (tentative name)
Diagram labels: Feature Extraction Components, Classification Components, Concatenation Components, Model Builder, Components Registry
Performs semi-automatic model construction and automatic hyperparameter tuning on the cluster
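
Automatic hyperparameter tuning in its simplest form is an exhaustive grid search. The sketch below is not how AutoFlow works (the deck gives no detail); the function name, the search space, and the toy objective are all assumptions chosen to keep the example self-contained.

```python
import itertools
from typing import Callable, Dict, List, Tuple


def grid_search(objective: Callable[[Dict[str, float]], float],
                space: Dict[str, List[float]]) -> Tuple[Dict[str, float], float]:
    """Evaluate every combination in the space and return the best-scoring one."""
    names = sorted(space)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Dict[str, float]) -> float:
    """Hypothetical score that peaks at lr=0.1, depth=4 (stands in for validation accuracy)."""
    return -((params["lr"] - 0.1) ** 2) - abs(params["depth"] - 4)
```

On a cluster, each combination could run as an independent job, since the evaluations do not depend on one another.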

Slide 25

Slide 25 text

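Summary • ML requires infrastructure that differs somewhat from the usual
 → Best practices are not yet known • Running ML features in production in earnest does not go well without substantial automation and systematization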

Slide 26

Slide 26 text

Thank you for your attention!!

Slide 27

Slide 27 text

We are Hiring!!

Slide 28

Slide 28 text

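SRE ML Reliability • SysML? MLOps? A new job description • SRE skills plus basic knowledge of the ML field • People who will push forward the automation and systematization of ML infrastructure • Of course, we are also actively hiring for other roles!!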

Slide 29

Slide 29 text

Details here:
 https://careers.mercari.com/