Slide 1

Slide 1 text

ML Ops on ABEJA Platform
Yusuke Ueno, Software Engineer at ABEJA

Slide 2

Slide 2 text

Today's topics
• What is ABEJA Platform?
• Experiment management in machine learning
• Experiment management on ABEJA Platform and how it is implemented

Slide 3

Slide 3 text

What is ABEJA Platform?

Slide 4

Slide 4 text

Copyright © 2019 ABEJA, Inc. All rights reserved.

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Sense of scale

Slide 7

Slide 7 text

What is ML Ops?
My impression of DevOps:
• Process improvement between Development and Operations
• A cultural philosophy, practices, and tools that raise the ability to deliver applications

Slide 8

Slide 8 text

ML Ops: a working definition
• Process improvement between ML Engineers and Development
• A cultural philosophy, practices, and tools that raise the ability to deliver models accurate enough to apply to the business

Slide 9

Slide 9 text

Today's focus: the training part

Slide 10

Slide 10 text

Training is iterative work
• Writing and revising training code
• Trying different optimizers
• Tuning hyperparameters
• Revising the sampling method
• Using different library versions
• Changing the random seed

Slide 11

Slide 11 text

Managing experiments is important
Unless the conditions and results of every single experiment are recorded, the run that achieved good accuracy cannot be reproduced later.
Experiments need to be recorded and kept browsable.

Slide 12

Slide 12 text

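Record
• Dataset
• Code
• Parameters
• Execution environment
• Experiment results (evaluation metrics)
• Weight parameters
• Logs
• Execution time
Browse
• Comparison of experiment results
• Display of detailed information
• Experiment conditions
• Experiment results
• Visualization (images, etc.)
• Sharing among members
• Search over past experiments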

Slide 13

Slide 13 text

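Overview of experiment management
[Diagram labels: training job, training code, parameters, dataset, execution environment, evaluation results, weight file, logs, execution time; sharing among members, version control, visualization, comparison across training jobs]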

Slide 14

Slide 14 text

Overview of experiment management
[Diagram labels: training job, training code, parameters, dataset, execution environment, evaluation results, weight file, logs, execution time]

Slide 15

Slide 15 text

Dataset version control
Two components are provided:
• Datalake: object storage
• Datasets: reference information pointing to Datalake objects, plus metadata

Slide 16

Slide 16 text

Dataset version control
• The Annotation Tool outputs the result of annotating Datalake data as Datasets
[Diagram labels: Datalake, Datasets]

Slide 17

Slide 17 text

Dataset version control
When data is added, the result can be created as a separate datasets.
[Diagram labels: files, annotations, datasets, version]

Slide 18

Slide 18 text

Dataset version control
Tags logically partition the contents of a datasets so that only specific elements are extracted.
[Diagram: a datasets whose elements are marked A or B, selected through tagA and tagB]
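
The idea of a dataset as references plus metadata, filtered by tag, can be sketched as follows. This is a minimal illustration, not the Platform's data model: `DatasetItem`, `select_by_tag`, the field names, and the `datalake://` URIs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetItem:
    """One dataset element: a reference to stored data plus its metadata."""
    source_uri: str                 # hypothetical reference to a data lake object
    annotation: dict                # annotation result, e.g. a class label
    tags: frozenset = frozenset()   # logical partitions this item belongs to


def select_by_tag(items, tag):
    """Return only the items carrying the given tag, preserving order."""
    return [item for item in items if tag in item.tags]


items = [
    DatasetItem("datalake://channel/file-1", {"label": "cat"}, frozenset({"A"})),
    DatasetItem("datalake://channel/file-2", {"label": "dog"}, frozenset({"B"})),
    DatasetItem("datalake://channel/file-3", {"label": "cat"}, frozenset({"A"})),
]
subset_a = select_by_tag(items, "A")
```

Because the items hold only references, selecting a subset never copies the underlying files.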

Slide 19

Slide 19 text

Dataset visualization
The dataset itself can be inspected.

Slide 20

Slide 20 text

Overview of experiment management
[Same diagram as Slide 14]

Slide 21

Slide 21 text

Execution environment
The Platform provides Docker images that bundle the Python runtime with all the major frameworks and libraries.

Slide 22

Slide 22 text

Training code and parameters
• Python code that runs the training
• Implements a function that is called on the Platform
• If the Docker image lacks a required Python library, add it to requirements.txt
• Supplied parameters can be read inside the code as environment variables
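
A minimal sketch of that pattern, under assumptions: the entry-point name `handler`, the helper `read_params`, and the variable names `LEARNING_RATE` and `EPOCHS` are hypothetical and not taken from the ABEJA Platform API. It shows only that parameters arrive as environment variables, which are strings and need explicit conversion.

```python
import os


def read_params(env=os.environ):
    """Read hyperparameters from environment variables (names are hypothetical)."""
    return {
        # Environment values are always strings, so convert them explicitly.
        "learning_rate": float(env.get("LEARNING_RATE", "0.001")),
        "epochs": int(env.get("EPOCHS", "10")),
    }


def handler(context=None):
    """Hypothetical entry point that a platform would call to start training."""
    params = read_params()
    for _epoch in range(params["epochs"]):
        pass  # one training epoch would run here
    return params
```

Passing the mapping into `read_params` keeps the parsing testable without touching the real process environment.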

Slide 23

Slide 23 text

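Training job definition version
The dataset to use, the training code, the parameters, and the execution environment are bundled together and versioned in a runnable state.
[Diagram: a training job definition version containing training code, parameters, dataset, and execution environment]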
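
One way to picture such a bundle is an immutable record with a content fingerprint. This is an illustrative sketch only; the class `JobDefinitionVersion`, its fields, and the fingerprint scheme are assumptions, not the Platform's implementation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobDefinitionVersion:
    """Immutable bundle of everything needed to rerun a training job."""
    dataset_id: str          # which dataset version to train on
    code_digest: str         # e.g. a hash of the uploaded training code
    environment_image: str   # Docker image that provides the runtime
    params: tuple = ()       # default parameters as (key, value) pairs

    def fingerprint(self) -> str:
        """Stable hash: identical inputs always yield the same value."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Freezing the record means a change to any input has to produce a new version rather than silently mutating an old one.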

Slide 24

Slide 24 text

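Training job execution
A training job is run by specifying a training job definition version, parameters, and an instance type.
[Diagram: training job definition version (training code, parameters, dataset, execution environment) + override parameters + instance type → training job, which is recorded]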
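
The override step in the diagram can be sketched as a dictionary merge in which run-time values win over the defaults stored in the definition. The function name, the record layout, and the instance type string below are hypothetical.

```python
def resolve_job_request(definition_params, override_params, instance_type):
    """Build the record for one run: defaults first, overrides on top."""
    effective = {**definition_params, **override_params}
    return {
        "instance_type": instance_type,
        "params": effective,
        # Keep the overridden keys so the run can be audited later.
        "overridden_keys": sorted(override_params),
    }


request = resolve_job_request(
    {"learning_rate": 0.1, "epochs": 10},   # defaults from the definition version
    {"learning_rate": 0.01},                # overrides supplied at run time
    "gpu-small",                            # hypothetical instance type name
)
```

Storing the effective parameters with the job, rather than only the overrides, lets a run be reproduced without re-deriving them.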

Slide 25

Slide 25 text

Overview of experiment management
[Same diagram as Slide 14]

Slide 26

Slide 26 text

Running training jobs and managing their results
• Uses Kubernetes (EKS)
• Previously Kubernetes on EC2
• Uses nvidia-device-plugin so that GPUs are recognized
• Cluster autoscaling with Spotinst
• Makes it easy to scale across multiple instances
• p2 and p3 family instances

Slide 27

Slide 27 text

Overview of experiment management
[Same diagram as Slide 14]

Slide 28

Slide 28 text

Training job
• Runs as a k8s Job, passing the parameters to the training code
• Uses the SDK to update the accuracy at every epoch
[Diagram labels: instance type, execution environment, SDK, save accuracy, training code, parameters, evaluation results]
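
A minimal sketch of what such a Job could look like, written as a plain Python dict rather than through any client library. The function name, the container name, and the image are hypothetical. `batch/v1`, `restartPolicy`, `backoffLimit`, and the `nvidia.com/gpu` resource are standard Kubernetes fields.

```python
def build_training_job(name, image, params, gpus=1):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    # Parameters are handed to the training code as environment variables.
    env = [{"name": key, "value": str(value)} for key, value in sorted(params.items())]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed experiment automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "train",
                        "image": image,
                        "env": env,
                        # GPUs advertised by nvidia-device-plugin are requested
                        # through the nvidia.com/gpu extended resource.
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


manifest = build_training_job("train-001", "example/train:latest", {"EPOCHS": 3})
```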

Slide 29

Slide 29 text

Overview of experiment management
[Same diagram as Slide 14]

Slide 30

Slide 30 text

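Management containers
Placed on the same node as the training job, they save whatever the job outputs.
[Diagram: training job with shared file storage on EFS (mounted); Agent: status monitoring, saving output files; TensorBoard: exposed; Fluentd: collects and saves logs]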

Slide 31

Slide 31 text

Fluentd container
Saves the standard output emitted by training jobs.
• Containers are deployed with a k8s DaemonSet
• One Fluentd container runs on every node
• Basically, it watches /var/log/containers/*.log and saves those logs to external storage
• When a Pod disappears, its logs disappear with it
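
The deployment shape described above can be sketched as a DaemonSet manifest, again as a plain dict. The names `log-collector` and `varlog` and the image are hypothetical, and the Fluentd configuration itself is out of scope. Depending on the container runtime, the files under /var/log/containers may be symlinks whose targets need to be mounted as well.

```python
def build_log_collector_daemonset(image):
    """Return an apps/v1 DaemonSet that runs one log collector per node."""
    labels = {"app": "log-collector"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "log-collector"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "fluentd",
                        "image": image,
                        # Read-only view of the node's log directory.
                        "volumeMounts": [
                            {"name": "varlog", "mountPath": "/var/log", "readOnly": True}
                        ],
                    }],
                    "volumes": [{"name": "varlog", "hostPath": {"path": "/var/log"}}],
                },
            },
        },
    }


daemonset = build_log_collector_daemonset("example/fluentd:latest")
```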

Slide 32

Slide 32 text

Fluentd container
• Depending on the value of RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR, Fluentd either becomes a noisy neighbor or hits its resource limit and is killed by the OOM killer

Slide 33

Slide 33 text

TensorBoard container
Visualizes the event logs emitted by the training job.
• Placed on the same node as the Job using inter-pod affinity
• Mounts the same file system as the Job, then reads and displays the logs
• Exposed internally through a k8s Service of type NodePort; an in-house Gateway publishes it with authentication
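
Co-locating a pod with the Job's pod can be expressed with a `podAffinity` rule. The sketch below builds only the `affinity` section of a pod spec; the helper name is hypothetical and the label to match is whatever the Job's pods actually carry.

```python
def same_node_affinity(label_key, label_value):
    """Affinity section requiring the node that runs a pod with the given label."""
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {"matchLabels": {label_key: label_value}},
                # With the hostname topology key, "same domain" means "same node".
                "topologyKey": "kubernetes.io/hostname",
            }]
        }
    }


affinity = same_node_affinity("job-name", "train-001")
```

The `job-name` label used here is one that Kubernetes puts on pods created by a Job; any other label shared with the training pod would work the same way.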

Slide 34

Slide 34 text

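Agent container
Monitors the training job's status and records its start and end times.
• Polls the Job's status and records it
• Mounts the same file system as the Job and saves the output files when the training job finishes
[Diagram: Agent and training job; status monitoring and updates, saving output files]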
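
The polling behaviour can be sketched as a small loop. Everything here is illustrative: `watch_job`, the status strings, and the injected callables are assumptions, and the recorded start time is when watching began rather than a timestamp reported by the cluster.

```python
import time


def watch_job(get_status, save_outputs, poll_seconds=5.0,
              clock=time.time, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then save its outputs."""
    terminal = {"complete", "failed"}
    record = {"started_at": clock(), "history": []}
    while True:
        status = get_status()
        if not record["history"] or record["history"][-1] != status:
            record["history"].append(status)  # record each status change once
        if status in terminal:
            break
        sleep(poll_seconds)
    record["ended_at"] = clock()
    save_outputs()  # e.g. copy files from the shared mount to durable storage
    return record
```

Injecting `get_status`, `clock`, and `sleep` keeps the loop independent of any particular cluster client and makes it easy to exercise with fake statuses.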

Slide 35

Slide 35 text

Overview of experiment management
[Same diagram as Slide 14]

Slide 36

Slide 36 text


Slide 37

Slide 37 text

ML Ops
• Process improvement between ML Engineers and Development
• A cultural philosophy, practices, and tools that raise the ability to deliver models accurate enough to apply to the business

Slide 38

Slide 38 text

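Process improvement between ML Engineers and Development
Once a model meets the requirements, it is published so that other services can use it.
• Who writes the production code?
• Code written by a Data Scientist has to be rewritten
• Once rewritten, the accuracy no longer reproduces…
• Too many model updates → more service updates
• And so on…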

Slide 39

Slide 39 text

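Process improvement between ML Engineers and Development
The Development side can version a training result together with inference code.
[Diagram: Job 1 and Job 2 each produce a training result (weight file, execution environment, evaluation results); a model combines inference code with a weight file and an execution environment]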

Slide 40

Slide 40 text

Process improvement between ML Engineers and Development
A model can be published as a Web API as-is.
When the model is updated, the Web API can also be updated safely.
[Diagram: Model 1 and Model 2 (inference code, weight file, execution environment) are each deployed as a Web API; the endpoint can be switched between them]
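
Endpoint switching can be pictured as a stable name that points at one deployment at a time. This sketch is an assumption about the general idea, not the Platform's mechanism. The `Endpoint` class, its method names, and the rollback behaviour are hypothetical additions for illustration.

```python
class Endpoint:
    """A stable name that routes to exactly one deployed model version."""

    def __init__(self, name, deployment_id):
        self.name = name
        self.current = deployment_id
        self.previous = None

    def switch_to(self, deployment_id):
        """Point the endpoint at a new deployment, remembering the old one."""
        self.previous, self.current = self.current, deployment_id

    def rollback(self):
        """Return to the deployment that was active before the last switch."""
        if self.previous is None:
            raise RuntimeError("no previous deployment to roll back to")
        self.previous, self.current = None, self.previous


endpoint = Endpoint("classifier", "model-1-deployment")
endpoint.switch_to("model-2-deployment")
```

Callers keep using the same endpoint name while the deployment behind it changes, so the old deployment can stay up until the new one is verified.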

Slide 41

Slide 41 text

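Model management on the Platform as a whole
[Diagram labels: training job, training code, parameters, dataset, execution environment, evaluation results, weight file, logs, execution time; model: inference code, weight file, execution environment]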

Slide 42

Slide 42 text

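Summary
• Experiment management is tedious, but skipping it causes trouble later
• Version the inputs to training together: the dataset used, the training code, the execution environment, and so on
• Saving of outputs is implemented on the Platform side, so that it burdens developers as little as possible
• Linking the model that goes into service to the results of its training job ensures traceability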