Ref: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“As the machine learning (ML) community continues to accumulate years of experience with live systems”
“Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.”
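Third-generation architecture of Datalake
Good
• Signed URLs are no longer needed, which improves UX
• Metadata is now searchable
Bad
• It is no longer serverless, so maintenance cost went up
• The metadata feature is too free-form, so the index bloats, which is painful
Note: moved to ECS because of backend load and rising CWLogs (CloudWatch Logs) cost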
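Second-generation architecture of Training
Good
• Provides Jupyter Notebook and TensorBoard
• Provides shared storage
Bad
• Operating Kubernetes on EC2 is fairly painful
• Running cost is wasted while Jupyter is not in use
• Shared storage is expensive and, being NFS, slow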
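Third-generation architecture of Training
• Current issues
  • Dependencies between containers
    • To collect logs without gaps, Affinity is used to start Pods in the priority order log collector pod -> platform agent pod -> training job. With this ordering, the Event for insufficient resources fires late, so the notification to the Autoscaler is delayed and instance startup becomes slow
  • Compatibility and support for Docker images and packages
    • Keeping the provided Docker images compatible is difficult
    • Support targets
      • DL library type x library version x Python version x CUDA version …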
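The support-target item above is a combinatorial product, which is why image compatibility is hard to maintain. The following is a minimal sketch of how quickly that matrix grows. The library names and every version number are hypothetical placeholders, since the slide does not list concrete values.

```python
from itertools import product

# Hypothetical support targets; the slide gives no concrete names or versions.
dl_libraries = {
    "framework_a": ["1.0", "1.1"],
    "framework_b": ["2.0", "2.1", "2.2"],
}
python_versions = ["3.6", "3.7"]
cuda_versions = ["9.0", "10.0"]


def build_matrix(libraries, pythons, cudas):
    """Enumerate every (library, library version, Python, CUDA) combination."""
    return [
        (name, lib_version, py, cuda)
        for name, versions in libraries.items()
        for lib_version, py, cuda in product(versions, pythons, cudas)
    ]


matrix = build_matrix(dl_libraries, python_versions, cuda_versions)
print(len(matrix))  # 20 candidate images: (2 + 3) library versions x 2 Pythons x 2 CUDAs
```

Even with these small placeholder lists, each added Python or CUDA version multiplies the number of images that would need to be built and tested.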
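Second-generation architecture of Logging for Customer
Good
• Fully managed, so there is no operational work
Bad
• It costs money. That said, it is cheaper than CWLogs and better than operating it ourselves
• We are constrained by Datadog's specifications
  • Eventual consistency, no ordering guarantee
  • Behavior and specifications of datadog-agent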
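The "no ordering guarantee" item can be illustrated with a consumer-side workaround. This is a sketch under stated assumptions, not something the slides describe: it assumes a single emitter that stamps each record with its own clock plus a counter, and the `LogRecord`, `make_record`, and `restore_order` names are hypothetical. It does not use any Datadog API.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class LogRecord:
    timestamp_ms: int  # emitter's clock, in milliseconds (assumed field)
    seq: int           # per-emitter counter; breaks ties within one millisecond
    message: str


_seq = count()  # single-emitter assumption: one shared counter


def make_record(timestamp_ms, message):
    """Stamp a message with the emitter's timestamp and the next sequence number."""
    return LogRecord(timestamp_ms, next(_seq), message)


def restore_order(records):
    """Sort fetched records back into emission order."""
    return sorted(records, key=lambda r: (r.timestamp_ms, r.seq))


emitted = [
    make_record(1000, "job started"),
    make_record(1000, "epoch 1 done"),
    make_record(1005, "job finished"),
]
# Simulate a backend that returns the records out of order.
arrived = [emitted[2], emitted[0], emitted[1]]
ordered = restore_order(arrived)
print([r.message for r in ordered])  # ['job started', 'epoch 1 done', 'job finished']
```

With several emitters, a single counter no longer applies and clock skew between hosts would have to be handled separately.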
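First-generation architecture of Logging for System & Application
Good
• No need to think about infrastructure
Bad
• Nobody knows where the logs are stored
• No searchability at all. Your grep skills improve
• Investigating logs across microservices is nothing but pain
• As a result, logs are only looked at when something has gone wrong
• CWLogs is surprisingly expensive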