Research Paper Introduction #102 "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters"

cafenero_777

July 09, 2022

Transcript

  1. Research Paper Introduction #102 "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (No. 102 overall) @cafenero_777 2022/07/07
  2. Agenda
    • Target paper
    • Overview and why I chose to read it
    1. Introduction
    2. Background
    3. Workload Characterization
    4. GPU Machine Utilization
    5. Opportunities for Cluster Management
    6. Challenges of Scheduling
    7. Discussion
    8. Related Work
    9. Conclusion
  3. Target paper
    • MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
    • Qizhen Weng†*, et al.
    • †Hong Kong University of Science and Technology, *Alibaba Group
    • NSDI 2022
    • https://www.usenix.org/conference/nsdi22/presentation/weng
  4. 1. Introduction (general)
    • ML's major successes: language processing, image classification, speech recognition, recommendation, etc.
    • Large-scale GPU clusters are needed to serve ML workloads as MLaaS
    • Analyzes two months of workloads on the 6,742 GPUs of Alibaba PAI (Platform for AI)
    • Workloads, ML frameworks, and hardware are highly diverse
      • Requirements for GPU locality, scheduling, and hardware (V100/P100/T4, CPU, memory) differ widely
    • Goal: improve utilization and shorten job completion time at the same time
  5. 1. Introduction (challenges and solutions)
    1. Does fractional GPU allocation lower utilization?
      • Tasks are assigned a fraction of a GPU (not a whole GPU)
      • With GPU sharing (time multiplexing), only about 50% of the GPUs are actually used
    2. Is queueing delay significant for short tasks?
      • Estimate durations up front and prioritize short tasks over long ones in scheduling
      • More than half of the tasks are recurring -> easy to predict -> 63%+ reduction in delay
    3. Is scheduling high-GPU tasks difficult?
      • Placement must match task requirements (full GPUs + NVLink)
      • Reserve high-end machines for high-end tasks and schedule them there
    4. Does the CPU become the bottleneck (even though it is a GPU cluster)?
      • Training and inference run on GPUs, but data processing (feature extraction, sampling, etc.) runs mostly on CPUs
      • Some tasks are CPU-dominated even though they hold GPUs
    5. Is hardware utilization unbalanced across the cluster?
      • Usage: low-end machines (>70%) vs. high-end machines (CPU 35% / GPU 49%)
      • CPU demand: 8-GPU machines want roughly twice the CPUs they carry, while half of the onboard CPUs suffice on 2-GPU machines
  6. 2. Background
    • Data explosion and GPU demand
      • Data volumes grow from ~100 GB to ~100 TB; gang scheduling; a single ML job can span 1,000+ GPUs
    • Alibaba PAI
      • Supports a variety of ML pipelines and frameworks
      • Users specify jobs and their resources
    • Analysis based on real production traces
      • Prior studies cover only limited workloads and hardware
    • Large numbers of instances must be allocated simultaneously
  7. 3. Workload Characterization (1/3): trace overview
    • Trace contents
      • Timestamps, hardware resources used, containers used; inferred from runtime info (kernel, GPU driver, etc.)
      • (The public-cloud provider does not, and must not, look inside the applications)
    • Jobs / Tasks / Instances
      • A user submits one job, which consists of one or more tasks, each running as one or more instances (containers)
    • Skewed resource usage
      • The top 5% of users submit 77% (17k+) of the instances
      • 80% are gang-scheduled; 20% need 100+ GPUs
      • At the same time, per-task GPU locality is still required
    • Many users run very large numbers of instances
  8. 3. Workload Characterization (2/3): temporal patterns
    • Load is relatively high on weekdays and lower late at night (~= jobs are kicked off by humans)
    • Queueing time (scheduling time) is measured from task submission to instance start (a sketch of this computation follows below)
    • Run times vary widely (a mix of diverse workloads)
    • For 9% of short tasks, at least half of their time is spent waiting to be scheduled
    • Shared-GPU allocation is fast, dedicated-GPU allocation is slow, and the higher-end the GPU, the slower the allocation
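To make the queueing-delay metric on this slide concrete, here is a minimal Python sketch (my own illustration; the field names submit_time / start_time / end_time are assumptions, not the actual PAI trace schema) of how the wait from task submission to instance start, and its share of an instance's lifetime, could be computed:

```python
# A minimal sketch, not the paper's code. Field names are assumed.
from dataclasses import dataclass

@dataclass
class InstanceRecord:
    submit_time: float  # task submitted to the scheduler (seconds)
    start_time: float   # instance actually starts running
    end_time: float     # instance finishes

def queueing_delay(rec: InstanceRecord) -> float:
    """Time spent waiting between submission and execution."""
    return rec.start_time - rec.submit_time

def queueing_fraction(rec: InstanceRecord) -> float:
    """Share of the instance's total lifetime spent queued
    (the slide notes this exceeds 1/2 for 9% of short tasks)."""
    lifetime = rec.end_time - rec.submit_time
    return queueing_delay(rec) / lifetime if lifetime > 0 else 0.0

# Example: submitted at t=0, started at t=300, finished at t=500
rec = InstanceRecord(submit_time=0, start_time=300, end_time=500)
print(queueing_delay(rec), round(queueing_fraction(rec), 2))  # 300 0.6
```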
  9. 3. Workload Characterization (3/3): spatial patterns
    • Resource requests have a long tail
      • 20% of instances request a large amount of resources
      • The P95 request is 12 vCPU / 1 GPU / 59 GiB, roughly twice the P50 request
    • Actual usage is low compared with requests
      • The P50 of actually used resources is 1.4 vCPU / 0.042 GPU / 3.5 GiB
      • Under CPU contention, the GPU is left mostly "unusable" (Fig. 5b)
    • Comparing usage against requests shows a tendency for CPUs to be (over)used
  10. 4. GPU Machine Utilization
    • CPU and GPU utilization are higher than memory utilization
      • On 8-GPU machines, the P90 average utilization is 82% (CPU) and 77% (GPU)
      • On 2-GPU machines, the P90 average utilization is 42% (CPU) and 77% (GPU)
      • Average memory utilization is around 60% in both cases
    • Variance in resource utilization
      • GPU utilization varies more (comparing P90 and P50)
      • This follows from workload differences (GPUs are used in bursts)
    • Network: less than half of the guaranteed bandwidth is used
    • CPU: time goes to plain data processing, not iowait (I/O)
  11. 5. Opportunities for Cluster Management (1/2): GPU Sharing
    • Goal: high resource utilization and faster task completion
    • GPU sharing
      • GPU sharing contributes to resource savings (see the packing sketch below)
      • GPU memory can be partitioned, but SMs cannot, so they are time-shared
      • Does SM contention hurt performance? In practice GPU sharing is rare, so severe SM contention should not occur in the cluster
    • Heavily loaded GPUs are only about 4.5%-7% of the total (and only 5% of those host GPU-shared instances)
    • Comparison with vs. without GPU sharing
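As a rough illustration of why GPU sharing saves resources, here is a small first-fit packing sketch (my own example, not the paper's algorithm; the fractional requests are invented) comparing shared GPUs against dedicating one whole GPU per instance:

```python
# A minimal sketch of packing fractional GPU requests onto shared GPUs.
def pack_shared(gpu_fractions):
    """First-fit: place each fractional request onto the first GPU with room."""
    gpus = []  # remaining capacity of each shared GPU, 1.0 = one full GPU
    for f in gpu_fractions:
        for i, free in enumerate(gpus):
            if free >= f:
                gpus[i] = free - f
                break
        else:
            gpus.append(1.0 - f)  # open a new GPU
    return len(gpus)

requests = [0.25, 0.5, 0.1, 0.3, 0.6, 0.2, 0.05, 0.4]  # invented fractional requests
dedicated = len(requests)       # one whole GPU per instance -> 8 GPUs
shared = pack_shared(requests)  # packed onto shared GPUs    -> 3 GPUs here
print(dedicated, shared)
```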
  12. 5. Opportunities for Cluster Management (2/2): Predictable Duration for Recurring Tasks
    • If task run times can be estimated, scheduling becomes much more efficient
      • Tasks (both training and inference) are recurring, so they are easy to predict
      • Fig. 12 shows the actual duration estimates; task-group metadata is very effective
      • Scheduling is then simulated using these estimates (Fig. 13)
      • SJF (Shortest Job First) yields shorter completion times than FIFO
      • Using multiple task-group features is effective
    • At least half of the tasks recur five or more times
    • Tasks are classified by a hash of their metadata (see the sketch below)
    • Fig. 12: error of the duration estimates; Fig. 13: task completion time when scheduling with the estimated durations
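A minimal sketch of the idea on this slide, assuming hypothetical metadata fields and a median-based estimator (the paper's actual features and predictor may differ): group recurring tasks by a hash of their metadata, predict duration from past runs of the same group, then order the queue shortest-job-first instead of FIFO.

```python
# A rough sketch; field names and the grouping key are assumptions, not PAI's implementation.
import hashlib
from collections import defaultdict
from statistics import median

def task_group_key(user: str, entry_script: str, cmd_args: str) -> str:
    """Classify a task by hashing its (assumed) metadata fields."""
    return hashlib.sha1(f"{user}|{entry_script}|{cmd_args}".encode()).hexdigest()

history: dict[str, list[float]] = defaultdict(list)  # group key -> past durations (s)

def record_run(key: str, duration: float) -> None:
    history[key].append(duration)

def predict_duration(key: str, default: float = 3600.0) -> float:
    """Median of previous runs; fall back to a default for unseen groups."""
    runs = history[key]
    return median(runs) if runs else default

def sjf_order(pending: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """pending = [(task_id, group_key)]; shortest predicted duration first."""
    return sorted(pending, key=lambda t: predict_duration(t[1]))

# Example: a recurring CTR-training group vs. an unseen group
k = task_group_key("alice", "train_ctr.py", "--epochs 10")
record_run(k, 1200.0); record_run(k, 1300.0)
print(sjf_order([("taskA", k), ("taskB", task_group_key("bob", "train_big.py", ""))]))
```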
  13. 6. Challenges of Scheduling (2/6): High-GPU tasks
    • Business-critical requirements
      • Tend to use large GPU memory and NVLink
      • NLP: most request one or more GPUs but use about 0.4 GPU
      • Image classification: includes a huge fully connected layer (ResNet-100k, whose final layer has 100k outputs)
        • Use case: classifying products from images
      • GPU locality (NVLink) is effective
  14. 6. Challenges of Scheduling (3/6): Low-GPU tasks
    • Training and inference of CTR prediction models
      • 25% of instances do training (GPU), 75% do inference (CPU)
      • DeepFM, DCN, DNN
      • Heavy CPU use -> prone to becoming CPU noisy neighbors
    • GNN (Graph Neural Network): relatively CPU-heavy processing; some models also use GPUs
    • RL (reinforcement learning): uses CPU and network across ~1k instances; does not use GPUs
  15. 6. Challenges of Scheduling (4/6): Deployed Scheduling Policies
    • Reserving-and-Packing: decide in advance which GPUs a task should use, based on what the task is (see the sketch below)
      • Tasks are classified by parallelism, (estimated) ML model, size, etc.
      • High-GPU: V100 w/ NVLink
      • Low-GPU: T4 and similar
      • High-GPU tasks are scheduled with priority
    • Is it better to spread tasks across servers?
      • R&P reduces waiting time for High-GPU tasks (but the opposite holds for Low-GPU tasks)
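A minimal sketch of the reserving-and-packing idea as I read it from this slide (thresholds, pool names, and the Task fields are assumptions, not the production scheduler): classify each task as high-GPU or low-GPU and route it to a pre-reserved machine pool, scheduling high-GPU tasks first.

```python
# A sketch under assumed thresholds, not Alibaba PAI's actual policy.
from dataclasses import dataclass

@dataclass
class Task:
    gpus: float          # requested GPUs
    needs_nvlink: bool   # e.g. large models with heavy inter-GPU traffic

def classify(task: Task) -> str:
    """High-GPU tasks go to reserved V100 machines with NVLink; the rest to T4-class machines."""
    if task.gpus >= 1.0 or task.needs_nvlink:
        return "high-gpu-pool"   # reserved V100 w/ NVLink
    return "low-gpu-pool"        # T4 and similar

def schedule(tasks: list[Task]) -> list[tuple[str, Task]]:
    """High-GPU tasks are considered first, since they are the hardest to place."""
    ordered = sorted(tasks, key=lambda t: classify(t) != "high-gpu-pool")
    return [(classify(t), t) for t in ordered]

print(schedule([Task(0.25, False), Task(8, True), Task(0.5, False)]))
```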
  16. 6. Challenges of Scheduling (5/6): Open Challenges
    • The CPUs available per GPU and the CPUs tasks "want" per GPU do not match (a small worked example follows below)
      • On 8-GPU machines, CPUs run short
      • On 2-GPU machines, CPUs are left over (~= GPUs run short)
      • Across all nodes, specs and requests are roughly balanced, so better scheduling should be able to reduce the skew
    • Weak GPU machines are crowded: tasks tend to get scheduled there
    • Strong GPU machines sit idle: their utilization is poor
    • (In the figure, each horizontal line is one instance)
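A tiny worked example of the provisioned-vs-desired CPU-per-GPU gap; the machine specs and per-GPU demand below are invented for illustration only.

```python
# Assumed numbers chosen to mirror the slide's claim: 8-GPU machines want
# roughly twice their onboard CPUs, while half suffices on 2-GPU machines.
machines = {
    "8-GPU node": {"vcpus": 96, "gpus": 8},   # assumed spec
    "2-GPU node": {"vcpus": 96, "gpus": 2},   # assumed spec
}
desired_cpu_per_gpu = 24  # assumed aggregate demand per GPU

for name, m in machines.items():
    provisioned = m["vcpus"] / m["gpus"]
    verdict = "CPU short" if provisioned < desired_cpu_per_gpu else "CPU surplus"
    print(f"{name}: {provisioned:.0f} vCPU/GPU provisioned, "
          f"{desired_cpu_per_gpu} desired -> {verdict}")
# 8-GPU node: 12 vCPU/GPU provisioned, 24 desired -> CPU short (wants ~2x onboard CPUs)
# 2-GPU node: 48 vCPU/GPU provisioned, 24 desired -> CPU surplus (half is enough)
```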
  17. 6. Challenges of Scheduling (6/6): Open Challenges
    • The CPU is a suspected bottleneck
      • Longer-running tasks tend to use more CPU
      • No such trend for GPUs
      • The same pattern shows up when looking only at CTR tasks
      • CPU contention correlates with task run time
      • The classic overcommit / noisy-neighbor situation?
      • Scheduling may need to consider CPU, GPU, memory, I/O, and network together
  18. 8. Related work
    • GPU sharing
      • NVIDIA MIG (Multi-Instance GPU): only available on A100 GPUs
      • Time-slicing via the CUDA API -> context-switch overhead
      • MPS (NVIDIA Multi-Process Service): cannot isolate faults between concurrently running processes
      • Framework-level support: requires rewriting code per framework, often supports training only, etc.
    • Many GPU schedulers have appeared, but they do not support GPU sharing
    • Network/I/O: loading large-scale data; techniques to reduce communication overhead (Ring AllReduce), etc.
  19. 9. Conclusion
    • Workload characterization based on two months of production traces from the Alibaba PAI GPU cluster
      • Most tasks use large numbers of instances (gang scheduling)
      • A small number of critical tasks use high-end GPUs w/ NVLink
      • The CPU tends to become the bottleneck
    • Scheduler improvements
      • GPU sharing, task classification (reserving-and-packing), and reducing load imbalance