#44 “Predictable vFabric on Informative Data Plane”

cafenero_777

January 29, 2023

Transcript

  1. Agenda
    • Target paper
    • Overview and why I chose to read it
    1. INTRODUCTION 2. MOTIVATION 3. DESIGN 4. IMPLEMENTATION 5. EVALUATION 6. DISCUSSION 7. RELATED WORK 8. CONCLUSION
  2. Target paper
    • Predictable vFabric on Informative Data Plane
    • Shuai Wang, et al.
    • Tsinghua University, Alibaba Group, Zhongguancun Laboratory, University of Cambridge
    • ACM SIGCOMM ’22
    • https://conferences.sigcomm.org/sigcomm/2022/program.html
    • https://dl.acm.org/doi/abs/10.1145/3544216.3544241
    • https://www.youtube.com/watch?v=93-p-x2cdQo&ab_channel=ACMSIGCOMM
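  3. 1. INTRODUCTION
    • Multi-tenant DC
      • Tenants are isolated with VMs and VFs (Virtual NW Fabric = each tenant's virtual network)
      • Challenges: bandwidth guarantees, dynamic bandwidth allocation, tail latency
      • e.g., AWS EBS: requires 100ms, 1ms (tail)
      • e.g., distributed machine learning: the load shifts from compute to the network, and responses are needed within 10ms
    • E2E bandwidth guarantee: pipes spread for utilization efficiency (src/dst paths) != bandwidth-guaranteed pipes. Naively allocating onto the former drags every rate down.
    • Deciding heuristically rules out fine-grained dynamic control -> do it with programmable X
      • μFAB: uses programmable NICs/switches to realize bandwidth guarantees, speed up dynamic bandwidth allocation, and cut tail latency
      • The core (switch) feeds information to the edge (NIC) via INT, and the NIC side does path selection and bandwidth control
      • Challenges: sudden traffic changes; tolerating synchronized burst/incast traffic; countering bandwidth and path flapping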
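  4. 2. MOTIVATION (1/2) Practical challenges for predictable VFs
    • DCNs are mostly best-effort
      • Even with low utilization and ECMP spreading, short bursts -> tail latency
    • Real example, Fig. 1, 2: congestion at the SW -> congestion at PCIe -> backlog in the NIC queue
    • Real example, Fig. 3: the ECMP hash polarization problem
    • Figure notes:
      • Average bandwidth is fine, but tail latency grows by about 50x
      • SA: Storage Agent -> BA: Block (Replica) Agent -> CS: Chunk Server <- GC: Garbage Collection (Merge) service
      • Average bandwidth is stable, but tail latency rises by about 10x
      • Hash polarization: because upper- and lower-tier switches use the same hash algorithm, bandwidth becomes skewed across each group of uplink ports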
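The hash polarization note above can be illustrated with a small simulation. This is a minimal sketch, not the paper's setup: the two-tier topology, the use of SHA-256 as a stand-in for the switch hash, and the names `ecmp_port` and `tier2_usage` are all assumptions made for illustration.

```python
import hashlib

def ecmp_port(flow_id: int, n_ports: int, seed: int) -> int:
    """Pick an uplink from a hash of (seed, flow id); SHA-256 stands in for the switch hash."""
    digest = hashlib.sha256(f"{seed}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_ports

def tier2_usage(n_flows: int, n_ports: int, tier1_seed: int, tier2_seed: int) -> list:
    """Count flows per tier-2 uplink, for the flows that tier 1 hashed to its port 0."""
    usage = [0] * n_ports
    for flow_id in range(n_flows):
        if ecmp_port(flow_id, n_ports, tier1_seed) == 0:
            usage[ecmp_port(flow_id, n_ports, tier2_seed)] += 1
    return usage

# Same hash at both tiers: every flow reaching the tier-2 switch hashes to port 0 again.
polarized = tier2_usage(10_000, 4, tier1_seed=0, tier2_seed=0)
# Independent hash at tier 2: the same flows spread over all uplinks.
balanced = tier2_usage(10_000, 4, tier1_seed=0, tier2_seed=1)
```

With identical hashing at both tiers, three of the four tier-2 uplinks carry nothing, which is the skew the slide describes.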
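  5. 2. MOTIVATION (2/2) State-of-the-art solutions cannot help out
    • PicNIC: can guarantee bandwidth at the edge, but the fabric is out of scope
    • Seawall: allocates bandwidth across the whole network with weighted CC, but network-wide synchronization is slow (tens of ms)
    • Clove: picks routes according to path bandwidth. It has no tenant information, so collisions are possible
    • Experiment with an incast scenario
      • Different tenants each send 500Mbps to the same host -> tail latency degrades
      • Optimizing for bandwidth tends to increase delay
      • Bandwidth guarantees are also hard to meet
    • Solution: integrate edge and core (μFAB)
      • Propagate information from the core (switches) to the edge (hosts)
      • Control sending at the edge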
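  6. 3. DESIGN Design goals and assumptions
    • Model: full-bisection fabric
      • Uses Guarantee Partitioning (GP) between VMs
    • Design goals
      • Minimum bandwidth guarantee: the VF quickly provides each vNIC with its minimum bandwidth
      • Dynamic bandwidth acquisition: when bandwidth is available, the vNIC gets more than its minimum
      • Tail-latency control: bound vNIC-to-vNIC E2E delay under bursty traffic
    • What is assumed
      • Programmable switches are used. The DCN topology is known, and a trivially feasible allocation of the minimum bandwidths exists.
    • What is not assumed
      • No congestion-free network core. Single-path or multi-path, and oversubscription, are all acceptable.
      • No perfect load balancing (hashing is skewed). No particular traffic pattern (micro bursts, incast, and long sessions all occur).
      • No large switch queues (priority queues). Only a single queue is assumed.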
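  7. 3. DESIGN System overview
    • Critical telemetry data
      • Link capacity
      • Queue size: the current state of the switch queue
      • TX rate: the actual output rate. Combined with link capacity and queue size, it lets you measure the time to converge to the target utilization
      • Total reserved bandwidth: sum of the minimum bandwidth guarantees of all VFs traversing the link
      • Total sending window: sum of the admission (send-permission) windows of all VFs traversing the link; weight adjustment
    • Architecture and workflow
      • The host (edge) sends probe packets. Switches (core) along the path append to the probe
      • μFAB-E sends local VF information (minimum bandwidth and sending window)
      • μFAB-C uses INT to embed aggregated VF information (total reserved bandwidth, total sending window) and network information (link capacity, queue size, TX rate) into the probe. The response does the reverse.
      • Probe and response are compared to set the VM pair's minimum bandwidth guarantee, or to trigger path migration
    • μFAB-E: edge side; μFAB-C: core side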
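  8. 3. DESIGN μFAB-E: bandwidth allocation
    • Minimum bandwidth guarantee r
      • Strategy: split bandwidth by tokens, per VM-to-VM flow
      • phi (token) is aggregated by μFAB-E/C
    • Bandwidth upper bound R (work conservation)
      • Strategy: also take the bandwidth at the switches into account
      • Always using r is inefficient. When more is available, use the upper bound R
    • Formula annotations for r: lower bound on BW through path l for a given VM-to-VM flow; token (a->b); capacity of link; total token (aggregated by μFAB)
    • Formula annotations for R: upper bound on BW through path l for a given VM-to-VM flow; token (a->b). Because the topology is Clos, the host uses the minimum R_l over all links. R_l is the sum of the active VM-to-VM traffic bandwidth on l.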
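The formula image for r did not survive in this transcript, only its annotations (token for a->b, link capacity, total token, minimum over links). The sketch below shows one reading consistent with those labels: a token-proportional share of each link's capacity, with the host taking the tightest link. The proportional form itself, and the names `Link` and `min_guarantee`, are assumptions rather than the paper's exact definition.

```python
from dataclasses import dataclass

@dataclass
class Link:
    capacity_gbps: float  # C_l: capacity of the link
    total_token: float    # Phi_l: sum of tokens of all VM pairs on the link (assumed aggregate)

def min_guarantee(token_ab: float, path: list) -> float:
    """Assumed form: r = min over links of C_l * phi_ab / Phi_l."""
    return min(link.capacity_gbps * token_ab / link.total_token for link in path)

# Hypothetical path: a 100G link with 50 tokens reserved, then a 40G link with 10 tokens.
path = [Link(capacity_gbps=100.0, total_token=50.0),
        Link(capacity_gbps=40.0, total_token=10.0)]
r_ab = min_guarantee(token_ab=5.0, path=path)  # 100*5/50 = 10 vs 40*5/10 = 20 -> 10
```

Taking the minimum over links matches the slide's note that the host uses the smallest per-link value along the path.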
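  9. 3. DESIGN μFAB-E: traffic admission
    • Avoiding queuing in the core
      • Considering bandwidth alone still admits microbursts -> tail latency degrades
      • Window-based flow control is used alongside it
      • Put another way: control the total inflight traffic
    • Window size W
      • Used to drive switch queues to zero
      • Built from information supplied by μFAB-E/C
    • Formula annotations: window size of a->b traffic on a given link; T_ab: base RTT; real-time queue size; as with bandwidth, the minimum over all links l is used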
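  10. 3. DESIGN μFAB-C: path migration/informative core
    • Path migration
      • 1. When the current path becomes unusable: migrate quickly (about 5 RTTs)
      • 2. When the network as a whole needs more resources: migrate infrequently (about every 30 seconds)
    • Path selection:
      • Case 1: choose at random among the paths l that satisfy the lower-bound bandwidth
      • Case 2: choose from the paths that satisfy the upper-bound bandwidth
    • Avoiding oscillations
      • Paths can oscillate between being selected and deselected
      • Dampened by letting path selection fire only after a random wait within [1, N] RTTs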
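The case-1 selection rule and the randomized wait from this slide can be sketched as below. This is an illustrative sketch only: the per-path "available bandwidth" map, the function names, and the use of a seeded `random.Random` are hypothetical and not taken from the paper.

```python
import random
from typing import Dict, Optional

def pick_path(available_bw: Dict[str, float], min_bw: float,
              rng: random.Random) -> Optional[str]:
    """Case 1: choose uniformly at random among paths that satisfy the lower bound."""
    feasible = sorted(p for p, bw in available_bw.items() if bw >= min_bw)
    return rng.choice(feasible) if feasible else None

def migration_wait_rtts(n: int, rng: random.Random) -> int:
    """Random wait in [1, N] RTTs so that senders do not all migrate at the same instant."""
    return rng.randint(1, n)

rng = random.Random(0)
paths = {"p1": 1.0, "p2": 3.0, "p3": 5.0}   # hypothetical available bandwidth in Gbps
chosen = pick_path(paths, min_bw=2.0, rng=rng)
wait = migration_wait_rtts(8, rng)
```

Randomizing both the choice and the delay is what keeps many VM pairs from jumping onto the same path in the same RTT, which is the oscillation the slide warns about.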
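  11. 3. DESIGN μFAB-C: informative core
    • tx_l, q_l, and C_l can be read directly
    • phi_l and W_l must be computed and stored
      • It is not obvious when a VM-to-VM flow starts or ends in the first place
      • The edge keeps the state and forwards it (in probes), and the core keeps it too
      • A Bloom filter saves hardware resources on the core side
      • Handling false positives
        • The extra bandwidth is small (5%), so the impact is minor. If that is still not enough, path migration kicks in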
  12. 4. IMPLEMENTATION μFAB-E at smart NICs
    • ARM SoC SmartNIC version: supports L4/L7 transparently
      • 10Gbps, DPDK, C++ 8k LoC
    • FPGA-based SmartNIC version: used to evaluate complexity, efficiency, and performance. Alveo U200, 100Gbps
      • Verilog 18k LoC, Verbs I/F with DMA
    • Barefoot Tofino
      • P4 3k LoC, Python 3k LoC
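  13. 4. IMPLEMENTATION μFAB-E at smart NICs
    • Workflow (on transmit)
      • Store the VM-pair state in the CT and fetch the NH from the PM
      • Fetch the disc via the API (through DMA)
      • Schedule into WFQ. Fetch via DMA and TX
    • Scaling to a large number of VFs
      • Hierarchical WFQ consumes too many resources -> WQ is limited to 8
      • Tenants are allowed to share queues (scalability vs. cost)
    • Scalable probing scheme
      • Sending probes in a simple loop makes them grow in proportion to the number of VM pairs
      • Suppressed by sending a probe only after the response has arrived and a set amount of traffic (e.g., 1 MTU) has been sent
    • Supports 8k VM pairs and 1k tenants
    • Figure notes:
      • VM-pair level: the scheduler has a queue per VM pair
      • VF level: VM pairs of the same VF belong to the same weighted VF queue group
      • Context Table: holds the state of active VM pairs
      • Path Monitor: discovers and holds paths. Used for migration.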
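The probing rule above (probe only after the previous response has arrived and a set amount of data has been sent) can be sketched as a small per-VM-pair state machine. This is a minimal sketch under assumptions: the class name `ProbeClock`, the 1500-byte threshold, and the choice not to count bytes sent while a probe is outstanding are illustrative and not confirmed by the slides.

```python
class ProbeClock:
    """Per-VM-pair probe pacing: at most one outstanding probe, clocked by data sent."""

    def __init__(self, bytes_per_probe: int = 1500):  # 1 MTU, as the slide's example
        self.bytes_per_probe = bytes_per_probe
        self.awaiting_response = False
        self.bytes_since_response = 0

    def on_response(self) -> None:
        """A probe response arrived; start counting data bytes again."""
        self.awaiting_response = False
        self.bytes_since_response = 0

    def on_data_sent(self, nbytes: int) -> bool:
        """Record sent data; return True when a new probe should be emitted."""
        if self.awaiting_response:
            return False  # assumption: never more than one probe in flight
        self.bytes_since_response += nbytes
        if self.bytes_since_response >= self.bytes_per_probe:
            self.awaiting_response = True
            return True
        return False

clock = ProbeClock()
events = [clock.on_data_sent(1000), clock.on_data_sent(500), clock.on_data_sent(1500)]
clock.on_response()
events.append(clock.on_data_sent(1500))
```

Because an idle VM pair never calls `on_data_sent`, it emits no probes, so probe volume follows active traffic rather than the number of configured VM pairs.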
  14. 4. IMPLEMENTATION μFAB-C at programmable switches
    • Telemetry
      • Reads the probe contents and writes back tx, W_l, and so on. About 100 bytes.
    • Information summary techniques
      • 2-way hashing Bloom filter of 20KB
      • False positives < 5%, supports 20k VM pairs
    • Handling silently inactive VM-pairs
      • Aged out after 10 seconds to prevent degradation
    • Resource Usage
      • Uses 20% of hardware resources to support 20k VMs, and it can scale further.
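As a sanity check on the Bloom filter figures on this slide, the standard false-positive approximation for a Bloom filter, (1 - e^(-kn/m))^k, can be evaluated with the slide's numbers. Two readings are assumptions here: that "2-way hashing" means k = 2 hash functions, and that 20KB means 20 × 1024 bytes.

```python
import math

def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation: (1 - exp(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

m_bits = 20 * 1024 * 8   # 20KB filter (assumed to be 20 KiB)
n_pairs = 20_000         # VM pairs tracked, from the slide
fpr = bloom_false_positive_rate(m_bits, n_pairs, k_hashes=2)
print(f"estimated false-positive rate: {fpr:.2%}")
```

Under these assumptions the estimate lands a little under 5%, in line with the bound quoted on the slide.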
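  15. 5. EVALUATION Setup/Microbenchmarks
    • Testbed: see the figure on the right. 8 servers, 10 P4 switches
    • NS3 simulation: 512 servers, FatTree 16, 32 switches
    • Baselines: PicNIC0+WCC+Clove, ElasticSwitch+Clove
    • Microbenchmark
      • Src POD1 VF1/2/3 -> Dst POD2 VF1/2/3, 1G/2G/5G
      • VF4 is added partway through / 20ms
    • Figure notes:
      • Fast convergence; minimum bandwidth is guaranteed
      • Slow convergence; the bandwidth guarantee holds only for VF3
      • Only the minimum bandwidth, even when there is headroom
      • Fraction of bandwidth-guarantee violations: ~0%, 40%, 10%
      • μFAB converges quickly, so queue usage is low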
  16. 5. EVALUATION Application-level performance
    • Multiple ECS tenants
      • Memcached (for measuring latency, 2KB/res) and MongoDB (for measuring bandwidth, 500KB/res)
      • 2.5x improvement in QPS, 20x in QCT
    • Multiple tasks in EBS
      • SA(64KB/320us)->BA->*3CS <- GC/1ms
      • 20-30x improvement in total latency
    • SA: Storage Agent -> BA: Block (Replica) Agent -> CS: Chunk Server <- GC: Garbage Collection (Merge) service
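  17. 5. EVALUATION Hardware performance and scalability/Convergence in large scale
    • Hardware performance (100G FPGA version)
      • Traffic injected every 10ms -> converges and stabilizes right away
      • On failure, traffic also moves to another path right away
      • Probe bandwidth overhead is about 1%
    • Convergence comparison (NS3)
      • Dynamic workload: 90:1 with 500Mbps/4ms, ∞/4ms
        • Overshoot improved by 27x; maximum delay is 16us
      • Realistic workload: bandwidth-guarantee violations explode under high load
        • ES has low dissatisfaction (because it goes below the minimum guarantee) but worse latency
    • Sensitivity analysis
      • Under high load, using a wait time shortens migration convergence time (stable, no oscillation)
      • Stays stable even with slow probing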
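  18. 6. DISCUSSION
    • Explicit bandwidth allocation: rates are maintained from the edge's probes. The optimal path cannot be found.
    • Number of underlay paths: μFAB is effective even with few paths (more can be added)
    • Bandwidth token allocation: tokens can be allocated beyond the boundary (?)
    • Incast probing delay impact: simulated at 128:1 with no problems
    • Coexistence with other control: μFAB runs on the BM. It can coexist with congestion control on the VM side
    • Comparison with centralized approaches: latency may degrade for short flows
    • Programmable hardware: every host must run μFAB-E (otherwise information is incomplete). Could μFAB-C also run in SW?
    • Slide note: points I found insightful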
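  19. 7. RELATED WORK
    • Bandwidth:
      • FairCloud/NetShare: need a queue per VM/tenant on the SW side
      • PicNIC/EyeQ/GateKeeper: can allocate bandwidth, but assume a network without congestion or "log" [sic]
    • Latency:
      • Prior work exists using queue reduction, traffic prioritization, and so on
      • None of it targets multi-tenant DCNs
    • Informative core:
      • Existing work pulls information from the switch side (HPCC for congestion control, Clove for load balancing, NetSeer for monitoring)
      • μFAB uses both edge and core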
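  20. Impressions after finishing the paper
    • The DCN feels like one giant TCP state machine.
      • That they built the whole thing is seriously impressive
    • I wish SmartNICs shipped with this kind of functionality
      • But then, that is what ECN, DCTCP, RDMA, RoCEv2, ... are
    • (Operationally) performance limits and capacity planning may become harder to see
      • Does it play a TCP-like role for the hardware infrastructure layer?
    • It was hard for me. The range of technologies involved is both broad and deep.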
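  21. 3. DESIGN Bounding the worst-case latency
    • Simultaneous burst requests -> simultaneous allocation -> simultaneous congestion
      • When there is spare bandwidth, the question is which VM pair gets to use how much of it
    • Two-stage traffic admission
      • Ramp up quickly to the bandwidth guarantee, then slowly toward work conservation.
    • Figure notes: e.g., at the start of communication; e.g., when bandwidth drops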