Slide 1

Slide 1 text

Research Paper Introduction #44 “Predictable vFabric on Informative Data Plane” (cumulative #113) @cafenero_777 2022/01/26

Slide 2

Slide 2 text

Agenda
• Target paper
• Overview and why I chose to read it
1. INTRODUCTION
2. MOTIVATION
3. DESIGN
4. IMPLEMENTATION
5. EVALUATION
6. DISCUSSION
7. RELATED WORK
8. CONCLUSION

Slide 3

Slide 3 text

Target paper
• Predictable vFabric on Informative Data Plane
  • Shuai Wang, et al.
  • Tsinghua University, Alibaba Group, Zhongguancun Laboratory, University of Cambridge
  • ACM SIGCOMM ’22
  • https://conferences.sigcomm.org/sigcomm/2022/program.html
  • https://dl.acm.org/doi/abs/10.1145/3544216.3544241
  • https://www.youtube.com/watch?v=93-p-x2cdQo&ab_channel=ACMSIGCOMM

Slide 4

Slide 4 text

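Overview and why I chose to read it
• Overview
  • Solves the networking problems of Alibaba's multi-tenant environment
  • μFAB: SmartNICs and switches cooperate to assign a suitable path and bandwidth to each flow
  • Improves QPS by 2x and tail latency by 21x
• Why I chose to read it
  • SmartNIC + P4 switch in a DC network
  • Large-scale use case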

Slide 5

Slide 5 text

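1. INTRODUCTION
• Multi-tenant DC
  • Isolation by VM and VF (Virtual NW Fabric = each tenant's virtual network)
  • The challenges are bandwidth guarantees, dynamic bandwidth allocation, and tail latency
  • Example: AWS EBS requires 100ms, 1ms (tail)
  • Example: distributed machine learning: the load shifts from compute to the network, and responses are needed within 10ms
• E2E bandwidth guarantee: traffic spread over pipes (src/dst paths) chosen for utilization efficiency != pipes that guarantee bandwidth. Naively allocating onto the former lowers every rate.
• Deciding heuristically rules out fine-grained dynamic control -> do it with programmable X
  • μFAB: uses programmable NICs/switches to realize bandwidth guarantees, speed up dynamic bandwidth allocation, and cut tail latency
  • The core (switch) feeds information to the edge (NIC) via INT, and the NIC side performs path selection and bandwidth control
  • Challenges: sudden traffic changes; tolerating synchronized burst/incast traffic; preventing bandwidth and path flapping.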

Slide 6

Slide 6 text

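2. MOTIVATION (1/2) Practical challenges for predictable VFs
• DCNs are mostly best-effort
  • Low utilization targets, ECMP spreading, short bursts -> tail latency
• Real example, Fig. 1, 2: congestion at the SW -> congestion at PCIe -> packets pile up in the NIC queue
• Real example, Fig. 3: the ECMP hash polarization problem
(Figure annotation) Average bandwidth is fine, but tail latency increases by about 50x
(Figure annotation) SA: Storage Agent -> BA: Block (Replica) Agent -> CS: Chunk Server <- GC: Garbage Collection (Merge) service
(Figure annotation) Average bandwidth is stable, but tail latency rises by about 10x
(Figure annotation) Hash polarization: upper- and lower-tier switches use the same hash algorithm, so bandwidth across each group of uplink ports becomes skewed.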

Slide 7

Slide 7 text

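2. MOTIVATION (2/2) State-of-the-art solutions cannot help out
• PicNIC: can guarantee bandwidth at the edge, but the fabric is out of scope
• Seawall: allocates bandwidth network-wide with Weighted-CC, but global synchronization is slow (tens of ms)
• Clove: picks a route according to path bandwidth. It has no tenant information, so tenants may collide
• Experiment with an incast scenario
  • Different tenants each send 500Mbps to the same host -> tail latency degrades
  • Optimizing for bandwidth tends to increase latency
  • Bandwidth guarantees are also hard to meet
• Solution: integrate edge and core (μFAB)
  • Propagate information from the core (switches) to the edge (hosts)
  • Control sending at the edge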

Slide 8

Slide 8 text

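3. DESIGN Design goals and assumptions
• Model: full-bisection fabric
  • Uses Guarantee Partitioning (GP) between VMs
• Design goals
  • Minimum bandwidth guarantee: the VF quickly provides each vNIC with its minimum bandwidth
  • Dynamic bandwidth acquisition: when bandwidth is available, the vNIC gets more than the minimum
  • Tail-latency suppression: bound vNIC-to-vNIC E2E latency under bursty traffic
• What is assumed
  • Programmable switches are used. The DCN topology is known, and a trivial feasible allocation of the minimum bandwidths exists.
• What is not assumed
  • No congestion-free network core. Single-path or multi-path, and oversubscription is allowed.
  • No perfect load balancing (hashing is skewed). No particular traffic pattern (micro bursts, incast, and long sessions all occur).
  • No large switch queues (priority queues). Only a single queue is assumed.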

Slide 9

Slide 9 text

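3. DESIGN System overview
• Critical telemetry data
  • Link capacity
  • Queue size: the current state of the switch queue
  • TX rate: the actual output rate. Combined with link capacity and queue size, it shows how long convergence to the target utilization will take
  • Total reserved bandwidth: the sum of the minimum bandwidth guarantees of all VFs traversing the link
  • Total sending window: the sum of the admission (permission-to-send) windows of all VFs traversing the link. Used for weight adjustment
• Architecture and workflow
  • The host (edge) sends a probe packet. Switches (core) along the path append to the probe
  • μFAB-E sends local VF information (minimum bandwidth and sending window)
  • μFAB-C uses INT to embed aggregated VF information (total reserved bandwidth, total sending window) and network information (link capacity, queue size, TX rate) into the probe. The response takes the reverse route.
  • The edge compares probe and response, then either sets the VM pair's minimum bandwidth guarantee or performs a path migration
(Figure annotation) μFAB-E: edge side, μFAB-C: core side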
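The probe workflow above can be pictured as a record that grows by one entry per hop. The following is only a minimal sketch: the class names, field names, and the `tightest_hop` helper are hypothetical and chosen for illustration, not taken from the paper or from any real INT header layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HopRecord:
    """Telemetry one core switch appends for one link (illustrative field names)."""
    link_capacity_bps: float
    queue_bytes: int
    tx_rate_bps: float
    total_reserved_tokens: float  # sum of minimum guarantees of all VFs on the link
    total_window_bytes: int       # sum of admission windows of all VFs on the link


@dataclass
class Probe:
    """Probe sent by the edge; carries local VF info plus per-hop records."""
    vm_pair: Tuple[str, str]
    token: float        # local minimum-bandwidth token of this VM pair
    window_bytes: int   # local sending window of this VM pair
    hops: List[HopRecord] = field(default_factory=list)


def switch_on_probe(probe: Probe, link_state: HopRecord) -> Probe:
    """What a core switch does in this sketch: append its link state to the probe."""
    probe.hops.append(link_state)
    return probe


def tightest_hop(probe: Probe) -> HopRecord:
    """Edge-side helper (an assumption): the hop with the least spare capacity."""
    return min(probe.hops, key=lambda h: h.link_capacity_bps - h.tx_rate_bps)
```

In this sketch the edge would read `tightest_hop(probe)` from the response to decide whether the current path can still carry the VM pair's guarantee.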

Slide 10

Slide 10 text

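3. DESIGN μFAB-E: bandwidth allocation
• Minimum bandwidth guarantee r
  • Strategy: divide bandwidth among VM-to-VM communications using tokens
  • μFAB-E/C aggregate phi (tokens)
• Bandwidth upper bound R (work conservation)
  • Strategy: also take the bandwidth available at the switches into account
  • Always using r is inefficient. When more is available, use the upper bound R
(Formula annotation) Lower bound on the bandwidth through link l for a given VM pair: token (a->b), Cap. of link, Total token (aggregated by μFAB)
(Formula annotation) (Because the fabric is a Clos) the host uses the smallest R_l over all links. R_l is the sum of the active VM-to-VM communication bandwidth on l
(Formula annotation) Upper bound on the bandwidth through link l for a given VM pair: token (a->b)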
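As a rough illustration of the token-based lower bound, the sketch below assumes the simplest reading of the slide annotations: a VM pair's share of a link is its token divided by the link's total token, multiplied by the link capacity, and the path-wide guarantee is the minimum over the links. The exact formula was an image on the slide, so this form and the function names are assumptions, not the paper's definition.

```python
from typing import Iterable, Tuple


def min_guarantee(token_ab: float, total_token_l: float, capacity_l: float) -> float:
    """Token-proportional share of one link (assumed form: phi_ab / Phi_l * C_l)."""
    if total_token_l <= 0:
        raise ValueError("total token on a link must be positive")
    return token_ab / total_token_l * capacity_l


def path_guarantee(token_ab: float, links: Iterable[Tuple[float, float]]) -> float:
    """Guarantee along a path: the smallest per-link share.

    `links` is an iterable of (total_token_l, capacity_l) pairs, one per hop.
    """
    return min(min_guarantee(token_ab, total, cap) for total, cap in links)
```

For example, a pair holding 1 token on a 10 Gbps link whose total token is 4 would get 2.5 Gbps on that link under this assumption.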

Slide 11

Slide 11 text

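3. DESIGN μFAB-E: traffic admission
• Avoiding queuing in the core
  • Considering bandwidth alone lets micro bursts through -> tail latency degrades
  • So window-based flow control is used alongside it
  • Put another way, it controls the total inflight traffic
• Window size W
  • Drives the switch queue to zero using [formula image not extracted]
  • Uses the information from μFAB-E/C
(Formula annotation) Window size of the a-b communication on a given link; T_ab: base RTT; realtime queue size
(Formula annotation) As with bandwidth, the minimum over all links l is used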
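The window formula itself did not survive extraction, so the sketch below is not the paper's formula. It only illustrates the general idea named on the slide, using a generic bandwidth-delay-product window that shrinks as the reported queue grows; the `share` parameter and both function names are hypothetical.

```python
from typing import Iterable, Tuple


def admission_window(rate_bps: float, base_rtt_s: float,
                     queue_bytes: float = 0.0, share: float = 1.0) -> float:
    """Generic BDP-style window in bytes (an assumption, not the paper's formula).

    rate_bps * base_rtt_s / 8 is the data that fits in flight without queuing;
    subtracting this pair's share of the standing queue pushes the queue to zero.
    """
    bdp_bytes = rate_bps * base_rtt_s / 8
    return max(0.0, bdp_bytes - share * queue_bytes)


def path_window(rate_bps: float, base_rtt_s: float,
                hops: Iterable[Tuple[float, float]]) -> float:
    """Use the smallest per-link window along the path.

    `hops` is an iterable of (queue_bytes, share) pairs, one per link.
    """
    return min(admission_window(rate_bps, base_rtt_s, q, s) for q, s in hops)
```

Taking the minimum over links mirrors the slide's note that the host uses the smallest value across all links on the path.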

Slide 12

Slide 12 text

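3. DESIGN μFAB-C: path migration/informative core
• Path migration
  • 1. When the current path becomes unusable: migrate quickly (about 5 RTTs)
  • 2. When more resources are needed network-wide: migrate infrequently (about 30 seconds)
• Path selection:
  • Case 1: pick at random among the paths l that satisfy the lower-bound bandwidth
  • Case 2: pick from the paths that satisfy the upper-bound bandwidth
• Avoiding oscillations
  • Selecting and deselecting a path can oscillate
  • Damped by letting path selection fire only after a random delay within [1, N] RTTs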
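The selection and damping rules above can be sketched as follows. This is a minimal illustration under stated assumptions: the candidate paths and their available bandwidth are passed in as a plain dict, and the function names are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_path(candidates: Dict[str, float], required_bps: float,
              rng=random) -> Optional[str]:
    """Pick a random path among those whose available bandwidth meets the requirement.

    `candidates` maps a path name to its available bandwidth in bps.
    Returns None when no path qualifies.
    """
    eligible = [path for path, bw in candidates.items() if bw >= required_bps]
    return rng.choice(eligible) if eligible else None


def migration_delay_rtts(n_max: int, rng=random) -> int:
    """Random wait, in RTTs, drawn uniformly from [1, n_max] to damp oscillation."""
    return rng.randint(1, n_max)
```

Randomizing both the chosen path and the wait keeps many edges from reacting to the same telemetry at the same moment and piling onto one path.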

Slide 13

Slide 13 text

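3. DESIGN μFAB-C: informative core
• tx_l, q_l, and C_l can be read directly
• phi_l and W_l must be computed and stored
  • To begin with, it is not obvious when a VM-to-VM communication starts or ends
  • The edge keeps the state and forwards it (in probes), and the core keeps it too
  • A Bloom filter saves hardware resources on the core side
• Handling false positives
  • The extra bandwidth handed out is small (5%), so the impact is minor. If that is still not acceptable, path migration kicks in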

Slide 14

Slide 14 text

4. IMPLEMENTATION μFAB-E at smart NICs
• ARM SoC SmartNIC version: transparently supports L4/L7
  • 10Gbps, DPDK, C++ 8k LoC
• FPGA-based SmartNIC version: used to evaluate complexity, efficiency, and performance. Alveo U200, 100Gbps
  • Verilog 18k LoC, Verbs I/F with DMA
• Barefoot Tofino
  • P4 3k LoC, Python 3k LoC

Slide 15

Slide 15 text

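4. IMPLEMENTATION μFAB-E at smart NICs
• Workflow (on transmit)
  • Store the VM-pair state in the CT and get the NH from the PM
  • Fetch the disc via the API (through DMA)
  • Schedule into WFQ. Fetch via DMA and transmit
• Scaling to a large number of VFs
  • Hierarchical WFQ consumes too many resources -> WQs are limited to 8
  • Tenants are allowed to share a queue (scalability vs. cost)
• Scalable probing scheme
  • Sending probes in a simple loop makes the probe count grow in proportion to the number of VM pairs
  • Limited by sending the next probe only once a response has arrived and a set amount of traffic (e.g., 1 MTU) has been sent since
• Supports 8k VM pairs and 1k tenants
(Figure annotation) VM-pair level: the scheduler keeps one queue per VM pair
(Figure annotation) VF level: VM pairs in the same VF belong to the same weighted VF queue group
(Figure annotation) Context Table: holds the state of active VM pairs
(Figure annotation) Path Monitor: explores and stores paths, for migration.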
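The scalable probing rule can be sketched as a small per-VM-pair pacer. The class name and its default threshold are assumptions for illustration; the slide only says "a set amount of traffic (e.g., 1 MTU)".

```python
class ProbePacer:
    """Decides when a VM pair may send its next probe (illustrative sketch).

    A probe is sent only after the previous probe's response has arrived
    and at least `bytes_per_probe` bytes of data have been sent since then.
    """

    def __init__(self, bytes_per_probe: int = 1500):
        self.bytes_per_probe = bytes_per_probe
        self.awaiting_response = False
        self.bytes_since_response = 0

    def on_data_sent(self, nbytes: int) -> bool:
        """Account for sent data; return True when a probe should go out now."""
        if self.awaiting_response:
            return False  # a probe is already in flight
        self.bytes_since_response += nbytes
        if self.bytes_since_response >= self.bytes_per_probe:
            self.awaiting_response = True
            self.bytes_since_response = 0
            return True
        return False

    def on_response(self) -> None:
        """The probe response arrived; data counting may start again."""
        self.awaiting_response = False
```

With this rule an idle VM pair sends no probes at all, so probe load follows the active traffic rather than the number of configured pairs.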

Slide 16

Slide 16 text

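4. IMPLEMENTATION μFAB-C at programmable switches
• Telemetry
  • Reads the probe contents and writes back tx, W_l, and so on. About 100 bytes.
• Tricks for the information summary
  • 2-way hashing Bloom filter of 20KB
  • False positives < 5%, supports 20k VM pairs
• Handling silently inactive VM-pairs
  • Entries age out after 10 seconds, which prevents degradation
• Resource usage
  • Supporting 20k VMs uses 20% of the hardware resources, and it can scale further.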
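A rough sketch of how a 2-way hashing Bloom filter could track which VM pairs are already counted in a link's total token. It is a software stand-in, not the switch implementation: SHA-256 replaces whatever hardware hash the switch uses, the class names are hypothetical, and the 10-second age-out is not modeled.

```python
import hashlib


class TwoHashBloom:
    """Bloom filter with two hash positions per key (20KB by default)."""

    def __init__(self, size_bytes: int = 20 * 1024):
        self.nbits = size_bytes * 8
        self.bits = bytearray(size_bytes)

    def _positions(self, key: str):
        # Derive two bit positions from one digest (stand-in for two HW hashes).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big") % self.nbits
        h2 = int.from_bytes(digest[8:16], "big") % self.nbits
        return h1, h2

    def add(self, key: str) -> bool:
        """Insert key; return True if it already looked present before the insert."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


class LinkTokenSum:
    """Per-link total token, counting each VM pair once."""

    def __init__(self):
        self.active = TwoHashBloom()
        self.total_token = 0.0

    def on_probe(self, vm_pair: str, token: float) -> float:
        if not self.active.add(vm_pair):
            self.total_token += token  # first probe seen from this pair
        return self.total_token
```

In this sketch a false positive means a new pair's token is never added, so the total is slightly underestimated and the other pairs receive slightly more bandwidth, which is one way to read the slides' remark that the extra bandwidth from false positives is small.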

Slide 17

Slide 17 text

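5. EVALUATION Setup/Microbenchmarks
• Testbed: see the figure on the right. 8 servers, 10 P4 switches
• NS3 simulation: 512 servers, FatTree 16, 32 switches
• Baselines: PicNIC0+WCC+Clove, ElasticSwitch+Clove
• Microbenchmark
  • Src POD1 VF1/2/3 -> Dst POD2 VF1/2/3, 1G/2G/5G
  • VF4 / 20ms is added partway through
(Figure annotation) Fast convergence, minimum bandwidth guaranteed
(Figure annotation) Slow convergence, bandwidth guaranteed only for VF3
(Figure annotation) Only the minimum bandwidth, even when there is headroom
(Figure annotation) Fraction of bandwidth-guarantee violations: ~0%, 40%, 10%
(Figure annotation) μFAB converges quickly, so its queue usage is low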

Slide 18

Slide 18 text

5. EVALUATION Application-level performance
• Multiple ECS tenants
  • Memcached (for measuring latency, 2KB/res) and MongoDB (for measuring bandwidth, 500KB/res)
  • 2.5x improvement in QPS, 20x improvement in QCT
• Multiple tasks in EBS
  • SA(64KB/320us)->BA->*3CS <- GC/1ms
  • 20-30x improvement in total latency
(Figure annotation) SA: Storage Agent -> BA: Block (Replica) Agent -> CS: Chunk Server <- GC: Garbage Collection (Merge) service

Slide 19

Slide 19 text

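5. EVALUATION Hardware performance and scalability/Convergence in large scale
• HW performance (100G FPGA version)
  • Traffic injected every 10ms -> converges and stabilizes right away
  • On failure, it also moves to another path right away
  • The bandwidth overhead of probes is about 1%
• Convergence comparison (NS3)
  • Dynamic workload: 90:1 with 500Mbps/4ms, ∞/4ms
  • Overshoot improved by 27x; the maximum latency is 16us
  • Realistic workload: bandwidth-guarantee violations surge under high load
  • ES shows low dissatisfaction (because it goes below the minimum guarantee), but its latency degrades
• Sensitivity analysis
  • Under high load, using the wait time shortens migration convergence time (it stabilizes without oscillating)
  • It stays stable even when probing slowly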

Slide 20

Slide 20 text

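6. DISCUSSION
• Explicit bandwidth allocation: rates are maintained based on probes from the edge. It cannot find the optimal path.
• Number of underlay paths: μFAB is effective even with few paths (and the number can be increased)
• Bandwidth token allocation: tokens can be allocated beyond the boundary (?)
• Impact of incast probing delay: simulated at 128:1 with no problems
• Coexistence with other control: μFAB runs on BM. It can coexist with congestion control on the VM side
• Comparison with centralized approaches: latency may degrade for short flows
• Programmable HW: every host must run μFAB-E (otherwise information is incomplete). Could μFAB-C also run in software?
(Slide annotation) Highlighted items are the points I found insightful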

Slide 21

Slide 21 text

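7. RELATED WORK
• Bandwidth:
  • FairCloud/NetShare: require a queue per VM/tenant on the switch side
  • PicNIC/EyeQ/GateKeeper: can allocate bandwidth, but assume a network free of congestion and loss
• Latency:
  • Prior work exists that uses queue reduction, traffic prioritization, and similar techniques
  • None of it targets multi-tenant DCNs
• Informative core:
  • Existing work obtains information from the switch side (HPCC for congestion control, Clove for load balancing, NetSeer for monitoring)
  • μFAB uses both edge and core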

Slide 22

Slide 22 text

7. RELATED WORK (comparison table)
(Slide annotation) Highlighted entries are the systems that came up in this talk

Slide 23

Slide 23 text

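8. CONCLUSION
• μFAB: provides predictable VFs in a multi-tenant DCN
  • Fuses an active edge with an informative core
  • Achieves tenant-level bandwidth and latency guarantees with sub-millisecond convergence
  • Implemented on general-purpose SmartNICs and programmable switches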

Slide 24

Slide 24 text

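Thoughts after finishing the paper
• The DCN feels like one giant TCP state machine.
  • It is remarkable that they actually built the whole thing
• I wish SmartNICs shipped with this kind of functionality
  • (That would be ECN, DCTCP, RDMA, RoCEv2, ...)
• (Operationally) performance limits and capacity planning may become harder to see
  • Does it play a TCP-like role in the HW infrastructure layer?
• It was difficult for me. The range of technologies involved is both broad and deep.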

Slide 25

Slide 25 text

Supplementary slides

Slide 26

Slide 26 text

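3. DESIGN Bounding the worst-case latency
• Simultaneous burst requests -> simultaneous allocation -> simultaneous congestion
  • When bandwidth has headroom, the question is how much to let each VM pair use
• Two-stage traffic admission
  • Ramp up quickly to the bandwidth guarantee, then slowly toward work conservation.
(Figure annotation) Example: at the start of communication
(Figure annotation) Example: when bandwidth drops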

Slide 27

Slide 27 text

References
• https://en.wikipedia.org/wiki/Work-conserving_scheduler

Slide 28

Slide 28 text

EoP