Slide 1

Research Paper Introduction #51: "Empowering Azure Storage with RDMA" (overall #128) @cafenero_777 2024/03/21

Slide 2

Target paper
• Empowering Azure Storage with RDMA
• Wei Bai, et al.
• Microsoft
• NSDI '23
• https://www.usenix.org/conference/nsdi23/presentation/bai
• https://www.microsoft.com/en-us/research/publication/empowering-azure-storage-with-rdma/

Slide 3

71 authors! An author list the size of a large-scale physics experiment paper (is the public cloud a large-scale experiment?!)

Slide 4

Agenda
• Target paper
• Overview and why I read it
1. Introduction
2. Background
3. Overview
4. Storage Protocols over RDMA
5. RDMA Estats
6. Switch Management
7. Congestion Control
8. Experience
9. Lessons and Open Problems
10. Related Work
11. Conclusions and Future Work

Slide 5

Overview and why I read it
• Overview
  • Azure uses RDMA for VM-to-storage and storage-to-storage traffic, connected within a region.
  • Shares operational experience with the complexity and interoperability issues of intra-region RDMA.
  • RDMA, which now carries about 70% of traffic, improves disk I/O performance and saves CPU.
• Why I read it
  • I skimmed NSDI '23 earlier and this paper caught my eye.
  • RDMA in production.
  • They apparently use SONiC.

Slide 6

RDMA? RoCEv2?
• Remote DMA (Direct Memory Access)
  • Reduces data-transfer latency and CPU load.
• RoCEv2: RDMA over Converged Ethernet v2
  • Runs RDMA traffic over an Ethernet network.
  • Easy to reuse existing Ethernet infrastructure, but performance collapses once packet loss occurs.
  • Used where low latency is required (storage-to-storage and GPU-to-GPU communication, etc.).

Slide 7

1. Introduction
• In the all-SSD era, the network and storage stack are expected to deliver matching performance.
• Storage over the network
  • A Clos network provides enough bandwidth, but CPU-based packet processing adds latency and CPU load.
  • So offload it with RDMA.
• RDMA between VM and storage clusters, and between storage clusters
  • Intra-region: different DCs (different buildings) within the same region.
• Complexity and heterogeneity
  • NICs support DCQCN, but their implementations differ.
  • Switch vendors differ in their implementations.
  • Long-haul links between DCs increase RTT.
• Fixes and optimizations were applied from the application (L7) down to L2; metrics were extended for status monitoring.
• SONiC unifies the switch software stack; DCQCN/PFC achieve high throughput, low latency, and near-zero loss.
• 2018: backend moved to RDMA; 2019: frontend moved to RDMA.

Slide 8

2. Background (1/2)
• Clos network / ECMP.
• VM and storage clusters are separated.
• Layered PS (Partition Server) / EN (Extent Node) structure; ENs are 3-way redundant (right figure, sketch below):
  • Write: append to the active extent; completes once responses arrive from all ENs.
  • Read: read the extent from any EN and return the response.
  • Others: garbage collection, erasure coding.
• Network tiers
  • T0 (ToR), T1 (a cluster of T0s), DC (a T2 cluster), Region (long-haul links, ~tens of km, to the regional hub RH).
  • Figure notes: latencies range from a few us inside a DC to ~2 ms over long-haul; T0-T2 are box switches with shallow buffers, while the RH uses chassis switches with deep buffers/VoQ.
  • The disk driver in the hypervisor issues the I/O requests; incast is common.
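A minimal sketch of the write/read rule above: an append completes only after all three ENs acknowledge it, while a read can be served by any single EN. Class and method names are illustrative, not the paper's API.

```python
# Hypothetical sketch of the 3-way replicated append/read rule (illustrative names).

class ExtentNode:
    def __init__(self):
        self.extent = []          # replica of the active extent

    def append(self, block):
        self.extent.append(block)
        return True               # ack back to the partition server

    def read(self, offset):
        return self.extent[offset]

class PartitionServer:
    def __init__(self, replicas):
        self.replicas = replicas  # three ENs hold the extent

    def write(self, block):
        # The append completes only when every EN has acknowledged it.
        acks = [en.append(block) for en in self.replicas]
        return all(acks)

    def read(self, offset):
        # Any single replica can serve the read.
        return self.replicas[0].read(offset)

ens = [ExtentNode() for _ in range(3)]
ps = PartitionServer(ens)
assert ps.write(b"data")
assert ps.read(0) == b"data"
```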

Slide 9

2. Background (2/2)
• Benefits of adopting RDMA
  • NVMe SSDs demand comparable network performance: hundreds-of-us latency, tens of Gbps of bandwidth.
  • The TCP/IP stack adds processing latency and CPU load.
  • RDMA: hardware offload gives low latency (a few us), high throughput, and frees CPU.
  • VM-to-storage RDMA has to work at intra-region scope.
• Challenges
  • Reuse existing assets: RoCEv2 over commodity Ethernet, with PFC (Priority-based Flow Control) to avoid packet loss.
  • Challenges of intra-region (inter-cluster) RDMA:
    • Different NIC generations (Gen1-3) implement DCQCN differently.
    • Different switch HW/OS: buffer architectures, sizes, mechanisms, and monitoring all differ; operationally painful.
    • T2-RH latency varies from a few us up to 2 ms ("slow and unpredictable!"). Can PFC cope?
    • Metrics and automatic failover are needed.

Slide 10

3. Overview
• sU-RDMA, sK-RDMA
  • RDMA-based protocols supporting frontend/backend (FE/BE) communication.
• RDMA Estats: provides accurate statistics for RDMA operations.
• The network uses PFC/DCQCN to get high throughput, low latency, and near-zero packet loss.
• SONiC was deployed to unify the switch software stack.
• NIC firmware was updated to unify DCQCN behavior.
• PFC storm mitigation (PFC watchdog).

Slide 11

4. Storage Protocols over RDMA (1/2)
• sU-RDMA
  • RPC protocol stack for storage-to-storage (backend) communication.
  • Socket-like API in user space.
  • Memory (re-)allocation.
  • The receiver pre-posts buffers to accept requests.
  • Sender and receiver agree on the data size.
  • The transfer method changes with available credits to use buffers efficiently (see the sketch below):
    • Small: plain RDMA Send/Recv.
    • Medium: RDMA Write, followed by a Write Done message.
    • Large: request via Send; the peer issues an RDMA Read and returns Read Done.
  • TCP remains available for legacy use and failover.
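A rough sketch of the size- and credit-dependent transfer split described above. The thresholds and function names are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of sU-RDMA's small/medium/large transfer modes.
# The size cutoffs below are assumed for illustration only.

SMALL_MAX = 4 * 1024        # assumed cutoff for plain Send/Recv
MEDIUM_MAX = 64 * 1024      # assumed cutoff for RDMA Write

def choose_transfer_mode(msg_size: int, credits: int) -> str:
    """Pick how to move a message, mirroring the small/medium/large split above."""
    if credits == 0:
        return "wait_for_credit"              # credit-based flow control
    if msg_size <= SMALL_MAX:
        return "send_recv"                    # small: goes through pre-posted recv buffers
    if msg_size <= MEDIUM_MAX:
        return "rdma_write_then_write_done"   # medium: writer pushes data, then signals done
    return "send_then_rdma_read"              # large: receiver pulls data with RDMA Read

print(choose_transfer_mode(1024, credits=8))      # send_recv
print(choose_transfer_mode(1 << 20, credits=8))   # send_then_rdma_read
```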

Slide 12

4. Storage Protocols over RDMA (2/2)
• sK-RDMA
  • RPC protocol stack for VM-to-storage (frontend) communication.
  • Socket-like API in kernel space.
  • Likewise credit-based, with PFC, and both RDMA and TCP.
  • Note that the data transfer is initiated from the storage side.
(Figure legend: blue text = control messages, red text = data messages)

Slide 13

5. RDMA Estats
• Estats: RDMA Extended Statistics
  • Detailed metric visibility on the end-host side.
  • Analogous to TCP netstat.
  • Tx/Rx bytes, NACK counts, etc.
  • Each latency component is histogrammed to analyze successes and failures (sketch below).
  • When high latency occurs, the NIC/QP state can be dumped and analyzed.
(Figure notes: T1 = request posted to the send queue, T5 = response received by the NIC, T6 = received by the CPU; T6-T1 = total operation latency, T5-T1 = NIC/network latency)
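A minimal sketch of the Estats-style latency decomposition from the figure. The timestamp names (T1, T5, T6) follow the slide; the histogram bucketing is an assumed detail, not the paper's implementation.

```python
# Sketch: split a completed RDMA operation into NIC/network vs host-side latency,
# then bucket totals into a coarse histogram, as in "histogram each latency".
from collections import Counter

def latency_breakdown(t1_us: float, t5_us: float, t6_us: float) -> dict:
    total = t6_us - t1_us           # total operation latency (T6 - T1)
    nic_nw = t5_us - t1_us          # NIC + network latency  (T5 - T1)
    host = total - nic_nw           # remaining host-side (PCIe/CPU) latency
    return {"total_us": total, "nic_nw_us": nic_nw, "host_us": host}

def histogram(samples_us, bucket_us=10):
    # Coarse per-bucket counts (assumed 10 us buckets).
    return Counter(int(s // bucket_us) * bucket_us for s in samples_us)

ops = [latency_breakdown(0.0, 12.0, 15.0), latency_breakdown(0.0, 95.0, 110.0)]
print(histogram(op["total_us"] for op in ops))   # Counter({10: 1, 110: 1})
```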

Slide 14

6. Switch Management
• Different switches from different vendors:
  • Diverse requirements slow down technology evolution; different ASICs have different buffer architectures.
  • SONiC: abstraction through SAI; used as a cross-platform NOS to unify operation and management.
• SONiC and RDMA
  • The required features are there: ECN marking, PFC, PFC watchdog, shared buffer model.
  • The shared buffer is partitioned by purpose; buffer sizes and thresholds are configured, e.g., to issue PFC for lossless traffic (sketch below).
  • Example (box switch, shallow buffer): 18 MB shared, 6 MB for lossless headroom.
• Testing
  • Functional tests with PTF (Packet Testing Framework).
  • Buffer/counter/scheduler tests check that TX and RX match, using breakpoints and dumps.
  • Hardware traffic generators with us/ns precision test RED/ECN marking and the PFC watchdog.
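An illustrative sketch of the shared-buffer/PFC relationship above: lossless traffic triggers PFC once an ingress crosses a configured threshold, and the headroom pool absorbs what is still in flight after the PAUSE. The pool sizes follow the slide (18 MB shared, 6 MB headroom); the threshold value itself is an assumption.

```python
# Hypothetical shared-buffer/PFC threshold check for a shallow-buffer box switch.

SHARED_POOL_BYTES = 18 * 1024 * 1024       # shared buffer pool (from the slide)
LOSSLESS_HEADROOM_BYTES = 6 * 1024 * 1024  # lossless headroom pool (from the slide)
PFC_XOFF_THRESHOLD = 2 * 1024 * 1024       # assumed static per-ingress threshold

def should_send_pfc_pause(ingress_lossless_bytes: int) -> bool:
    """Send PFC PAUSE once an ingress port's lossless usage crosses the threshold."""
    return ingress_lossless_bytes >= PFC_XOFF_THRESHOLD

def headroom_ok(in_flight_after_pause_bytes: int) -> bool:
    """Headroom must cover data still arriving after PAUSE is sent (roughly a link RTT)."""
    return in_flight_after_pause_bytes <= LOSSLESS_HEADROOM_BYTES

print(should_send_pfc_pause(3 * 1024 * 1024))  # True: start pausing the upstream
print(headroom_ok(1 * 1024 * 1024))            # True: no lossless drops expected
```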

Slide 15

7. Congestion Control (1/2)
• Use PFC/DCQCN so that RDMA scales even over regional long links.
  • Do not design for the case where all ports are congested simultaneously.
  • T2/RH switches use off-chip DRAM as a deep buffer to absorb RDMA traffic.
  • PFC headroom is provisioned per switch ingress with a static (oversubscribed) threshold, rather than per queue.
  • Based on operational experience; this removes congestion loss and improves burst tolerance.
• DCQCN interoperability
  • DCQCN controls the sending rate per QP.
  • Sender = RP (reaction point), switch = CP (congestion point), receiver = NP (notification point).
  • NIC implementation differences:
    • Gen1: the NP/RP state machines are implemented in firmware, so CNP generation is minimized; because cache misses are easy to trigger, rate limiting is done per packet burst instead of per packet.
    • Gen2/3: DCQCN in hardware; a CNP is issued per ECN-marked packet, and rate limiting is per packet based on CNPs per time unit.
    • Mixing Gen1 with Gen2/3 causes repeated cache misses and degraded performance, and Gen2/3 sends excessive CNPs to Gen1, which over-throttles it.
    • Fix: on Gen2/3, a per-QP CNP limiter spaces consecutive CNPs to Gen1's interval (sketch below), the RP-side time unit is adjusted to match Gen1, and Gen2/3 were changed to rate limit per burst.
(Figure notes: example packet flow with ECN: sender/RP, switch/CP, receiver/NP; an ECN-marked packet triggers a CNP and a reduced rate. Scale: ~100 m / a few us inside a DC on shallow-buffer box switches, ~tens of km / up to 2 ms over long-haul on deep-buffer/VoQ chassis switches.)
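A sketch of the per-QP CNP limiter idea above: an NP that would otherwise emit one CNP per ECN-marked packet instead enforces a minimum spacing between CNPs per QP. The interval value is an illustrative assumption, not the paper's setting.

```python
# Hypothetical per-QP CNP coalescing at the notification point (NP).

CNP_MIN_INTERVAL_US = 50.0   # assumed Gen1-compatible spacing between CNPs

class CnpLimiter:
    def __init__(self):
        self.last_cnp_us = {}    # QP id -> timestamp of the last CNP sent

    def on_ecn_marked_packet(self, qp: int, now_us: float) -> bool:
        """Return True if a CNP should be sent for this ECN-marked packet."""
        last = self.last_cnp_us.get(qp)
        if last is None or now_us - last >= CNP_MIN_INTERVAL_US:
            self.last_cnp_us[qp] = now_us
            return True
        return False             # coalesced: too soon after the previous CNP

np_side = CnpLimiter()
print([np_side.on_ecn_marked_packet(qp=1, now_us=t) for t in (0, 10, 60, 70, 120)])
# [True, False, True, False, True]
```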

Slide 16

7. Congestion Control (2/2)
• Tuning DCQCN
  • Only DCQCN's global parameters can be changed.
  • They want to support only two kinds of RDMA flows (frontend and backend).
  • Process: understand the behavior with a model -> estimate reasonable parameters in lab tests -> finalize parameters in a production-like test environment.
  • DCQCN is rate-based, so there is no RTT unfairness.
  • To let large-RTT flows keep high throughput, make ECN marking sparse: a larger Kmax - Kmin, or a smaller Pmax (sketch below).
  • DCQCN and switch buffers must be tuned together, e.g., raise the buffer thresholds before raising Kmin.
(Figure notes: lab test results with RTTs of 2 us and 1.77 ms; in the end they did not split parameters between intra-DC and inter-DC traffic.)
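A sketch of the RED/ECN marking curve that Kmin, Kmax, and Pmax parameterize: no marking below Kmin, full marking above Kmax, and a linear ramp up to Pmax in between. The parameter values are illustrative, not the deployed settings.

```python
# RED/ECN marking probability as a function of queue depth (illustrative values).

KMIN_KB = 400     # assumed marking start threshold
KMAX_KB = 1600    # assumed marking saturation threshold
PMAX = 0.05       # assumed max probability before saturation

def ecn_mark_probability(queue_kb: float) -> float:
    if queue_kb <= KMIN_KB:
        return 0.0
    if queue_kb >= KMAX_KB:
        return 1.0
    # Linear ramp between Kmin and Kmax; a wider Kmax - Kmin or a smaller Pmax
    # makes marking sparser, which favors flows with large RTTs.
    return PMAX * (queue_kb - KMIN_KB) / (KMAX_KB - KMIN_KB)

for q in (200, 800, 1500, 2000):
    print(q, round(ecn_mark_probability(q), 4))   # 0.0, 0.0167, 0.0458, 1.0
```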

Slide 17

8. Experience: timeline and deployment
• Timeline
  • 2018: backend (inter-storage)
  • 2019: frontend (storage/compute)
  • 2020: intra-region RDMA
  • 2023/02: supported in all regions
• Steps for the production rollout
  • Component testing, then same-stack testing in a testbed.
  • RDMA was rolled into production gradually, along with NIC driver/firmware updates and NOS updates.
  • T0 switches use SONiC fast reboot / warm reboot to keep the outage under one second.
  • NIC driver updates are performed only after traffic is moved to TCP mode.

Slide 18

8. Experience: A/B test results from 2018
• Backend (ran high-TPS workloads)
  • CPU usage (storage app, storage protocol, TCP/IP stack).
  • Backend storage traffic FCT.
• Frontend
  • DiskSpd with 8 KB I/O size at two IOPS levels: up to 34.5% CPU reduction.
  • Tests with production VMs comparing average TCP vs RDMA latency: more than 20% reduction for large I/Os.

Slide 19

8. Experience: problems found and fixed
• FMR hidden fence
  • After the region-scale rollout, throughput was extremely low...
  • With sK-RDMA, the compute node issues Sends.
  • But a hidden fence meant only one Send could be in flight per RTT.
  • The vendor fixed the driver.
• PFC and MACsec
  • Only the long-haul links showed a high drop rate -> PFC frames were being dropped...
  • No specification covered encryption of PFC frames... so they standardized it.

Slide 20

9. Lessons and Open Problems
• Failover is too expensive
  • Implemented in sU/sK-RDMA as a last resort; having to use it means a serious incident.
  • CPU core utilization/load spikes and cores run out, which is itself a risk. Deploy carefully!
• Unifying the host network and the physical switches
  • Intra-host (E-W) vs inter-host (N-S); N-S consumes a lot of resources. How to handle this in the future?
• Switch buffers
  • Small buffers tend to cause performance problems; these were solved with switch parameters rather than DCQCN.
  • Why? To absorb bursty traffic, which is common in DCs.
  • Deep buffers are needed; a more modern (programmable?) buffer management mechanism is desired.

Slide 21

10. Related Work
• Deployment experience of RDMA
  • Bing
  • Alibaba (LUNA/SOLAR: storage backend, user-space TCP, UDP on DPU)
  • AWS SRD (Scalable Reliable Datagram) on the Nitro NIC for HPC/ML/storage
• RDMA improvements in DCs
  • Congestion control: ECN-based, delay-based, INT-based, credit-based
  • Beyond congestion control: deadlock avoidance, multi-path support, security, virtualization, multi-tenancy
• Tools for acceleration
  • Socket, kernel, SmartNIC

Slide 22

11. Conclusions and Future Work
• Intra-region RDMA in Azure
  • About 70% of traffic is RDMA.
  • RDMA is supported in all public regions.
  • Improved disk I/O performance and saved CPU cores at the same time.
• Future improvements planned in the system architecture, HW acceleration, and congestion control.

Slide 23

EoP