#51 “Empowering Azure Storage with RDMA”

cafenero_777

March 24, 2024

Transcript

  1. Paper covered
     • Empowering Azure Storage with RDMA
     • Wei Bai, et al., Microsoft, NSDI '23
     • https://www.usenix.org/conference/nsdi23/presentation/bai
     • https://www.microsoft.com/en-us/research/publication/empowering-azure-storage-with-rdma/
  2. Agenda
     • Paper covered
     • Overview and why I picked it up
     1. Introduction
     2. Background
     3. Overview
     4. Storage Protocols over RDMA
     5. RDMA Estats
     6. Switch Management
     7. Congestion Control
     8. Experience
     9. Lessons and Open Problems
     10. Related Work
     11. Conclusions and Future Work
  3. Overview and why I picked it up
     • Overview
       • Azure uses RDMA between VMs and storage and between storage clusters, connected within a region.
       • The paper shares operational experience with intra-region RDMA: complexity, interoperability, and so on.
       • RDMA carries about 70% of the traffic and improves disk I/O performance while saving CPU.
     • Why I picked it up
       • I had skimmed the NSDI '23 program earlier and this paper caught my eye.
       • RDMA in production.
       • They apparently use SONiC.
  4. RDMA? RoCEv2?
     • Remote DMA (Direct Memory Access)
       • Reduces data-transfer latency and CPU load.
     • RoCEv2: RDMA over Converged Ethernet v2
       • Runs RDMA traffic over an Ethernet network.
       • Easy to reuse existing Ethernet environments, but performance collapses once loss occurs.
       • Used where low latency matters, e.g. storage-to-storage or GPU-to-GPU communication.
  5. 1. Introduction
     • With SSDs everywhere, the rest of the stack must deliver matching performance.
     • Storage over the network
       • A Clos network provides the bandwidth, but CPU-based packet processing adds latency and CPU load.
       • So offload it with RDMA.
     • RDMA between VM and storage clusters, and between storage clusters
       • Intra-region: different DCs (different buildings) within the same region.
     • Complexity and heterogeneity
       • NICs support DCQCN, but the implementations differ.
       • Switch implementations differ across vendors.
       • Long-haul links between DCs increase RTT.
     • Fixes and optimizations from the application (L7) down to L2; richer metrics and status monitoring.
     • SONiC unifies the switch software stack; DCQCN/PFC achieve high throughput, low latency, and near-zero loss.
     • 2018: backend moved to RDMA; 2019: frontend moved to RDMA.
  6. 2. Background (1/2)
     • Clos network / ECMP.
     • VM and storage clusters are separated.
     • Layered PS/EN architecture; each extent is replicated across three ENs (figure on the right).
       • Write: append to the active extent; the write completes once responses arrive from all ENs (see the sketch after this slide).
       • Read: read the extent from any EN and respond.
       • Other work: garbage collection, erasure coding.
     • T0 (ToR); T1: cluster of T0s; DC: T2 cluster; Region: regional hubs (RH) connected by long-haul links (up to tens of km).
     [Figure: latencies range from a few µs to ~2 ms; T0/T1 are box switches with shallow buffers, T2/RH are chassis switches with deep buffers/VoQ; the hypervisor disk driver issues I/O requests; incast is common.]
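
A minimal sketch of the write path described above, assuming a generic all-replica acknowledgment pattern; the extent-node names and RPC helper are hypothetical, not the actual storage code:

    # Sketch: an append to the active extent is complete only after
    # all three extent nodes (ENs) acknowledge it.
    import concurrent.futures

    REPLICAS = ["en-1", "en-2", "en-3"]   # hypothetical extent-node names

    def append_to_en(en: str, data: bytes) -> bool:
        """Placeholder for the RPC that appends `data` to the replica on `en`."""
        return True   # pretend the append succeeded

    def append_extent(data: bytes) -> bool:
        """Durable only when every EN has acknowledged the append."""
        with concurrent.futures.ThreadPoolExecutor(len(REPLICAS)) as pool:
            acks = list(pool.map(lambda en: append_to_en(en, data), REPLICAS))
        return all(acks)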
  7. 2. Background (2/2)
     • Why adopt RDMA
       • The workload demands NVMe/SSD-class performance: latencies of hundreds of µs and tens of Gbps of bandwidth.
       • The TCP/IP stack adds processing latency and CPU load.
       • RDMA: hardware offload gives low latency (a few µs), high throughput, and frees CPU cycles.
       • RDMA between VMs and storage has to work at region scope.
     • Challenges
       • Reuse existing assets: RoCEv2 (RDMA over commodity Ethernet v2) plus PFC (Priority-based Flow Control) to avoid packet loss.
       • Challenges specific to intra-region (inter-cluster) RDMA:
         • Different NICs (Gen1-3) implement DCQCN differently.
         • Different switch HW/OS: buffer architectures, sizes, mechanisms, and monitoring all differ; operationally painful.
         • T2-RH latency varies from a few µs up to ~2 ms, i.e. slow and unpredictable; can PFC headroom cover that?
       • Metrics and automatic failover.
  8. 3. Overview
     • sU-RDMA and sK-RDMA
       • RDMA-based protocols covering frontend and backend traffic.
     • RDMA Estats: precise statistics for RDMA operations.
     • The network uses PFC/DCQCN for high throughput, low latency, and near-zero packet loss.
     • SONiC is deployed to unify the switch software stack.
     • NIC firmware is updated to unify DCQCN behavior.
     • PFC storm mitigation (PFC watchdog; a sketch follows below).
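
A minimal sketch of the PFC-watchdog idea mentioned above, assuming a simple detect/restore state machine; the detection and restoration windows are made-up placeholders, not SONiC's actual implementation:

    # Sketch: if an egress queue stays paused by incoming PFC frames beyond a
    # detection window, stop honoring PFC on that queue (drop instead of pause)
    # until it has been healthy again for a restoration window.
    DETECTION_TIME_S = 0.2    # hypothetical detection window
    RESTORATION_TIME_S = 2.0  # hypothetical restoration window

    class PfcWatchdog:
        def __init__(self):
            self.paused_since = None   # when the queue first appeared stuck
            self.healthy_since = None  # when the queue last started draining
            self.storm = False         # True while drop-mode mitigation is active

        def poll(self, queue_paused: bool, queue_draining: bool, now: float) -> bool:
            """Return True while the queue should run in drop mode (storm detected)."""
            stuck = queue_paused and not queue_draining
            if stuck:
                self.paused_since = self.paused_since or now
                self.healthy_since = None
                if not self.storm and now - self.paused_since >= DETECTION_TIME_S:
                    self.storm = True   # PFC storm: stop pausing, start dropping
            else:
                self.paused_since = None
                self.healthy_since = self.healthy_since or now
                if self.storm and now - self.healthy_since >= RESTORATION_TIME_S:
                    self.storm = False  # storm over: restore lossless behavior
            return self.storm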
  9. 4. Storage Protocols over RDMA (1/2)
     • sU-RDMA
       • RPC protocol stack for storage-to-storage traffic.
       • Socket-like API in user space.
       • Memory (re-)allocation.
       • The receiver pre-posts receive requests.
       • Sender and receiver agree on data sizes.
       • The transfer method changes with the available credits and message size to use buffers efficiently (see the sketch below):
         • Small: plain RDMA Send/Receive.
         • Medium: RDMA Write followed by a Write Done message.
         • Large: announced with a Send; the receiver issues RDMA Read and replies Read Done.
       • TCP remains available for legacy use and failover.
     [Figure: socket-like API layering.]
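
A minimal sketch of the size-dependent transfer selection described above; the thresholds are hypothetical, and sU-RDMA's real decision also involves the credits granted by the receiver, which is omitted here:

    # Sketch: pick the RDMA verb sequence from the message size.
    SMALL_MAX = 8 * 1024       # hypothetical threshold
    MEDIUM_MAX = 256 * 1024    # hypothetical threshold

    def choose_transfer(size: int) -> str:
        """Return which verb sequence to use for a message of `size` bytes."""
        if size <= SMALL_MAX:
            # Small payloads ride on pre-posted receive buffers directly.
            return "SEND/RECV"
        if size <= MEDIUM_MAX:
            # Sender pushes data with RDMA Write, then signals completion.
            return "WRITE + WRITE_DONE"
        # Large payloads: the sender only announces the buffer; the receiver
        # pulls it with RDMA Read and acknowledges with Read Done.
        return "SEND(descriptor) + READ + READ_DONE"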
  10. 4. Storage Protocols over RDMA (2/2)
      • sK-RDMA
        • RPC protocol stack for VM-to-storage traffic.
        • Socket-like API in kernel space.
        • Likewise credit-based, with PFC and RDMA/TCP.
        • Note that data transfer is initiated from the storage side.
      [Figure: blue text = control messages, red text = data messages.]
  11. 5. RDMA Estats
      • Estats: RDMA Extended Statistics
        • Detailed metrics visibility on the end host.
        • Comparable to TCP netstat.
        • Tx/Rx bytes, NACK counts, and so on.
        • Per-operation latencies are histogrammed to analyze successes and failures (see the sketch below).
        • On high-latency events, NIC/QP state can be dumped and analyzed.
      [Figure: timestamps along the PCIe/network path. T1: work request posted to the send queue; T5: completion received at the NIC; T6: completion received at the CPU. T6-T1 = total operation latency, T5-T1 = NIC/network latency.]
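
A minimal sketch of the Estats idea, deriving the two latencies from the slide's T1/T5/T6 timestamps and histogramming them; bucket boundaries and sample values are assumptions, not the production metric pipeline:

    # Sketch: compute per-operation latencies and bin them into a histogram.
    from collections import Counter

    def latencies_us(t1: float, t5: float, t6: float) -> tuple:
        """T1 = posted to send queue, T5 = completion at NIC, T6 = completion at CPU."""
        nic_nw_latency = t5 - t1      # NIC + network latency
        total_latency = t6 - t1       # total operation latency
        return nic_nw_latency, total_latency

    def bucket(latency_us: float) -> str:
        # Hypothetical histogram buckets.
        for limit in (10, 100, 1000, 10_000):
            if latency_us < limit:
                return f"<{limit}us"
        return ">=10ms"

    histogram = Counter()
    for t1, t5, t6 in [(0.0, 4.0, 6.5), (0.0, 90.0, 120.0)]:   # sample timestamps in µs
        _, total = latencies_us(t1, t5, t6)
        histogram[bucket(total)] += 1
    print(histogram)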
  12. 6. Switch Management
      • Different switches from different vendors:
        • Feature evolution is slow because vendors juggle many requirements; each ASIC has its own buffer architecture.
        • SONiC: abstraction via SAI, used as a cross-platform NOS to unify operation and management.
      • SONiC and RDMA
        • It has the required features: ECN marking, PFC, PFC watchdog, a shared-buffer model.
        • The shared buffer is partitioned by use; buffer sizes and thresholds control when PFC is generated for lossless traffic (see the sketch below).
        • E.g. 18 MB shared, 6 MB for lossless headroom.
      • Testing
        • Functional tests with PTF (Packet Testing Framework).
        • Buffer/counter/scheduler tests verify that what was sent matches what was received, using breakpoints and dumps.
        • Hardware traffic generators with µs/ns precision test RED/ECN marking and the PFC watchdog. (Roughly this area?)
      [Figure: box switch with shallow buffer.]
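
A minimal sketch of the shared-buffer threshold idea for lossless traffic, reusing the slide's 18 MB shared / 6 MB headroom figures; the admission logic itself is a simplified assumption, not the ASIC's real algorithm:

    # Sketch: a lossless ingress queue triggers PFC XOFF once it exceeds its
    # share of the pool; headroom then absorbs packets already in flight.
    SHARED_POOL_BYTES = 18 * 1024 * 1024       # shared buffer (slide figure)
    LOSSLESS_HEADROOM_BYTES = 6 * 1024 * 1024  # reserved headroom (slide figure)

    def should_send_pfc_pause(ingress_lossless_bytes: int, shared_in_use: int,
                              static_threshold: int) -> bool:
        """Return True when the queue should emit PFC XOFF toward the sender."""
        pool_left = SHARED_POOL_BYTES - shared_in_use
        return ingress_lossless_bytes >= min(static_threshold, pool_left)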
  13. 7. Congestion Control (1/2)
      • PFC/DCQCN are made to scale across long regional links.
        • Don't design for the case where every port is congested at once.
        • T2/RH switches use off-chip DRAM as deep buffers and absorb RDMA traffic.
        • PFC headroom is not reserved per queue; a static, oversubscribed threshold per switch ingress absorbs it.
        • Both choices are empirical; they eliminate congestion loss and improve burst tolerance.
      • DCQCN interoperability
        • DCQCN controls the sending rate per QP.
        • Sender = RP, switch = CP, receiver = NP.
        • NIC implementation differences:
          • Gen1: the CP/NP state machine is in firmware, so CNP generation is minimized; because cache misses are easy to trigger, rate limiting is per packet burst rather than per packet.
          • Gen2/3: DCQCN in hardware; a CNP is issued per ECN-marked packet, and per-packet rate limiting is driven by CNPs per unit time.
        • Mixing Gen1 with Gen2/3 causes cascades of cache misses and degrades performance; Gen2/3 also sends Gen1 excessive CNPs, triggering excessive rate limiting.
        • Fixes: a per-QP CNP limiter on Gen2/3 spaces consecutive CNPs to Gen1's interval (a sketch follows below), the RP-side time window is adjusted to match Gen1, and Gen2/3 is changed to rate-limit per burst.
      [Figure: a few µs over ~100 m up to ~2 ms over ~10 km; shallow-buffer box switches vs deep-buffer/VoQ chassis switches. Example ECN packet flow: sender/RP → switch/CP (ECN mark) → receiver/NP, which returns a CNP; the sender reduces its rate.]
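
A minimal sketch of the per-QP CNP limiter fix described above, assuming a simple minimum-interval pacer; the interval value is a placeholder, not the NIC firmware's actual setting:

    # Sketch: a Gen2/3 NP that would otherwise return one CNP per ECN-marked
    # packet is throttled so consecutive CNPs for the same QP are no closer
    # together than the Gen1-equivalent interval.
    CNP_MIN_INTERVAL_US = 50.0           # hypothetical pacing interval

    last_cnp_us = {}                      # QP number -> time of last CNP sent

    def on_ecn_marked_packet(qp: int, now_us: float) -> bool:
        """Return True if a CNP should be sent back to the sender (RP) for this QP."""
        last = last_cnp_us.get(qp)
        if last is None or now_us - last >= CNP_MIN_INTERVAL_US:
            last_cnp_us[qp] = now_us
            return True
        return False   # suppress: too soon after the previous CNP for this QP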
  14. 7. Congestion Control (2/2)
      • Tuning DCQCN
        • Only DCQCN's global parameters can be changed.
        • They want to support just two classes of RDMA flows (frontend and backend).
        • Workflow: understand the behavior with the analytical model -> estimate reasonable parameters in lab tests -> fix the parameters in a production-like test environment.
        • DCQCN is rate-based, so there is no RTT unfairness.
        • To give long-RTT flows high throughput, use a larger Kmax-Kmin or a smaller Pmax, i.e. sparser ECN marking (see the marking sketch below).
        • DCQCN and switch buffers must be tuned together, e.g. raise the buffer thresholds before raising Kmin.
      [Figure: lab-test results; they did not end up splitting parameters for intra- vs inter-datacenter traffic (2 µs vs 1.77 ms RTT).]
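
A minimal sketch of RED/ECN marking as controlled by Kmin/Kmax/Pmax, using the standard linear-probability formulation (assumed here); the parameter values are placeholders, not Azure's production settings:

    # Sketch: marking probability grows linearly from 0 at Kmin to Pmax at
    # Kmax, and every packet is marked above Kmax.
    import random

    def ecn_mark_probability(queue_kb: float, kmin_kb: float, kmax_kb: float,
                             pmax: float) -> float:
        if queue_kb <= kmin_kb:
            return 0.0
        if queue_kb >= kmax_kb:
            return 1.0
        return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)

    def should_mark(queue_kb: float, kmin_kb=200.0, kmax_kb=1000.0, pmax=0.01) -> bool:
        # A larger Kmax-Kmin or smaller Pmax means sparser marking, which lets
        # long-RTT flows keep more of their rate, as the slide notes.
        return random.random() < ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax)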
  15. 8. Experience: Timeline and deployment
      • Timeline
        • 2018: Backend (inter-storage)
        • 2019: Frontend (storage/compute)
        • 2020: Intra-region RDMA
        • 2023/02: supported in all regions
      • Rollout steps for production
        • Component testing, then same-stack testing in a testbed.
        • RDMA is enabled in production gradually, together with NIC driver/firmware updates and NOS updates.
        • T0 switches use SONiC fast reboot / warm reboot, keeping the traffic interruption under one second.
        • NIC driver updates run only after the host has been switched to TCP mode.
  16. 8. Experience: 2018 A/B test results
      • Backend
        • CPU usage (storage app, storage protocol, TCP/IP stack) under a high-TPS workload.
        • Backend storage traffic FCT.
      • Frontend
        • DiskSpd tests with 8 KB I/O at two IOPS levels: up to 34.5% CPU reduction.
        • Tests with real customer VMs comparing average TCP vs RDMA latency: more than 20% reduction for large I/Os.
  17. 8. Experience: issues found and fixed
      • FMR hidden fence
        • After enabling region-wide RDMA, throughput was extremely low.
        • A compute node using sK-RDMA issues Sends,
        • but due to the hidden fence only one Send could be in flight per RTT.
        • Fixed by the vendor in the NIC driver.
      • PFC and MACsec
        • Only long-haul links showed elevated drop rates -> PFC frames were being dropped.
        • No specification existed for encrypting PFC frames, so they standardized one.
  18. 9. Lessons and Open Problems
      • Failover is too expensive
        • Implemented in sU/sK-RDMA as a last resort; having to use it means a major incident.
        • CPU core utilization/load explodes and cores run out, i.e. failover is itself a risk; deploy carefully!
      • Converging the host network and the physical switches
        • Intra-host (E-W) vs inter-host (N-S); N-S consumes a lot of resources. What to do going forward?
      • Switch buffers
        • Small buffers tend to cause performance problems; they solved this with switch parameters rather than DCQCN.
        • Why? -> to handle bursty traffic, which is common in DCs.
        • Deep buffers are needed; a more modern (programmable?) buffer-management mechanism would be welcome.
  19. 10. Related Work
      • Deployment experience of RDMA
        • Bing
        • Alibaba (LUNA/SOLAR: storage backend; user-space TCP, UDP on a DPU)
        • AWS SRD (Scalable Reliable Datagram) on the Nitro NIC for HPC/ML/storage
      • RDMA improvement points in the DC
        • Congestion control: ECN-based, delay-based, INT-based, credit-based
        • Beyond congestion control: deadlock avoidance, multi-path support, security, virtualization, multi-tenancy
      • Acceleration vehicles
        • Socket, kernel, SmartNIC
  20. 11. Conclusions and Future Work
      • Intra-region RDMA in Azure
        • ~70% of the traffic is RDMA
        • RDMA is supported in all public regions
        • Improves disk I/O performance and saves CPU cores at the same time
      • Further improvements planned in system architecture, HW acceleration, and congestion control