#30 “Serving Netflix Video at 400Gbps on FreeBSD”

Research Paper Introduction #30 “Serving Netflix Video at 400Gbps on FreeBSD” (#86 overall) @cafenero_777 2021/11/04

Agenda
• Presentation of the target paper
• Overview and why I chose to read it
1. NUMA
2. kTLS offload
3. Other architectures

Target paper
• Serving Netflix Video at 400Gbps on FreeBSD
• Drew Gallatin
• Netflix
• EuroBSDcon ’21
• video: https://www.youtube.com/watch?v=_o-HcG8QxPc
• slide: https://people.freebsd.org/~gallatin/talks/euro2021.pdf
• Japanese news: https://gigazine.net/news/20210922-amd-epyc-push-netflix-server-bandwidth

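Overview and why I chose to read it
• Overview
  • A 400Gbps video-serving server does not reach its target performance!
  • Debugging and improving NUMA and kTLS got it almost there
• Why I chose to read it
  • 400Gbps! It feels like pushing the limits!
  • Purpose-built systems (e.g. a video-serving server)
  • Squid Game is fun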

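Notes
• The explanation roughly follows the order of the talk
• Slides from the talk are reused, with supplementary comments
• Watch out for bits per second (bps) versus bytes per second (B/s, B/sec)
• Numbers worth noticing are underlined
• The commentary is somewhat dramatized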

1. NUMA

Motivation
• Since 2020: 200Gbps TLS video serving from a single server
• What about the next step, 400Gbps?
• SW stack: FreeBSD-current, NGINX (web) w/ sendfile(2), software kTLS
• HW stack:

Netflix Video Serving Hardware
• AMD EPYC 7502P (“Rome”)
  ◦ 32 cores @ 2.5GHz
  ◦ 256GB DDR4-3200
    ▪ 8 channels
    ▪ ~150GB/s mem bw
      • Or ~1.2Tb/s in networking units
  ◦ 128 lanes PCIe Gen4
    ▪ ~250GB/s of IO bandwidth
      • Or ~2Tb/s in networking units
• 2x Mellanox ConnectX-6 Dx
  ◦ Gen4 x16, 2 full speed 100GbE ports per NIC
    ▪ 4 x 100GbE in total
  ◦ Support for NIC kTLS offload
• 18x WD SN720 NVME
  ◦ 2TB
  ◦ PCIe Gen3 x4
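The hardware slide quotes every bandwidth twice, once in bytes and once in bits. A minimal sketch of that conversion follows; the function names are made up for this note and are not from the talk.

```python
def gbytes_to_gbits(gbytes_per_sec: float) -> float:
    """Convert GB/s (bytes) to Gb/s (bits); 1 byte is 8 bits."""
    return gbytes_per_sec * 8


def gbits_to_gbytes(gbits_per_sec: float) -> float:
    """Convert Gb/s (bits) to GB/s (bytes)."""
    return gbits_per_sec / 8


if __name__ == "__main__":
    # Figures quoted on the hardware slide.
    print(gbytes_to_gbits(150))  # memory bandwidth: 1200 Gb/s, i.e. ~1.2Tb/s
    print(gbytes_to_gbits(250))  # PCIe IO bandwidth: 2000 Gb/s, i.e. ~2Tb/s
    print(gbits_to_gbytes(400))  # the 400Gb/s serving target in bytes: 50.0 GB/s
```

Keeping both helpers around makes it harder to mix up bps and B/s when reading the later slides.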

Does it reach 400Gbps?

Does it reach 400Gbps? -> It does not.

Overview of the initial state
• Only 240Gbps is reached
• Memory bandwidth looks like the cause
• Review using the data-flow overview diagram

Performance Results:
• 240Gb/s
• Limited by memory BW
  ◦ Determined empirically by using AMDuProfPCM

Netflix 400Gb/s Video Serving Data Flow
[Diagram: Disks, Memory, CPU and Network Card; four bulk-data transfers of 50GB/s each, with metadata shown separately]
Using sendfile and software kTLS, data is encrypted by the host CPU.
400Gb/s == 50GB/s
~200GB/sec of memory bandwidth and ~64 PCIe Gen 4 lanes are needed to serve 400Gb/s
Note: the data is read, encrypted with kTLS, and then served, so 200GByte/sec of memory bandwidth is needed in total.
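The 200GB/sec figure can be checked by counting how many times each byte passes through memory. The sketch below is only arithmetic over numbers from the slides (50GB/s stream, ~150GB/s memory bandwidth); the list of passes is my reading of the data-flow diagram, and all names are hypothetical.

```python
SERVE_GBITS = 400
STREAM_GBYTES = SERVE_GBITS / 8  # 400Gb/s == 50GB/s

# Assumed memory passes per byte served with software kTLS.
SW_KTLS_PASSES = [
    "disk DMA writes plaintext into memory",
    "CPU reads plaintext for encryption",
    "CPU writes ciphertext back to memory",
    "NIC DMA reads ciphertext from memory",
]

MEM_BW_AVAILABLE_GBYTES = 150  # "~150GB/s mem bw" from the hardware slide


def memory_bw_needed(stream_gbytes: float, passes: list) -> float:
    """Every pass moves the whole stream through memory once."""
    return stream_gbytes * len(passes)


if __name__ == "__main__":
    needed = memory_bw_needed(STREAM_GBYTES, SW_KTLS_PASSES)
    print(needed)                            # 200.0 GB/s
    print(needed > MEM_BW_AVAILABLE_GBYTES)  # True: more than the memory can deliver
```

Under these assumptions the requirement exceeds what the memory system offers, which is consistent with the slide's "limited by memory BW".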

In practice, data crosses NUMA domains
• The link between NUMA nodes (the fabric) is narrow and adds latency (12-28ns)
• In the worst case, data may cross it 4 times
• Fabric bandwidth becomes saturated and congested
  • The CPU ends up stalling
• Strategy: reduce the number of crossings and keep traffic within 200GB/sec

Cross-Domain costs
Bandwidth Limit:
• AMD Infinity Fabric
  ◦ ~47GB/s per link
  ◦ ~280GB/s total

4 Nodes, worst case
Steps to send data:
• DMA data from disk to memory
  ◦ First NUMA bus crossing
• CPU reads data for encryption
  ◦ Second NUMA crossing
• CPU writes data for encryption
  ◦ Third NUMA crossing
• DMA data from memory to network
  ◦ Fourth NUMA crossing

Causes of the crossings, and the upstream work
• lagg(4) is used -> the 5-tuple hash scatters egress across ports
• The disks being read are scattered
• Memory allocated for kTLS is scattered across nodes
• kTLS workers and TCP pacers are scattered

Network centric siloing
• Associate network connections with NUMA nodes
• Allocate local memory to back media files when they are DMA’ed from disk
• Allocate local memory for TLS crypto destination buffers & do SW crypto locally
• Run kTLS workers, RACK / BBR TCP pacers with domain affinity
• Choose local lagg(4) egress port
All of this is upstream!

4 Nodes, worst case with siloing
Steps to send data:
• DMA data from disk to memory
  ◦ First NUMA bus crossing
• CPU reads data for encryption
• CPU writes data for encryption
• DMA data from memory to network
Note: reduced from 4 crossings to 1.

Causes of the crossings, and the upstream work
• lagg(4) is used -> the 5-tuple hash scatters egress across ports -> associate connections/egress with NUMA nodes*
• The disks being read are scattered -> back DMA'ed data with NUMA-local memory*
• Memory allocated for kTLS is scattered across nodes -> allocate kTLS buffers NUMA-locally*
• kTLS workers and TCP pacers are scattered -> pin them with NUMA affinity*
*Addressed by upstreaming the changes. With siloing, the worst case drops from 4 crossings to 1.
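As a toy illustration of "choose local lagg(4) egress port", the sketch below prefers ports on the connection's NUMA domain and only hashes across all ports when no local port exists. This is not the FreeBSD implementation; the `Port` type, the port names, and the use of CRC32 as the flow hash are all assumptions made for this note.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str         # hypothetical interface name
    numa_domain: int  # NUMA domain the NIC port is attached to


def pick_egress_port(flow_key: bytes, conn_domain: int, ports: list) -> Port:
    """Prefer a port local to the connection's domain; otherwise hash over all ports."""
    local = [p for p in ports if p.numa_domain == conn_domain]
    candidates = local or ports
    # Hashing keeps a given flow on the same port every time.
    return candidates[zlib.crc32(flow_key) % len(candidates)]


if __name__ == "__main__":
    ports = [Port("p0", 0), Port("p1", 0), Port("p2", 1), Port("p3", 1)]
    flow = b"198.51.100.7:443->203.0.113.9:50000/tcp"
    print(pick_egress_port(flow, conn_domain=1, ports=ports))  # a port on domain 1
    print(pick_egress_port(flow, conn_domain=3, ports=ports))  # no local port: any port
```

The fallback branch mirrors the situation on the next slide, where some domains have no NIC of their own.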

Did that do it?!

Did that do it?! -> Not yet.

The real HW configuration is unbalanced...
• There are 4 NUMA domains, but the devices are laid out asymmetrically
  • 2 NICs with two 100G ports each
  • The number of disks differs per NUMA domain
• Data crosses 1.25 times on average (2 times at worst)
• Performance result: 280Gbps (a 40Gbps improvement)

Average Case Summary:
• 1.25 NUMA crossings on average
  ◦ 75% of disk reads across NUMA
  ◦ 50% of NIC transmits across NUMA due to unbalanced setup
• 62.5 GB/sec of data on NUMA fabric

Performance: 1 vs 4 nodes
• 240Gbps -> 280Gbps
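The 1.25 crossings and 62.5 GB/sec on the slide follow directly from the two percentages. A small sketch of that arithmetic, with hypothetical function names:

```python
STREAM_GBYTES = 50  # 400Gb/s == 50GB/s


def expected_crossings(p_disk_remote: float, p_nic_remote: float) -> float:
    """Each remote disk read and each remote NIC transmit costs one fabric crossing."""
    return p_disk_remote + p_nic_remote


def fabric_traffic(stream_gbytes: float, crossings: float) -> float:
    """GB/s carried by the NUMA fabric for a given number of crossings per byte."""
    return stream_gbytes * crossings


if __name__ == "__main__":
    avg = expected_crossings(0.75, 0.50)
    print(avg)                                 # 1.25
    print(fabric_traffic(STREAM_GBYTES, avg))  # 62.5 GB/s, as on the slide
    print(fabric_traffic(STREAM_GBYTES, 4))    # 200 GB/s in the 4-crossing worst case
```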

2. kTLS

kTLS and offload
• Strategy: offload kTLS to the NIC in order to reduce
  • memory bandwidth use (100GB/sec in total becomes sufficient)
  • CPU load

Netflix 400Gb/s Video Serving Data Flow
[Diagram: the same data-flow figure as in the initial-state slide, annotated]
Note: do not run kTLS on the CPU; do it on the NIC.
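Why 100GB/sec becomes sufficient: once the NIC encrypts inline, the CPU no longer reads and rewrites the data, so only two memory passes remain. The sketch below is arithmetic only; the pass lists are my reading of the diagram and the names are hypothetical.

```python
STREAM_GBYTES = 50  # 400Gb/s == 50GB/s

# Assumed memory passes per byte served, by crypto mode.
PASSES = {
    "sw_ktls": ["disk DMA write", "CPU read", "CPU write", "NIC DMA read"],
    "nic_ktls": ["disk DMA write", "NIC DMA read"],
}


def memory_bw_needed(mode: str, stream_gbytes: float = STREAM_GBYTES) -> float:
    """Memory bandwidth in GB/s when every pass moves the full stream once."""
    return stream_gbytes * len(PASSES[mode])


if __name__ == "__main__":
    print(memory_bw_needed("sw_ktls"))   # 200 GB/s
    print(memory_bw_needed("nic_ktls"))  # 100 GB/s, half of the software case
```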

Background on kTLS
• kTLS
  • TLS encryption is performed inside the kernel
  • This makes HW offload easier
  • Hardware Inline TLS
  • The TLS session is established in user space
• CX-6 Dx
  • Keeps state within a TLS record
  • On a TCP retransmit
    • The whole TLS record containing the packet is transferred again

Mellanox ConnectX-6 Dx
• Offloads TLS 1.2 and 1.3 for AES GCM cipher
• Retains crypto state within a TLS record
  ◦ Means that the TCP stack can send partial TLS records without performance loss
• If a packet is sent out of order (eg, a TCP retransmit), it must re-DMA the record containing the out of order packet

What is NIC kTLS?:
• Hardware Inline TLS
• TLS session is established in userspace.
• When crypto is moved to the kernel, the kernel passes crypto keys to the NIC
• TLS records are encrypted by NIC as the data flows through it on transmit
  ◦ No more detour through the CPU for crypto
  ◦ This cuts memory BW requirements in half!

The NIC encrypts while tracking state
[Figure, repeated across animation frames: TCP segments of a plaintext TLS record (offsets 0, 1448, 2896, ..., 15928) move from host memory over the PCIe bus to the NIC and leave on the 100GbE network as TCP segments of the encrypted TLS record.]
• Segments 8688, 10136, …, 14480 are transferred to the NIC
• The NIC encrypts them
• The encrypted segments are sent
• The next packet (15928) is transferred to the NIC
• It is encrypted using the TLS state left on the NIC by the previous segment
• The encrypted segment is sent

What about TCP retransmits?
• Everything is sent again...
  • Could this cause traffic bursts?
• The plan is to fall back to SW kTLS (described later)
[Figure: TCP segments of a plaintext TLS record (offsets 0, 1448, ..., 15928) in host memory, the PCIe bus, the NIC, and encrypted segments on the 100GbE network.]
• (Everything has already been sent, but because of a TCP retransmit request) only a packet from the middle of the record should go to the NIC
• Because the TLS state on the NIC does not match, all of the packets are transferred to the NIC
• They are encrypted and sent
• A single packet cannot be encrypted and sent on its own at an arbitrary time
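The cost of a retransmit can be sketched with a toy model of the NIC's per-record state. Assumptions, none of them from the talk: the record is 16384 bytes (the maximum TLS plaintext size, which happens to reproduce the offsets 0 to 15928 in the figure with a 1448-byte segment), record headers and tags are ignored, and an out-of-order segment costs a re-DMA of the whole record. The class and function names are made up.

```python
MSS = 1448          # TCP payload bytes per segment, matching the figure's offsets
RECORD_LEN = 16384  # assumed record size (maximum TLS plaintext length)


def segment_offsets(record_len: int = RECORD_LEN, mss: int = MSS) -> list:
    """Byte offsets at which each TCP segment of the record starts."""
    return list(range(0, record_len, mss))


class NicRecordState:
    """Toy model: the NIC remembers how far into the record it has encrypted."""

    def __init__(self, record_len: int = RECORD_LEN):
        self.record_len = record_len
        self.next_offset = 0

    def bytes_dmaed_for(self, offset: int, length: int) -> int:
        if offset == self.next_offset:
            # In order: the retained state applies, so only this segment is DMA'ed.
            self.next_offset = offset + length
            return length
        # Out of order (e.g. a retransmit): assume the whole record is re-DMA'ed.
        return self.record_len


if __name__ == "__main__":
    nic = NicRecordState()
    in_order = sum(
        nic.bytes_dmaed_for(off, min(MSS, RECORD_LEN - off))
        for off in segment_offsets()
    )
    print(in_order)                        # 16384: each byte crosses PCIe once
    print(nic.bytes_dmaed_for(8688, MSS))  # 16384 bytes moved to resend 1448
```

In this model a single retransmitted segment costs more than eleven times its own size in PCIe traffic, which is the motivation for falling back to software kTLS on lossy connections.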

Performance measurement (first attempt)
• With the initial firmware, traffic was not stable (~75Gbps)
• Performance degradation was observed when there were too many TLS sessions (~100 sessions/client)
  • The actual cause is unknown (the guess is paging/thrashing of NIC buffers)

CX6-DX: Initial Results
• Peak: 125Gb/s per NIC (~250Gb/s total)
• Sustained: 75Gb/s per NIC (~150Gb/s total)
• Pre-release Firmware

CX6-DX: Initial performance
• NIC stores TLS state per-session
• We have a lot of sessions active
  ◦ (~400k sessions for 400Gb/s)
  ◦ Performance gets worse the more sessions we add
• Limited memory on-board NIC
  ◦ NIC pages in and out to buffers in host RAM
  ◦ Buffers managed by NIC

Performance measurement (Nth attempt)
• Use PCIe Relaxed Ordering (which was supposed to be usable already)
• Use the TLS_OPTIMIZE setting
• A stable 190Gbps per NIC is achieved!

PCIe Relaxed Ordering
• Allows PCIe transactions to pass each other
  ◦ Should eliminate pipeline bubbles due to “slow” reads delaying fast ones.
  ◦ May help with “paging in” TLS connection state
• Enabled Relaxed Ordering
  ◦ Didn’t help
  ◦ Turns out CX6-DX pre-release firmware hardcoded Relaxed Ordering to disabled

CX6-DX: Results from production fw
• Firmware update added “TLS_OPTIMIZE” setting
• Peak & sustained results improved: 190Gb/s per NIC (~380Gb/s total)!

References:
• https://learning.oreilly.com/library/view/pci-express-system/0321156307/
• https://www.quora.com/What-is-meant-by-relaxed-ordering-in-PCIE
Note: Relaxed Ordering is an extension that allows the order of transactions to be changed (originally neither writes nor reads could be reordered), so the queues drain faster.

Getting closer to the real workload
• QoE (Quality of Experience) testing
  • It depends on the case, but the results are generally good
• When TCP retransmits exceed a threshold, the connection falls back to SW kTLS
  • With a threshold of 1% of bytes retransmitted, 1/3 of the sessions move to SW processing; both throughput and CPU efficiency drop
  • This was an unexpected result...

CX6-DX: What’s needed to use of NIC TLS in production at Netflix?
• QoE testing
  ◦ Measure various factors, such as rebuffer rate, play delay, time to quality, etc.
  ◦ Initial results are great
  ◦ Larger, more complete study scheduled soon.

CX6-DX: Mixed HW/SW session perf?
• Moving a non-trivial percentage of conns to SW has unanticipated BW cost.
• Setting SW switch threshold to 1% bytes retransmitted moves ⅓ of conns to SW
• Max stable BW moves from 380Gb/s to 350Gb/s with roughly ⅓ of connections in SW
  ◦ Performance impact is more than expected
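The fallback rule on the slide ("1% bytes retransmitted") can be written as a per-connection check. This is only a sketch of the rule as stated; the counters, names, and the strict comparison are assumptions, and the kernel's actual bookkeeping is not described in the talk.

```python
from dataclasses import dataclass

RETRANSMIT_THRESHOLD = 0.01  # 1% of bytes retransmitted, from the slide


@dataclass
class ConnStats:
    bytes_sent: int           # hypothetical per-connection counter
    bytes_retransmitted: int  # hypothetical per-connection counter


def should_fall_back_to_sw(stats: ConnStats,
                           threshold: float = RETRANSMIT_THRESHOLD) -> bool:
    """True when the retransmitted share of bytes exceeds the threshold."""
    if stats.bytes_sent == 0:
        return False  # nothing sent yet, so there is no basis for switching
    return stats.bytes_retransmitted / stats.bytes_sent > threshold


if __name__ == "__main__":
    print(should_fall_back_to_sw(ConnStats(1_000_000, 5_000)))   # False (0.5%)
    print(should_fall_back_to_sw(ConnStats(1_000_000, 20_000)))  # True (2%)
```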

Performance measurement summary
• Baseline: 240Gbps
• NUMA optimization: 280Gbps
• HW kTLS: 380Gbps
• Including SW fallback (planned for deployment): 350Gbps
• Other labels on the chart: 50%, 65%

3. Other architectures

ARM
• The specs themselves are good
• Tooling is lacking
  • Profilers and performance counters
  • Hard going because the team is not used to the platform...
• Not enough PCIe tags -> fixing that reached 320Gb/s

Other platforms? Ampere Altra
• “Mt. Snow”
  ◦ Q80-30: 80 3.0GHz Arm Neoverse-N1 cores
  ◦ 8 channels of 256GB DDR4-3200
  ◦ 128 Lanes Gen4 PCIe
  ◦ 16x WD SN720 2TB NVMe
  ◦ 2 Mellanox CX6-DX NICs

Ampere: PCIe Extended Tags
• After enabling extended tags, we see a bandwidth improvement: 240Gb/s -> 320Gb/s

Intel
• The memory is DDR4-3200 but actually runs at 2933
  • This is the bottleneck
• PCIe Relaxed Ordering cannot be used at present
  • Good HW kTLS results cannot be expected

Other platforms? Intel Ice Lake Xeon
• 8352V CPU
  ◦ 36 cores, 2.1GHz
  ◦ 8 channels 256GB DDR4-3200 (running at 2933)
  ◦ 64 Lanes Gen4 PCIe
  ◦ 20x Kioxia 4TB NVMe (PCIe Gen4)
  ◦ 2 Mellanox CX6-DX NICs

Intel Ice Lake Xeon (WIP)
• 230Gb/s SW kTLS
  ◦ Limited by memory BW
    ▪ 8352V runs memory at 2933, other SKUs run at 3200
      • Would expect the same performance as AMD from that
• BIOS locked out PCIe Relaxed ordering, so no NIC KTLS results yet
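To put the 2933 versus 3200 difference in numbers, the theoretical peak of a DDR4 configuration is transfers per second times 8 bytes per transfer (64-bit channel) times the channel count. The sketch below computes only that theoretical peak; achievable bandwidth is lower (the AMD slide quotes ~150GB/s), and the function name is made up.

```python
def ddr_peak_gbytes(mega_transfers_per_sec: float,
                    channels: int = 8,
                    bytes_per_transfer: int = 8) -> float:
    """Theoretical peak in GB/s: MT/s x bytes per transfer x channels."""
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    full = ddr_peak_gbytes(3200)   # 8 channels of DDR4-3200: about 204.8 GB/s
    slow = ddr_peak_gbytes(2933)   # the same channels at 2933: about 187.7 GB/s
    print(round(full, 1), round(slow, 1))
    print(round(slow / full, 3))   # fraction of the DDR4-3200 peak that remains
```

The gap is roughly 8%, so a memory-bound software kTLS workload would be expected to land somewhat below the AMD baseline on this SKU.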

Rough comparison

Coming next
• An 800Gbps talk next year?!

But wait, there’s …. not … more..
• 800Gb prototype sitting on datacenter floor due to shipping exception 😞
• Something to talk about next year?

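QA
• kTLS is available from FreeBSD 13
• Is NFS fast with kTLS?
  • The answer was "not really sure"
• PLX (Broadcom?) ???
• Desktop SKUs should have faster DDR
  • "I used a server with overclocked DDR some years ago"
• TCP offload is out of scope for this talk (apparently other people are working on it)
• 128k TLS sessions/port flows is apparently the HW limit
• Content on disk is already DRM-encrypted; it is then encrypted with TLS and sent
• No particular traffic steering is done; it is left to LACP
• CPU load (%) is measured with vmstat; bandwidth with Intel PCM and similar tools
• SmartNICs (Xilinx FPGA) are seen as aimed at power saving; in a server the NIC already draws relatively little power, so not particularly... (no interest?)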

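Three-line summary
• Kernel changes were upstreamed so that data avoids crossing NUMA domains on EPYC CPUs
• kTLS was offloaded to the NIC (HW), optimizing inter-NUMA bandwidth use and CPU load
• Serving at up to 380Gbps became possible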

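Thoughts after finishing
• Not enough listening skill? Restarted sentences are hard to catch...
  • Subtitles were on, of course
• The talk is a little hard to follow
  • It goes in the order the work was done rather than starting from the results (you have to trace the speaker's steps)
• The talk and the slides do not always seem to line up...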

EoP
