Slide 1

Slide 1 text

Rapid Start: Faster Internet Connections, with Ruby’s Help Kazuho Oku, Fastly

Slide 2

Slide 2 text

● lead developer of the H2O HTTP server ■ used by Fastly ■ has its own HTTP/1, 2, 3, TLS/1.3, QUIC implementation ■ supports HTTP routing using mruby (Rack) ● as a hobby programmer: ○ rat (ruby-based IPv4 NAT) ● as a co-author of RFCs: ○ RFC 8297 (HTTP 103 Early Hints) ○ RFC 9218 (HTTP Extensible Priorities) ○ RFC 9849 (TLS Encrypted Client Hello) Who am I 2

Slide 3

Slide 3 text

● Rapid Start ○ Fastly’s new startup algorithm of its congestion control ● jrf - our ruby-based tool for log analysis ● Visualization of network-related performance tests Topics 3

Slide 4

Slide 4 text

Rapid Start

Slide 5

Slide 5 text

● TCP+TLS/1.2: ○ full handshake: 3 RT ○ resumption: 2 RT ● TCP+TLS/1.3: ○ full handshake: 2 RT ○ resumption: 1 RT ● QUIC: ○ full handshake: 1 RT ○ resumption: 0 RT ※RT = number of round-trips Time to establish a connection 5

Slide 6

Slide 6 text

● TCP+TLS/1.2: ○ full handshake: 3 RT ○ resumption: 2 RT ● TCP+TLS/1.3: ○ full handshake: 2 RT ○ resumption: 1 RT ● QUIC: ○ full handshake: 1 RT ○ resumption: 0 RT ※RT = number of round-trips Time to establish a connection HTTP/3 6

Slide 7

Slide 7 text

● With HTTP/3, handshake latency is minimized: ○ full handshake: 1 RT ○ resumption: 0 RT Reducing the latency of HTTP 7

Slide 8

Slide 8 text

● With HTTP/3, handshake latency is minimized: ○ full handshake: 1 RT ○ resumption: 0 RT ● Time To First Byte (TTFB) is: ○ full handshake: 2 RT ○ resumption: 1 RT Reducing the latency of HTTP 8

Slide 9

Slide 9 text

● With HTTP/3, handshake latency is minimized: ○ full handshake: 1 RT ○ resumption: 0 RT ● Time To First Byte (TTFB) is: ○ full handshake: 2 RT ○ resumption: 1 RT ● What about Time To Last Byte (TTLB)? ○ TTLB is typically TTFB plus the speed of Slow Start Reducing the latency of HTTP 9

Slide 10

Slide 10 text

● Initial phase of congestion control: ○ used when the available bandwidth is unknown ○ to quickly determine the available bandwidth Slow Start 10

Slide 11

Slide 11 text

● Initial phase of congestion control: ○ used when the available bandwidth is unknown ○ to quickly determine the available bandwidth ● Start by sending IW packets: ○ IW = 10 (RFC), 30 (real-world) ○ for each ack received, send twice as many packets (doubling per RTT) Slow Start 11

Slide 12

Slide 12 text

● Initial phase of congestion control: ○ used to quickly fill the available bandwidth, which is unknown at the beginning of the connection ● Starts by sending IW packets: ○ IW = 10 (RFC), 30 (real-world) ○ for each ack received, send twice as many packets (doubling per RTT) ● When packets are dropped (i.e., the network overflows), slow start enters “recovery” to repair lost packets, then congestion control switches to the second phase, known as congestion avoidance Slow Start 12

Slide 13

Slide 13 text

Slow Start and BDP 13

Slide 14

Slide 14 text

Slow Start and BDP [Figure: packets in flight around the bottleneck link] Idle BDP: number of packets needed to fully utilize the bottleneck link without building a queue. The queue builds up when packets arrive faster than the bottleneck link can forward them. When the queue overflows, packets are dropped.

Slide 15

Slide 15 text

● Idle BDP = 55Mb/s * 0.039s [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 16

Slide 16 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 17

Slide 17 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets ● With Slow Start: [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 18

Slide 18 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets ● With Slow Start: ○ 1RT: 30 packets [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 19

Slide 19 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets ● With Slow Start: ○ 1RT: 30 packets ○ 2RT: 60 packets [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 20

Slide 20 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets ● With Slow Start: ○ 1RT: 30 packets ○ 2RT: 60 packets ○ 3RT: 120 packets [Figure labels: Idle BDP, queue] Slow Start and BDP

Slide 21

Slide 21 text

● Idle BDP = 55Mb/s * 0.039s ≒ 2.15Mb ≒ 268KB ≒ 209 packets ● With Slow Start: ○ 1RT: 30 packets ○ 2RT: 60 packets ○ 3RT: 120 packets ○ 4RT: 240 packets (bottleneck link is finally saturated) [Figure labels: Idle BDP, queue] Slow Start and BDP
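The arithmetic on these slides can be reproduced with a short Ruby sketch. The packet size of 1280 bytes is an assumption (the slides give only the resulting packet count), and the constant names are illustrative.

```ruby
# Worked example of the BDP and Slow Start numbers from the slides.
# PACKET_SIZE is an assumption; the slides state only "209 packets".
BANDWIDTH_BPS = 55_000_000 # 55 Mb/s bottleneck link
RTT_SEC       = 0.039      # 39 ms round-trip time
PACKET_SIZE   = 1280       # bytes per packet (assumed)
IW            = 30         # initial window in packets ("real-world" value)

bdp_bits    = BANDWIDTH_BPS * RTT_SEC
bdp_bytes   = bdp_bits / 8
bdp_packets = (bdp_bytes / PACKET_SIZE).floor

# Packets sent in each round trip when the window doubles every RTT,
# stopping at the first round that covers the idle BDP.
rounds = []
window = IW
loop do
  rounds << window
  break if window >= bdp_packets
  window *= 2
end

puts "idle BDP: #{bdp_packets} packets"
rounds.each_with_index { |w, i| puts "#{i + 1}RT: #{w} packets" }
```

Under these assumptions the window first exceeds the idle BDP in the fourth round trip, which matches the slide's 240-packet step.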

Slide 22

Slide 22 text

Vertical axis: bytes sent / acked (cumulative) Horizontal axis: time elapsed (milliseconds) Black dot: packet sent Yellow dot: ack received (reflects when the receiver received packets) This network on simulator 22

Slide 23

Slide 23 text

Vertical axis: bytes sent / acked (cumulative) Horizontal axis: time elapsed (milliseconds) Black dot: packet sent Yellow dot: ack received (reflects when the receiver received packets) This network on simulator 23

Slide 24

Slide 24 text

Vertical axis: bytes sent / acked (cumulative) Horizontal axis: time elapsed (milliseconds) Black dot: packet sent Yellow dot: ack received (reflects when the receiver received packets) This network on simulator nothing is received, as the sender stops initial sending after 0.5 RTT 24

Slide 25

Slide 25 text

Vertical axis: bytes sent / acked (cumulative) Horizontal axis: time elapsed (milliseconds) Black dot: packet sent Yellow dot: ack received (reflects when the receiver received packets) This network on simulator nothing is received, as the sender stops initial sending after 0.5 RTT underutilization 25

Slide 26

Slide 26 text

Impact of queuing and drops (VDSL) 26

Slide 27

Slide 27 text

Impact of queuing and drops (VDSL) moments of idle 27

Slide 28

Slide 28 text

Impact of queuing and drops (VDSL) moments of idle queuing due to bursty sending 28

Slide 29

Slide 29 text

Impact of queuing and drops (VDSL) moments of idle queuing due to bursty sending packet drops, and hence recoveries due to queue overflow 29

Slide 30

Slide 30 text

Impact of queuing and drops (VDSL) moments of idle queuing due to bursty sending packet drops, and hence recoveries due to queue overflow 2nd recovery happens almost immediately after the 1st 30

Slide 31

Slide 31 text

Impact of queuing and drops (VDSL) data cannot be used until packet drops are repaired moments of idle queuing due to bursty sending packet drops, and hence recoveries due to queue overflow 2nd recovery happens almost immediately after the 1st 31

Slide 32

Slide 32 text

Impact of queuing and drops (VDSL) data cannot be used until packet drops are repaired moments of idle queuing due to bursty sending packet drops, and hence recoveries due to queue overflow 2nd recovery happens almost immediately after the 1st excess draining followed by a burst 32

Slide 33

Slide 33 text

● Utilize the available bandwidth as soon as possible ○ Initial window larger than 30 packets ○ More aggressive growth than 2x per RTT Think of an ideal startup 33

Slide 34

Slide 34 text

● Utilize the available bandwidth as soon as possible ○ Initial window larger than 30 packets ○ More aggressive growth than 2x per RTT ● Minimize the negative impact of packet drops ○ To avoid drops in short transmissions, delay the initial drop as long as possible ○ Reduce the number of recovery events ○ Reduce the number of packets dropped per recovery Think of an ideal startup 34

Slide 35

Slide 35 text

● Utilize the available bandwidth as soon as possible ○ Initial window larger than 30 packets ○ More aggressive growth than 2x per RTT ● Minimize the negative impact of packet drops ○ To avoid drops in short transmissions, delay the initial drop as long as possible ○ Reduce the number of recovery events ○ Reduce the number of packets dropped per recovery ● To mitigate the risk of overflowing a queue other than the one immediately before the bottleneck, avoid bursty sending Think of an ideal startup 35

Slide 36

Slide 36 text

● Proposed by Fastly (1st Internet-Draft submitted in Nov 2025)
                  Slow Start            Rapid Start
initial sending   stops after 0.5 RTT   stops after 1 RTT
increase          2x per RTT            3x per RTT (switches to 2x when observing queue buildup)
recovery          CWND *= 0.5           determine CWND based on packet drop ratio
※CWND: estimate of the full BDP (idle BDP + queue capacity)
Rapid Start 36

Slide 37

Slide 37 text

● Slow Start: ○ Sends IW packets for 0.5 RTT ● Rapid Start: ○ Sends 2x IW packets for 1 RTT Rapid Start: initial sending 37

Slide 38

Slide 38 text

● Slow Start: ○ Sends IW packets for 0.5 RTT ● Rapid Start: ○ Sends 2x IW packets for 1 RTT ○ Risk: potential queue buildup and earlier packet drops ■ But no more bursty than Slow Start, as the interval between each packet sent remains the same Rapid Start: initial sending 38

Slide 39

Slide 39 text

● Slow Start: 2x ● Rapid Start: ○ queue_buildup? ? 2x : 3x (see Note) Note: the recommended threshold for detecting queue buildup is rtt_floor > min(rtt_min + 4ms, rtt_min * 1.1), where rtt_floor is the smallest RTT observed over the most recent 1 RT Rapid Start: CWND increase 39

Slide 40

Slide 40 text

● Slow Start: 2x ● Rapid Start: ○ queue_buildup? ? 2x : 3x (see Note) ○ rationale: queue buildup is an outcome of the sender sending faster than the bottleneck link ■ slower increase delays the chance of packet drops Note: the recommended threshold for detecting queue buildup is rtt_floor > min(rtt_min + 4ms, rtt_min * 1.1), where rtt_floor is the smallest RTT observed over the most recent 1 RT Rapid Start: CWND increase 40
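As a rough illustration, the threshold in the note can be written as a predicate that selects the growth factor. This is a sketch, not code from any implementation: the method names are hypothetical, RTTs are assumed to be in milliseconds, and the mapping (3x by default, 2x once queue buildup is observed) follows the comparison table earlier in the deck.

```ruby
# Illustrative only: method names are hypothetical, RTT values are in ms.
# rtt_floor_ms: smallest RTT observed over the most recent round trip.
# rtt_min_ms:   smallest RTT observed over the life of the connection.
def queue_buildup?(rtt_floor_ms, rtt_min_ms)
  rtt_floor_ms > [rtt_min_ms + 4, rtt_min_ms * 1.1].min
end

# 3x per RTT while no queue is detected, 2x once buildup is observed.
def growth_factor(rtt_floor_ms, rtt_min_ms)
  queue_buildup?(rtt_floor_ms, rtt_min_ms) ? 2 : 3
end

puts growth_factor(68, 66) # threshold is min(70, 72.6) = 70, so no buildup
puts growth_factor(75, 66) # 75 > 70, so buildup is assumed
puts growth_factor(12, 10) # threshold is min(14, 11) = 11, so buildup
```

Taking the minimum of the absolute and relative margins makes the check tighter on short paths (the 10% rule applies) and caps the margin at 4ms on long ones.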

Slide 41

Slide 41 text

● Slow Start enters 2nd recovery almost immediately Slow Start’s recovery problem 41

Slide 42

Slide 42 text

Recap: Impact of queuing and drops data cannot be used until packet drops are repaired moments of idle queuing due to bursty sending packet drops, and hence recoveries due to queue overflow 2nd recovery happens almost immediately after the 1st excess draining followed by a burst 42

Slide 43

Slide 43 text

● Slow Start enters 2nd recovery almost immediately Slow Start’s recovery problem 43

Slide 44

Slide 44 text

● Slow Start enters 2nd recovery almost immediately, because: Slow Start’s recovery problem 44

Slide 45

Slide 45 text

● Slow Start enters 2nd recovery almost immediately, because: ○ Packet drops are observed 1 RT after overflow (i.e., when CWND ~ full BDP) Slow Start’s recovery problem 45

Slide 46

Slide 46 text

● Slow Start enters 2nd recovery almost immediately, because: ○ Packet drops are observed 1 RT after overflow (i.e., when CWND ~ full BDP) ○ As Slow Start increases CWND by 2x per RTT, CWND ~ 2x full_BDP when observing a drop Slow Start’s recovery problem 46

Slide 47

Slide 47 text

● Slow Start enters 2nd recovery almost immediately, because: ○ Packet drops are observed 1 RT after overflow (i.e., when CWND ~ full BDP) ○ As Slow Start increases CWND by 2x per RTT, CWND ~ 2x full_BDP when observing a drop ○ Reducing CWND to half yields the full BDP, and therefore congestion control immediately fills the bottleneck again Slow Start’s recovery problem 47

Slide 48

Slide 48 text

● Slow Start enters 2nd recovery almost immediately, because: ○ Packet drops are observed 1 RT after overflow (i.e., when CWND ~ full BDP) ○ As Slow Start increases CWND by 2x per RTT, CWND ~ 2x full_BDP when observing a drop ○ Reducing CWND to half yields the full BDP, and therefore congestion control immediately fills the bottleneck again ● Reducing CWND to ¼ is not a good solution, because that would fully drain the queue, leading to underutilization of the bottleneck link Slow Start’s recovery problem 48

Slide 49

Slide 49 text

● For each ack or packet drop, gradually decrease CWND so that, at the exit of recovery, CWND becomes 0.5 * bytes_acked_in_recovery ○ because the bytes acked in 1 RT reflect the full BDP ● Benefits: ○ Works regardless of the increase ratio ○ As CWND is gradually reduced, transmission resumes before the queue is fully drained Rapid Start: recovery 49

Slide 50

Slide 50 text

● Upon entering recovery: cwnd *= 5/6 ● For each ACK: cwnd -= 1/3 * bytes_newly_acked ● For each loss: cwnd -= 5/6 * bytes_newly_lost See draft-kazuho-ietf-rapid-start-02 to see how these constants are derived Rapid Start: recovery 50
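The three update rules can be checked against the stated goal (CWND at recovery exit equals 0.5 * bytes acked in recovery) with a small sketch. The class and method names are hypothetical, and the check assumes that every byte in flight at recovery entry is later either acked or declared lost; the draft remains the authority on how the constants are derived.

```ruby
# Hypothetical sketch of the recovery update rules; not taken from any
# implementation. Rational keeps the arithmetic exact.
class RapidStartRecovery
  attr_reader :cwnd

  # Upon entering recovery: cwnd *= 5/6
  def initialize(cwnd_at_entry)
    @cwnd = Rational(cwnd_at_entry) * Rational(5, 6)
  end

  # For each ACK: cwnd -= 1/3 * bytes_newly_acked
  def on_ack(bytes_newly_acked)
    @cwnd -= Rational(1, 3) * bytes_newly_acked
  end

  # For each loss: cwnd -= 5/6 * bytes_newly_lost
  def on_loss(bytes_newly_lost)
    @cwnd -= Rational(5, 6) * bytes_newly_lost
  end
end

# Assumed scenario: 600,000 bytes in flight at entry, of which
# 250,000 are acked and the remaining 350,000 are lost.
cwnd_at_entry = 600_000
acked = 250_000
lost  = cwnd_at_entry - acked

recovery = RapidStartRecovery.new(cwnd_at_entry)
recovery.on_ack(acked)
recovery.on_loss(lost)
puts recovery.cwnd.to_i
```

With acked + lost equal to the CWND at entry, the result is 5/6 * (acked + lost) - 1/3 * acked - 5/6 * lost = 1/2 * acked, independent of the loss ratio.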

Slide 51

Slide 51 text

Rapid Start: recovery 51

Slide 52

Slide 52 text

Rapid Start on Simulator (VDSL) 52

Slide 53

Slide 53 text

Rapid Start on Simulator (VDSL) no idle moments 53

Slide 54

Slide 54 text

Rapid Start on Simulator (VDSL) no idle moments queue buildup 54

Slide 55

Slide 55 text

Rapid Start on Simulator (VDSL) no idle moments queue buildup packet drops 55

Slide 56

Slide 56 text

Rapid Start on Simulator (VDSL) enters recovery only once, but takes longer to repair drops due to 3x overshoot no idle moments queue buildup packet drops 56

Slide 57

Slide 57 text

Rapid Start on Simulator (VDSL) enters recovery only once, but takes longer to repair drops due to 3x overshoot no idle moments queue buildup packet drops lands at the right queue depth 57

Slide 58

Slide 58 text

Evaluating in Production

Slide 59

Slide 59 text

● HTTP/3 connections divided into 4 groups
● For connections serving cached objects >= 200KB as the first request, record transport-level statistics and TTLB when all bytes of that cached object are acked
● For 1 week on 7 POPs across the globe: East / SE Asia, East / West Europe, Africa, North / South America
group                   initial sending      increase   recovery
baseline (slow start)   30 pkts in 0.5 RTT   2x         CWND *= 0.5
jumpstart               60 pkts in 1 RTT     2x         CWND *= 0.5
rapid-no-jump           30 pkts in 0.5 RTT   3x / 2x    CWND reduced relative to loss ratio
rapidstart              60 pkts in 1 RTT     3x / 2x    CWND reduced relative to loss ratio
Setup: divided into 4 groups 59

Slide 60

Slide 60 text

{"module":"h2o","type":"h3s_stream0_ttlb","tid":397502,"time":1773026516280,"conn_id":1907798,"method":"GET","content_length":226578,"ttlb":364,"num-packets.received":24,"num-packets.decryption-failed":0,"num-packets.sent":191,"num-packets.lost":0,"num-packets.lost-time-threshold":0,"num-packets.ack-received":191,"num-packets.late-acked":0,"num-packets.initial-received":2,"num-packets.zero-rtt-received":0,"num-packets.handshake-received":2,"num-packets.initial-sent":1,"num-packets.zero-rtt-sent":0,"num-packets.handshake-sent":4,"num-packets.received-out-of-order":0,"num-packets.received-ecn-ect0":0,"num-packets.received-ecn-ect1":0,"num-packets.received-ecn-ce":0,"num-packets.acked-ecn-ect0":0,"num-packets.acked-ecn-ect1":0,"num-packets.acked-ecn-ce":0,"num-packets.sent-promoted-paths":0,"num-packets.ack-received-promoted-paths":0,"num-packets.max-delayed":0,"num-packets.delayed-used":0,"num-bytes.received":4737,"num-bytes.sent":236943,"num-bytes.lost":0,"num-bytes.ack-received":236895,"num-bytes.stream-data-sent":231728,"num-bytes.stream-data-resent":226,"num-frames-received.padding":3259,"num-frames-received.ping":1,"num-frames-received.ack":19,"num-frames-received.reset_stream":0,"num-frames-received.stop_sending":0,"num-frames-received.crypto":2,"num-frames-received.new_token":0,"num-frames-received.stream":2,"num-frames-received.max_data":0,"num-frames-received.max_stream_data":0,"num-frames-received.max_streams_bidi":0,"num-frames-received.max_streams_uni":0,"num-frames-received.data_blocked":0,"num-frames-received.stream_data_blocked":0,"num-frames-received.streams_blocked":0,"num-frames-received.new_connection_id":0,"num-frames-received.retire_connection_id":0,"num-frames-received.path_challenge":0,"num-frames-received.path_response":0,"num-frames-received.transport_close":0,"num-frames-received.application_close":0,"num-frames-received.handshake_done":0,"num-frames-received.datagram":0,"num-frames-received.ack_frequency":0,"num-frames-received.immediate_ack":0,"num-frames-sent.padding":0,"num-frames-sent.ping":1,"num-frames-sent.ack":3,"num-frames-sent.reset_stream":0,"num-frames-sent.stop_sending":0,"num-frames-sent.crypto":7,"num-frames-sent.new_token":2,"num-frames-sent.stream":188,"num-frames-sent.max_data":0,"num-frames-sent.max_stream_data":0,"num-frames-sent.max_streams_bidi":0,"num-frames-sent.max_streams_uni":0,"num-frames-sent.data_blocked":0,"num-frames-sent.stream_data_blocked":0,"num-frames-sent.streams_blocked":0,"num-frames-sent.new_connection_id":6,"num-frames-sent.retire_connection_id":0,"num-frames-sent.path_challenge":0,"num-frames-sent.path_response":0,"num-frames-sent.transport_close":0,"num-frames-sent.application_close":0,"num-frames-sent.handshake_done":1,"num-frames-sent.datagram":0,"num-frames-sent.ack_frequency":0,"num-frames-sent.immediate_ack":0,"num-paths.created":0,"num-paths.validated":0,"num-paths.validation-failed":0,"num-paths.migration-elicited":0,"num-paths.promoted":0,"num-paths.closed-no-dcid":0,"num-paths.ecn-validated":0,"num-paths.ecn-failed":1,"num-ptos":1,"num-handshake-timeouts":0,"num-initial-handshake-exceeded":0,"num-jumpstart-applicable":1,"quic.jumpstart.applicable":1,"num-rapid-start":0,"num-paced":1,"num-respected-app-limited":0,"handshake-confirmed-msec":369,"jumpstart.prev-rate":0,"jumpstart.prev-rtt":0,"jumpstart.new-rtt":106,"jumpstart.cwnd":0,"quic.jumpstart.time-to-idle":647,"token-sent.at":0,"token-sent.rate":579889,"token-sent.rtt":67,"rtt.minimum":66,"rtt.smoothed":81,"rtt.variance":19,"rtt.latest":75,"loss-thresholds.use-packet-based":1,"loss-thresholds.time-based-percentile":128,"cc.cwnd":273280,"cc.ssthresh":4294967295,"cc.cwnd-initial":44160,"cc.cwnd-exiting-slow-start":0,"cc.exit-slow-start-at":9223372036854775807,"cc.cwnd-exiting-jumpstart":0,"cc.cwnd-minimum":4294967295,"cc.cwnd-maximum":273280,"cc.num-loss-episodes":0,"cc.num-ecn-loss-episodes":0,"delivery-rate.latest":210149,"delivery-rate.smoothed":739154,"delivery-rate.stdev":1078961,"num-sentmap-packets-largest":89}
Example: stats for 1 connection 60

Slide 61

Slide 61 text

● Size of the dataset in 1 experiment: ○ LDJSON of 20M lines; 80GB (3.7GB in .gz) Analyzing data 61

Slide 62

Slide 62 text

● Size of the dataset in 1 experiment: ○ LDJSON of 20M lines; 80GB (3.7GB in .gz) ● Need to apply various ad-hoc queries: ○ jq is the obvious choice, however… Analyzing data 62

Slide 63

Slide 63 text

● The grammar is not intuitive ● Slow ● Not suited for processing huge LDJSON ○ Example: | min buffers the entire input ○ even though log analysis is almost always a streaming, map-reduce-like operation over huge data Issues with jq 63

Slide 64

Slide 64 text

jq -s '{ "min": (map(."rtt.minimum") | min), "max": (map(."rtt.minimum") | max), "avg": (map(."rtt.minimum") | add / length), }' min/max/avg over rtt.minimum 64

Slide 65

Slide 65 text

jq -s '{ "min": (map(."rtt.minimum") | min), "max": (map(."rtt.minimum") | max), "avg": (map(."rtt.minimum") | add / length), }' min/max/avg over rtt.minimum -s buffers entire input; jq essentially stops working when the input is larger than RAM size 65

Slide 66

Slide 66 text

jq -n ' reduce inputs as $o ( {min: null, max: null, sum: 0, n: 0}; ($o."rtt.minimum") as $x | { min: (if .min == null or $x < .min then $x else .min end), max: (if .max == null or $x > .max then $x else .max end), sum: (.sum + $x), n: (.n + 1), } ) | { min, max, avg: (.sum / .n), }' min/max/avg over rtt.minimum With -n, each JSON object is processed separately; but aggregation logic needs to be hand-written 66

Slide 67

Slide 67 text

● Streaming processing is easy to write ● JSON parser is fast ● The script is JIT-compiled Writing ruby scripts instead 67

Slide 68

Slide 68 text

● Streaming processing is easy to write ● JSON parser is fast ● The script is JIT-compiled ● However: ○ It becomes too long as a one-liner ○ Ends up as a script with many, many options ■ Hard to maintain Writing ruby scripts instead 68

Slide 69

Slide 69 text

● Streaming processing is easy to write ● JSON parser is fast ● The script is JIT-compiled ● However: ○ It becomes too long as a one-liner ○ Ends up as a script with many, many options ■ Hard to maintain ○ Letting AI write it is an option, but how would you verify that your ad-hoc query is converted to correct code? Writing ruby scripts instead 69
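For reference, a hand-written streaming script of the kind described here might look like the sketch below. It computes the same min/max/avg over rtt.minimum as the jq examples while holding only running aggregates in memory. The method name and the sample rows are made up for illustration.

```ruby
require "json"
require "stringio"

# Streams NDJSON from any IO object; memory use does not grow with input size.
def min_max_avg(io, key)
  min = max = nil
  sum = 0
  n = 0
  io.each_line do |line|
    x = JSON.parse(line)[key]
    next if x.nil? # skip rows that lack the field
    min = x if min.nil? || x < min
    max = x if max.nil? || x > max
    sum += x
    n += 1
  end
  { "min" => min, "max" => max, "avg" => (n.zero? ? nil : sum.to_f / n) }
end

# Made-up sample input; real use would pass $stdin or an open log file.
sample = StringIO.new(<<~NDJSON)
  {"rtt.minimum":66}
  {"rtt.minimum":12}
  {"rtt.minimum":105}
NDJSON

result = min_max_avg(sample, "rtt.minimum")
puts JSON.generate(result)
```

Even this small query needs around twenty lines, and each new aggregate or filter tends to add another option or branch, which is the maintenance problem the slide points at.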

Slide 70

Slide 70 text

● more SQL-like grammar + ruby DSL Writing jq (improved) in ruby 70

Slide 71

Slide 71 text

● more SQL-like grammar + ruby DSL ● compile the query language using eval ○ let JIT optimize the runtime and the query altogether Writing jq (improved) in ruby 71

Slide 72

Slide 72 text

● more SQL-like grammar + ruby DSL ● compile the query language using eval ○ let JIT optimize the runtime and the query altogether ● Streaming processing of NDJSON Writing jq (improved) in ruby 72

Slide 73

Slide 73 text

jrf '{ "min" => min(_["rtt.minimum"]), "max" => max(_["rtt.minimum"]), "avg" => average(_["rtt.minimum"]), }' jrf 73

Slide 74

Slide 74 text

# Filter then extract jrf 'select(_["x"] > 10) >> _["foo"]' # Aggregate jrf 'select(_["item"] == "Apple") >> sum(_["count"])' jrf 'percentile(_["ttlb"], 0.50)' # Group by key and aggregate jrf 'group_by(_["item"]) { |row| sum(row["count"] * row["price"]) }' jrf 74

Slide 75

Slide 75 text

● Syntax: stage connected using >> ○ Each stage is just a ruby block ● Filter: ○ select(expr) ● Transform: ○ _["foo"] ● Aggregation: ○ min(expr), max(expr), sum(expr), … ○ reduce(initial) { any ruby code } jrf 75

Slide 76

Slide 76 text

class Stage
  def initialize(block, src: nil)
    ...
    @ctx = Class.new(RowContext) do
      define_method(:__jrf_expr__, &block)
    end
  end
end
# instantiated as:
Stage.new(eval("proc { #{stage[:src]} }", ...))
Each stage expression is converted to a method, and gets called
jrf - internals 76
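The eval plus define_method technique can be tried in isolation. The sketch below is a simplified stand-in, not jrf's actual code: this RowContext, its `_` accessor, and the `call` method are assumptions made so that the example is self-contained.

```ruby
# Simplified stand-in for the slide's classes; not jrf's real implementation.
class RowContext
  def initialize(row)
    @row = row
  end

  # `_` exposes the current row to the stage expression (assumed convention).
  def _
    @row
  end
end

class Stage
  def initialize(block)
    # The block becomes an ordinary instance method, so `_` inside the
    # expression resolves against the RowContext instance.
    @ctx = Class.new(RowContext) do
      define_method(:__jrf_expr__, &block)
    end
  end

  def call(row)
    @ctx.new(row).__jrf_expr__
  end
end

# The query text is compiled once with eval, then invoked per row.
src = '_["rtt.minimum"] * 2'
stage = Stage.new(eval("proc { #{src} }"))
puts stage.call({ "rtt.minimum" => 66 })
```

Because the expression ends up as a regular method body, each row costs a plain method call, which is the kind of code a JIT can optimize together with the surrounding runtime.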

Slide 77

Slide 77 text

● In typical log processing: ○ filtering and transformation happen before aggregation ○ logs are split into multiple files jrf -P 10 'filter >> transform >> reduce' jrf - automatic parallelization 77

Slide 78

Slide 78 text

● In typical log processing: ○ filtering and transformation happen before aggregation ○ logs are split into multiple files ● Therefore, processing of each file can be parallelized for: ○ filtering and transformations in stages upfront jrf -P 10 'filter >> transform >> reduce' jrf - automatic parallelization 78

Slide 79

Slide 79 text

● In typical log processing: ○ filtering and transformation happen before aggregation ○ logs are split into multiple files ● Therefore, processing of each file can be parallelized for: ○ filtering and transformations in stages upfront ○ certain aggregations (e.g., min, max, sum) ■ each worker calculates its own, then the results are merged jrf -P 10 'filter >> transform >> reduce' jrf - automatic parallelization 79

Slide 80

Slide 80 text

● Internally, jrf does the following: 1. Dry-runs the 1st JSON object through each stage to find the first few stages that can be parallelized. 2. Calls fork(2) to spawn workers that process those stages in parallel. 3. Each worker emits its result as NDJSON to a pipe. 4. The main process reads from the pipes and feeds the input to the remaining stages. jrf - automatic parallelization 80
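Steps 2 to 4 can be sketched with fork and pipes as follows. This is a simplified illustration rather than jrf's implementation: it starts one worker per file (no worker-count limit like -P), hard-codes a min aggregation, and uses temporary files with made-up data. It requires a platform that supports fork.

```ruby
require "json"
require "tempfile"

# One worker per input file; each computes a partial min and reports it
# to the parent as a single NDJSON line over a pipe.
def parallel_min(paths, key)
  workers = paths.map do |path|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      partial = nil
      File.foreach(path) do |line|
        value = JSON.parse(line)[key]
        next if value.nil?
        partial = value if partial.nil? || value < partial
      end
      writer.puts(JSON.generate({ "min" => partial }))
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
    writer.close # the parent keeps only the read end
    [pid, reader]
  end

  # The parent merges the partial results (the remaining "reduce" stage).
  partials = workers.map do |pid, reader|
    partial = JSON.parse(reader.read)["min"]
    reader.close
    Process.wait(pid)
    partial
  end
  partials.compact.min
end

# Made-up input: two small NDJSON files.
files = [[66, 12], [105, 40]].map do |values|
  file = Tempfile.new(["sample", ".ndjson"])
  values.each { |v| file.puts(JSON.generate({ "rtt.minimum" => v })) }
  file.flush
  file
end

result = parallel_min(files.map(&:path), "rtt.minimum")
puts result
files.each(&:close!)
```

Aggregations such as min, max, and sum merge trivially because combining partial results gives the same answer as a single pass, which is why only certain aggregations qualify for this treatment.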

Slide 81

Slide 81 text

● min: ○ jq -s 'map(."rtt.minimum") | min' ○ jq -n 'reduce inputs."rtt.minimum" as $x (null; if . == null or $x < . then $x else . end)' ○ jrf 'min(_["rtt.minimum"])' ○ jrf -P 10 'min(_["rtt.minimum"])' jrf - benchmark 81

Slide 82

Slide 82 text

jrf - benchmark
            950MB (single file)                      81.4GB (29 files)
            min(rtt.minimum)  TTLB percentile delta  min(rtt.minimum)  TTLB percentile delta
jq -s                                                out of memory     out of memory
jq -n
jrf
jrf -P 10
all units in seconds 82

Slide 83

Slide 83 text

● TTLB percentile delta:
○ jq -n '
    include "helpers";
    0.1 as $step
    | reduce inputs as $row (
        {"baseline": [], "jumpstart": [], "rapid-no-jump": [], "rapidstart": []};
        if ($row | base_cond(200000; 400000))
        then .[$row | group_name] += [$row.ttlb]
        else .
        end
      )
    | with_entries(.value |= percentiles($step))
    | .baseline as $baseline
    | with_entries(select(.key != "baseline"))
    | with_entries(
        .value |= [range(0; length) as $i | (.[$i] / $baseline[$i] - 1)]
      )'
jrf - benchmark 83

Slide 84

Slide 84 text

● TTLB percentile delta: ○ jrf 'select(base_cond(_, 200000, 400000)) >> [group_name(_), _["ttlb"]] >> group_by(_[0]) { percentile(_[1], $perc ||= 0.05.step(0.95, 0.1)) } >> map_values{|arr| arr.zip(_["baseline"]).map {|v,bv| v.to_f / bv - 1 } } >> _.reject{|k| k == "baseline"}' ○ jrf -P 10 '...(same as above)...' jrf - benchmark 84

Slide 85

Slide 85 text

jrf - benchmark
            950MB (single file)                      81.4GB (29 files)
            min(rtt.minimum)  TTLB percentile delta  min(rtt.minimum)  TTLB percentile delta
jq -s       7.93              8.52                   out of memory     out of memory
jq -n       7.45              13.59                  667.44            > 1800
jrf         2.29              2.39                   226.80            240.92
jrf -P 10   2.27              2.38                   31.41             31.69
all units in seconds 85

Slide 86

Slide 86 text

jrf - benchmark
            950MB (single file)                      81.4GB (29 files)
            min(rtt.minimum)  TTLB percentile delta  min(rtt.minimum)  TTLB percentile delta
jq -s       7.93              8.52                   out of memory     out of memory
jq -n       7.45              13.59                  667.44            > 1800
jrf         2.29              2.39                   226.80            240.92
jrf -P 10   2.27              2.38                   31.41             31.69
speedup     3.3x              3.6x                   21x               > 50x
all units in seconds 86

Slide 87

Slide 87 text

jrf - benchmark
            950MB (single file)                      81.4GB (29 files)
            min(rtt.minimum)  TTLB percentile delta  min(rtt.minimum)  TTLB percentile delta
jq -s       7.93              8.52                   out of memory     out of memory
jq -n       7.45              13.59                  667.44            > 1800
jrf         2.29              2.39                   226.80            240.92
jrf -P 10   2.27              2.38                   31.41             31.69
speedup     3.3x              3.6x                   21x               > 50x
throughput                                           2.6GB/s
all units in seconds 87
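The speedup and throughput figures can be cross-checked from the timings. The pairing used below (fastest jq variant divided by jrf -P 10, and dataset size divided by the jrf -P 10 time) is an assumption that happens to reproduce the slide's numbers; the slide itself does not state how they were computed.

```ruby
# Timings in seconds, copied from the benchmark table.
# Assumption: each speedup compares the fastest jq variant with jrf -P 10.
FASTEST_JQ = { small_min: 7.45, small_delta: 8.52, large_min: 667.44 }
JRF_P10    = { small_min: 2.27, small_delta: 2.38, large_min: 31.41 }

speedups = FASTEST_JQ.to_h do |query, jq_seconds|
  [query, (jq_seconds / JRF_P10[query]).round(1)]
end

# Assumption: throughput = dataset size / jrf -P 10 time for min(rtt.minimum).
throughput_gb_per_s = (81.4 / JRF_P10[:large_min]).round(1)

speedups.each { |query, ratio| puts "#{query}: #{ratio}x" }
puts "throughput: #{throughput_gb_per_s}GB/s"
```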

Slide 88

Slide 88 text

● Written 99.9% by Codex and Claude ○ Required thorough human design review; otherwise, AI often broke the design structure that warrants efficiency ● Productivity and correctness improved thanks to: ○ AI generating the engine (jrf) and its test suite ○ Humans and AI writing jrf queries in the DSL, which are declarative, concise, easier to understand and maintain jrf - use of AI 88

Slide 89

Slide 89 text

● Now that we have the tool, how do we present the TTLBs as charts? ○ next slide shows an example (of mine) from IETF 121 Visualization of A/B tests 89

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

● Now that we have the tool, how do we present the TTLBs as charts? ○ apparently, 2D charts using percentiles / TTLB do not work Visualization of A/B tests 91

Slide 92

Slide 92 text

● Now that we have the tool, how do we present the TTLBs as charts? ○ apparently, 2D charts using percentiles / TTLB do not work ● The answer is to use: ○ vertical axis: percentiles ○ Horizontal axis: percentage delta of TTLB Visualization of A/B tests 92
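The data behind such a chart is one (percentile, delta %) pair per percentile. The sketch below shows one way to compute it. The nearest-rank percentile definition is an assumption (jrf's percentile() may differ), the method names are illustrative, and the TTLB samples are made up.

```ruby
# Nearest-rank percentile over a sorted array; `pct` is an integer 1..100.
# This definition is an assumption, not necessarily what jrf uses.
def percentile(sorted, pct)
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Returns [percentile, delta %] pairs: vertical axis first, horizontal second.
def percentile_deltas(baseline, treatment, pcts)
  b = baseline.sort
  t = treatment.sort
  pcts.map do |pct|
    delta = (percentile(t, pct).to_f / percentile(b, pct) - 1) * 100
    [pct, delta.round(1)]
  end
end

# Made-up TTLB samples in milliseconds.
baseline   = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
rapidstart = [90, 170, 240, 340, 425, 510, 630, 700, 720, 850]

deltas = percentile_deltas(baseline, rapidstart, [10, 50, 90])
deltas.each { |pct, delta| puts format("P%02d %+.1f%%", pct, delta) }
```

Plotting the delta rather than the raw TTLB keeps every percentile on the same scale, so a change of a few percent at P10 stays as visible as one at P99.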

Slide 93

Slide 93 text

● All POPs ● All objects ≥ 200KB ● With Rapid Start, TTLB is reduced by 14.7% Note: the sawtooth pattern at the lower percentiles is due to the clock granularity being 1ms TTLB Reduction: Global 93

Slide 94

Slide 94 text

94 TTLB Reduction: per-POP ● TTLB reduction: 10.8% ~ 21.5%

Slide 95

Slide 95 text

95 ● Global data for different size bins: 200KB - 400KB / 400KB - 800KB / 800KB - 1.6MB / 1.6MB - 3.2MB ● TTLB reduction: 10.6% (1.6MB - 3.2MB) ~ 14.9% (200KB - 400KB) TTLB Reduction: by Object Size Bin

Slide 96

Slide 96 text

Packet Loss Ratio: Global 96
       slow start (baseline)   jumpstart   rapid-no-jump   rapidstart
avg.   1.52%                   1.61%       1.92%           1.98%
P50    0.62%                   0.62%       0.90%           0.85%
P90    4.36%                   4.57%       4.99%           5.06%
P99    13.80%                  14.22%      14.97%          15.55%

Slide 97

Slide 97 text

97 Packet Loss Ratio: per-POP POP with largest P99 PLR: ● slow start: 19.60% ● jumpstart: 20.05% ● rapid-no-jump: 20.99% ● rapid start: 21.65%

Slide 98

Slide 98 text

98 TTLB Reduction: per-POP ● TTLB reduction: 10.8% ~ 21.5% But why is the shape different for North America? To find an answer, you’d chat with AI and run tens of queries: such iteration is only possible with jrf.

Slide 99

Slide 99 text

Wrap up

Slide 100

Slide 100 text

● To analyze logs, it is paramount to have an intuitive query DSL that runs fast: ○ easy to run ad-hoc queries ○ no need to set up & maintain query infrastructure jrf for fast log analysis 100

Slide 101

Slide 101 text

● To analyze logs, it is paramount to have an intuitive query DSL that runs fast: ○ easy to run ad-hoc queries ○ no need to set up & maintain query infrastructure ● jrf is an NDJSON query program ○ with a DSL based on and extensible using ruby ○ runs 20x faster than jq ■ 2.6GB/sec on a 10 core CPU jrf for fast log analysis 101

Slide 102

Slide 102 text

● Ruby is a powerful tool for writing DSL executors: ○ the syntax is DSL friendly ○ the entire workflow can be JIT-compiled ○ has highly optimized libraries (e.g., JSON) Ruby for optimized tooling 102

Slide 103

Slide 103 text

● Ruby is a powerful tool for writing DSL executors: ○ the syntax is DSL friendly ○ the entire workflow can be JIT-compiled ○ has highly optimized libraries (e.g., JSON) ● AI has made it much easier to build well-tested DSL executors. Relying on them lets us work at a higher level, improving productivity without having to trust untested AI-written code to do the right thing. Ruby for optimized tooling 103

Slide 104

Slide 104 text

● To visualize network-related performance tests, consider using 2D charts that: ○ use percentiles for the vertical axis ○ use delta % from baseline for the horizontal axis Visualizing network perf tests 104

Slide 105

Slide 105 text

● TLS/1.3 and QUIC reduced handshake latency ● Next step is reducing TTLB: ○ Rapid Start replaces Slow Start, and reduces TTLB by 14.7% globally (>=200KB objects) ○ Ruby was an essential tool for developing Rapid Start Rapid Start 105

Slide 106

Slide 106 text

● TLS/1.3 and QUIC reduced handshake latency ● Next step is reducing TTLB: ○ Rapid Start replaces Slow Start, and reduces TTLB by 14.7% globally (>=200KB objects) ○ Ruby was an essential tool for developing Rapid Start Rapid Start Ruby is making the Web faster! 106