Slide 1

No content

Slide 2

SDB: Efficient Ruby Stack Scanning Without the GVL
Mike Yang, 2025-04-16

Slide 3

Why a New Stack Profiler?

Slide 4

Instrumentation & Stack Profiler
• Instrumentation as the default solution:
  • Insert code around functions to measure latency.
  • Can’t cover all functions → blind spots.
• Enable a stack profiler when instrumentation doesn’t help.
Why not use a stack profiler in the first place?
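The idea of "inserting code around functions" can be sketched in Rust (the language SDB itself is written in); `instrument` here is a hypothetical helper, not an API from any of the libraries discussed:

```rust
use std::time::Instant;

// Hypothetical helper: run a closure and report its latency, the way
// instrumentation libraries wrap selected methods. Anything not wrapped
// this way is invisible to the tool -- a blind spot.
fn instrument<T>(name: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{} took {:?}", name, start.elapsed());
    result
}

fn main() {
    let sum: u64 = instrument("sum_1_to_100", || (1..=100u64).sum());
    assert_eq!(sum, 5050);
}
```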

Slide 5

Why Not Use a Stack Profiler Only?
• Instrumentation
  • Implementation-dependent
  • Blind spots
  • Implementation effort
  • Maintenance effort
• Stack Profiler
  • Implementation-agnostic
  • Fewer blind spots
  • Less implementation effort
  • Less maintenance effort

Slide 6

Why Not Use a Stack Profiler Only?
• In practice, this doesn’t happen.
• Perhaps because most stack profilers:
  • are designed for average latency
  • have high CPU usage
  • slow down Ruby applications due to the GVL

Slide 7

An Instrumentation Replacement
• Sampling interval: 10 ms → 1 ms (or less).
• CPU usage: below 5%, or even 3%.
• No GVL held while scanning.

Slide 8

The Design of a Stack Profiler

Slide 9

A General Ruby Stack Profiler

Slide 10

Analyzer
• Aggregates data.
• Converts data to other formats.
• Does the heavy work.
• Solution: perform lazy and offline analysis.

Slide 11

Symbolizer
• Converts memory addresses into readable symbols.
• The cost is manageable if an address isn’t converted repeatedly.
• Solution: cache the results.
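A minimal Rust sketch of the caching idea; `Symbolizer`, `resolve_slow`, and the symbol format are illustrative, not SDB's actual API:

```rust
use std::collections::HashMap;

// Minimal sketch of a caching symbolizer: resolving an address is
// expensive, so each address is resolved at most once.
struct Symbolizer {
    cache: HashMap<u64, String>,
    resolutions: usize, // counts slow lookups, for illustration
}

impl Symbolizer {
    fn new() -> Self {
        Symbolizer { cache: HashMap::new(), resolutions: 0 }
    }

    // Stand-in for the real, expensive lookup (e.g. reading ISeq data).
    fn resolve_slow(&mut self, addr: u64) -> String {
        self.resolutions += 1;
        format!("method_at_{:#x}", addr)
    }

    fn symbolize(&mut self, addr: u64) -> String {
        if let Some(sym) = self.cache.get(&addr) {
            return sym.clone(); // cache hit: no slow lookup
        }
        let sym = self.resolve_slow(addr);
        self.cache.insert(addr, sym.clone());
        sym
    }
}

fn main() {
    let mut s = Symbolizer::new();
    let first = s.symbolize(0x1000);
    let second = s.symbolize(0x1000); // served from the cache
    assert_eq!(first, second);
    assert_eq!(s.resolutions, 1);
}
```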

Slide 12

No content

Slide 13

Stack Scanner
• Traverses and collects stack data regularly.
• A typical live stack has 100 to 200 frames.
• Low impact at a low scanning rate, compared to the other components.
• But due to the GVL, it:
  • slows down applications
  • limits scanning frequency and accuracy

Slide 14

Scanning the Stack Without the GVL

Slide 15

The Stack Scanner
• The scanner reads frames while the Ruby VM updates them.
• A frame is a Ruby-internal struct.
• The GVL seems to be necessary.

Slide 16

The GVL/Lock
• Without a lock, there is no data-integrity guarantee.

Slide 17

The GVL/Lock
• Locks serialize critical sections.

Slide 18

The GVL/Lock
• A lock makes a group of instructions atomic.
• What about a single instruction?

Slide 19

Single-Field Reading
• Example: a counter* read without a lock.
  • may return an outdated value, never a corrupted one
• Aligned 64-bit memory reads and writes are atomic[1][2]**
  • a read or write either happens completely or not at all
  • no intermediate state
  • guaranteed by hardware

* It only uses an aligned 64-bit memory cell; it is not an atomic counter (no CAS).
** Atomicity must be guaranteed by both the hardware and the compiler; some languages, such as Rust, may compile one 64-bit operation into two instructions[3].
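A runnable Rust illustration of the lock-free counter: the writer publishes values with plain atomic stores, and a concurrent reader may observe a stale value but never a torn one. Note that, per the footnote above, Rust requires `AtomicU64` to express this; a plain shared `u64` would be a data race at the language level:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Relaxed load of an aligned 64-bit value: may be stale, never torn.
fn read_counter(c: &AtomicU64) -> u64 {
    c.load(Ordering::Relaxed)
}

fn main() {
    let counter = Arc::new(AtomicU64::new(0));
    let writer = {
        let c = Arc::clone(&counter);
        thread::spawn(move || {
            for i in 1..=100_000u64 {
                c.store(i, Ordering::Relaxed); // one atomic store, no lock
            }
        })
    };
    // The reader runs concurrently: any value it sees is one the writer
    // actually stored -- possibly outdated, never an intermediate state.
    let seen = read_counter(&counter);
    assert!(seen <= 100_000);
    writer.join().unwrap();
    assert_eq!(read_counter(&counter), 100_000);
}
```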

Slide 20

The Stack Scanner
• Only needs one field: *iseq (the address of the ISeq).
• *iseq is a 64-bit aligned memory cell.
• It’s safe to read it without the GVL.

Slide 21

The Stack Lifecycle
• The stack is an array of frames, allocated when a thread is created and freed when it ends.
• While a thread is alive, its frame array is valid and safe to read.

Slide 22

The Stack Lifecycle
• Wrap code before and after a thread’s block.
• This synchronizes the Ruby VM and the stack scanner.
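One way to sketch this synchronization in Rust (names are illustrative; SDB hooks the Ruby VM's actual thread lifecycle): a registry of live stacks, updated when threads start and end, so the scanner never touches a freed stack:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Sketch of the lifecycle synchronization: a registry of live stacks,
// updated at thread start and end, so the scanner never reads a freed
// stack. Names are illustrative, not SDB's actual API.
#[derive(Default)]
struct Registry {
    live: Mutex<HashSet<u64>>, // stack addresses of live threads
}

impl Registry {
    fn on_thread_start(&self, stack_addr: u64) {
        self.live.lock().unwrap().insert(stack_addr);
    }
    fn on_thread_end(&self, stack_addr: u64) {
        self.live.lock().unwrap().remove(&stack_addr);
    }
    // The scanner walks only registered (still-live) stacks.
    fn live_count(&self) -> usize {
        self.live.lock().unwrap().len()
    }
}

fn main() {
    let reg = Registry::default();
    reg.on_thread_start(0xdead_0000);
    reg.on_thread_start(0xbeef_0000);
    assert_eq!(reg.live_count(), 2);
    reg.on_thread_end(0xdead_0000);
    assert_eq!(reg.live_count(), 1); // an ended thread is never scanned
}
```

Note the lock here guards only the rare lifecycle events, not the per-frame reads, so the scanner still runs without the GVL.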

Slide 23

Work with Ruby GC
• GC frees the memory of inaccessible objects.
• The scanner only scans live threads.

Slide 24

Work with Ruby GC
• Memory compaction moves objects to different locations.
• The stack scanner only reads non-movable data:
  • the execution context and its stack
• For movable objects:
  • Option 1: read immutable data outside of GC
  • Option 2: insert fine-grained locks for them

Slide 25

A Stack Scanner Without the GVL
• Reading *iseq is atomic.
• Thread state is synced with the Ruby VM.
• Only fixed (non-moving) objects are read.
Result: SDB stack scanning is safe.

Slide 26

Data Races
• The Ruby VM updates the stack while SDB reads it.
• Not a major issue; it remains unsolved.
• Similar tools:
  • rbspy[4] and py-spy[5] offer a non-blocking mode.
  • Java’s async-profiler[6] uses the non-atomic function AsyncGetCallTrace.

Slide 27

Data Races
• Actionability[7] > strict correctness.
• LDB[8] uses generation numbers for concurrency control.
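The generation-number idea resembles a seqlock; a minimal Rust sketch (not LDB's or SDB's actual code): the writer bumps a counter before and after each update, and the reader retries when the counter is odd or changed across its read:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Seqlock-style generation numbers: the generation is odd while a write
// is in progress; a reader retries until it sees the same even value
// before and after reading.
struct SeqCell {
    gen: AtomicU64,
    value: AtomicU64, // stand-in for a multi-field frame record
}

impl SeqCell {
    fn new(v: u64) -> Self {
        SeqCell { gen: AtomicU64::new(0), value: AtomicU64::new(v) }
    }

    fn write(&self, v: u64) {
        self.gen.fetch_add(1, Ordering::Release); // odd: write in progress
        self.value.store(v, Ordering::Relaxed);
        self.gen.fetch_add(1, Ordering::Release); // even: consistent again
    }

    fn read(&self) -> u64 {
        loop {
            let before = self.gen.load(Ordering::Acquire);
            let v = self.value.load(Ordering::Relaxed);
            let after = self.gen.load(Ordering::Acquire);
            if before == after && before % 2 == 0 {
                return v; // no writer interfered with this read
            }
        }
    }
}

fn main() {
    let cell = SeqCell::new(0);
    cell.write(42);
    assert_eq!(cell.read(), 42);
}
```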

Slide 28

Evaluation

Slide 29

Evaluation
• The overhead:
  • SDB vs. a GVL-based solution.
  • SDB vs. different profilers on a Rails application.
• Number of samples SDB can gather for a single request.

Slide 30

SDB vs. GVL Solution
• Setup:
  • On an AWS m5.4xlarge EC2 instance, using a simple script.
  • SDB vs. a GVL-based solution.
  • Different sampling intervals.
  • Measure execution-time impact.

Slide 31

SDB vs. GVL Solution
• sdb-gvl*: a signal-based solution.
  • A separate thread (no GVL) triggers signals.
  • A minimal signal handler scans with rb_profile_frames.
  • Stores method entry addresses in a ring buffer.
  • No symbolization, logging, or aggregation → minimal overhead.
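The ring buffer in this baseline can be sketched in Rust as a fixed-size, allocation-free buffer that overwrites the oldest entries (a simplified single-threaded illustration, not the actual sdb-gvl code):

```rust
// A fixed-size ring buffer of method-entry addresses: pushes never
// allocate, and the oldest entries are overwritten when full, keeping
// the signal handler's work bounded.
struct RingBuffer {
    buf: Vec<u64>,
    head: usize, // next slot to write
    len: usize,  // number of valid entries, capped at capacity
}

impl RingBuffer {
    fn new(capacity: usize) -> Self {
        RingBuffer { buf: vec![0; capacity], head: 0, len: 0 }
    }

    fn push(&mut self, addr: u64) {
        self.buf[self.head] = addr;
        self.head = (self.head + 1) % self.buf.len();
        self.len = (self.len + 1).min(self.buf.len());
    }
}

fn main() {
    let mut rb = RingBuffer::new(4);
    for addr in [0x10u64, 0x20, 0x30, 0x40, 0x50] {
        rb.push(addr);
    }
    assert_eq!(rb.len, 4);       // bounded at capacity
    assert_eq!(rb.buf[0], 0x50); // oldest entry (0x10) was overwritten
}
```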

Slide 32

No content

Slide 33

Evaluation on a Rails Application
• Target: Homeland (Ruby China forum), api/v3/topics API.
• Testbed: AWS m5.4xlarge, t2.nano, db.t4g.micro.
• Setup: production env, non-cluster mode, 2 worker threads.
• Sampling interval: 1 ms.
• Method:
  • Homeland without any stack profiler (1)
  • Homeland with one, such as SDB (2)
  • The overhead = (2) - (1)

Slide 34

Evaluation on a Rails Application
• SDB: version 09e0ffb; target threads: puma workers.
• Edited Vernier 1.0[15]: version 1.0; target threads: puma workers; the busy poll has been replaced by usleep(1000); no memory allocation.
• rbspy[4]: version 2e7fd57; target threads: all threads; mode: non-blocking.

Slide 35

No content

Slide 36

Samples of One Request
• Request duration: 27.99 ms
• Sample count: 27
• Method count: 564
• Methods with a measured duration: 237

Slide 37

Does the Effort Pay Off?
• SDB stack scanner: < 300 lines of Rust.
• SDB symbolizer: < 300 lines of C, 100 lines of Python.
• It requires an understanding of:
  • the Ruby GVL, GC, and ISeq
  • low-level concurrency
  • design
• Working without the GVL introduces additional effort.

Slide 38

Current State and Future Work
• SDB is an experimental project.
• Support more Ruby versions.
• Rewrite the symbolizer without eBPF.
• Integrate with more Ruby internal events.

Slide 39

Related Work

Slide 40

Related Works in Ruby
• People may use an instrumentation library as the default solution, such as the New Relic Ruby Agent[9], the Datadog Ruby client[10], or OpenTelemetry[11].
• Since instrumentation always has blind spots, people may use stack profilers (Vernier[12], rbspy[4], Pf2[13]) or dynamic instrumentation (such as Datadog Dynamic Instrumentation[14]) as a complement.
• Ruby stack profilers may choose a low sampling rate to reduce their impact due to the GVL.

Slide 41

Related Works
• Observability solutions aim to:
  • Reduce human effort, minimizing manual instrumentation and operations.
  • Collect more data for easier and more effective analysis.
  • Maintain low performance overhead and resource usage.

Many tools focus on service-level, end-to-end latency, as it is easier to achieve and provides the most benefits. Dapper[16] employs RPC libraries for tracing service latency, with similar tools including Jaeger[17], Pinpoint[18], and X-Trace[19]. DeepFlow[20] reduces manual effort by instrumenting kernel network functions via eBPF. However, these tools lack visibility inside services and rely on complementary tools like perf[21], stack profilers[22], and bcc[23]. Performance and cost remain challenges: Dapper, for instance, uses sampling, but inconsistent cross-service sampling makes full trace reconstruction difficult. Tail Sampling[24] and Consistent Probability Sampling[25] help mitigate this issue. Canopy[26] introduces flexible sampling policies to balance data collection and cost.

Slide 42

Related Works
In observability, more data reduces analysis effort. NSight[27] leverages Intel PT to capture all instructions, simplifying root-cause detection but generating massive data (up to 1 GB/s). SHIM[28] and LDB[8] use busy-polling with lower overhead, yet LDB still consumes one CPU per process. However, in performance debugging we do not need to track the latency of all functions, only those that dominate a request or transaction. Based on this, SDB is designed for Rails applications, using a default 1 ms sampling interval, which can detect functions with 2 ms or more of latency, sufficient for most Rails applications. For example, since a typical Rails request exceeds 20 ms, identifying functions contributing at least 10% of the request time is often enough. At this sampling rate, SDB requires minimal CPU usage.

Slide 43

Related Works
Many tools automate trace collection. Lprof[29] analyzes bytecode to aid runtime log-based flow construction. Domino[30] interposes JavaScript callback registration for event-chain tracking. Stitch[31] builds system stack graphs using log object identifiers. Minder[32] applies unsupervised learning and similarity-based checks for fault detection. SDB focuses on visibility inside Ruby applications, providing richer insights with minimal manual effort.

Slide 44

Conclusions
• A stack profiler can release the GVL, and doing so is beneficial.
• SDB provides a high sampling rate with low performance impact and CPU usage.
• SDB has the potential to be a default observability solution.

Slide 45

Thank You! SDB is available at github.com/yfractal/sdb

Slide 46

References
1. MIT 6.1810 2024 L22: Multi-Core Scalability and RCU. https://pdos.csail.mit.edu/6.1810/2024/lec/l-rcu.txt
2. Ruby Memory Model. https://docs.google.com/document/d/1pVzU8w_QF44YzUCCab990Q_WZOdhpKolCIHaiXG-sPw/edit?tab=t.0
3. Rust Atomics and Locks: Low-Level Concurrency in Practice.
4. rbspy: Sampling CPU profiler for Ruby. https://github.com/rbspy/rbspy
5. py-spy: Sampling profiler for Python programs. https://github.com/benfred/py-spy
6. async-profiler: Sampling CPU and HEAP profiler for Java. https://github.com/async-profiler/async-profiler
7. Evaluating the Accuracy of Java Profilers. PLDI, 2010.
8. LDB: An Efficient Latency Profiling Tool for Multithreaded Applications. NSDI, 2024.
9. New Relic RPM Ruby Agent. https://github.com/newrelic/newrelic-ruby-agent
10. Datadog Tracing Ruby Client. https://github.com/DataDog/dd-trace-rb
11. OpenTelemetry Ruby. https://github.com/open-telemetry/opentelemetry-ruby
12. Vernier: Next-generation CRuby profiler. https://github.com/jhawthorn/vernier
13. Pf2: A sampling-based profiler for Ruby. https://github.com/osyoyu/pf2
14. Datadog Dynamic Instrumentation. https://docs.datadoghq.com/dynamic_instrumentation/
15. Edited Vernier 1.0. https://github.com/yfractal/vernier/tree/v1.0.0-patch
16. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report, Google, Inc., 2010.

Slide 47

References
17. Jaeger: Open source, end-to-end distributed tracing, 2022. https://www.jaegertracing.io/
18. Pinpoint: Problem Determination in Large, Dynamic Internet Services. DSN, 2002.
19. X-Trace: A Pervasive Network Tracing Framework. NSDI, 2007.
20. Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. SIGCOMM, 2023.
21. perf: Linux profiling with performance counters, 2022. https://perf.wiki.kernel.org/index.php/Main_Page
22. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro, 2010.
23. BCC: Tools for BPF-based Linux IO analysis, networking, monitoring, and more. https://github.com/iovisor/bcc
24. Tail Sampling with OpenTelemetry: Why It's Useful, How to Do It, and What to Consider. https://opentelemetry.io/blog/2022/tail-sampling/
25. TraceState: Probability Sampling. https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling-experimental/
26. Canopy: An End-to-End Performance Tracing and Analysis System. SOSP, 2017.
27. How to Diagnose Nanosecond Network Latencies in Rich Endhost Stacks. NSDI, 2022.
28. Computer Performance Microscopy with SHIM. ISCA, 2015.
29. Lprof: A Non-Intrusive Request Flow Profiler for Distributed Systems. OSDI, 2014.
30. Domino: Understanding Wide-Area, Asynchronous Event Causality in Web Applications. SoCC, 2015.
31. Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle. OSDI, 2016.
32. Minder: Faulty Machine Detection for Large-scale Distributed Model Training. NSDI, 2025.

Slide 48

Synthetic Ruby as a Small Gift