SDB: Efficient Ruby Stack Scanning Without the GVL

Mike Yang @ 2025-04-16

Why a New Stack Profiler?

Instrumentation & Stack Profiler
• Instrumentation as the default solution:
  • Insert code around functions to measure latency.
  • Can’t cover all functions → blind spots.
• Enable a stack profiler when instrumentation doesn’t help.
Why not use a stack profiler in the first place?
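
As an illustration of "insert code around functions", the sketch below wraps one method with timing code using Module#prepend. The names LatencyInstrumentation and TopicsController are hypothetical and are not taken from any of the libraries cited in this talk; real agents are far more elaborate. Note that load_topics is never wrapped, so its latency is only visible as part of index: a blind spot.

```ruby
# Hypothetical helper: wraps chosen methods and records wall-clock latency.
module LatencyInstrumentation
  # Collected measurements: "Class#method" => array of durations in seconds.
  def self.timings
    @timings ||= Hash.new { |hash, key| hash[key] = [] }
  end

  # Wrap `method_names` on `klass` with timing code via Module#prepend.
  def self.instrument(klass, *method_names)
    wrapper = Module.new do
      method_names.each do |name|
        define_method(name) do |*args, **kwargs, &block|
          started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
          begin
            super(*args, **kwargs, &block)
          ensure
            elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
            LatencyInstrumentation.timings["#{klass}##{name}"] << elapsed
          end
        end
      end
    end
    klass.prepend(wrapper)
  end
end

# Made-up application code used only for this example.
class TopicsController
  def index
    load_topics
  end

  # Not instrumented: its cost is hidden inside `index`.
  def load_topics
    sleep 0.002
    [:topic]
  end
end

LatencyInstrumentation.instrument(TopicsController, :index)
TopicsController.new.index
puts LatencyInstrumentation.timings.keys.inspect
```

Every method that should be visible has to be listed explicitly, which is where the implementation and maintenance effort comes from.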

Why Not Use a Stack Profiler Only?
• Instrumentation
  • Implementation-dependent
  • Blind spots
  • Implementation effort
  • Maintenance effort
• Stack Profiler
  • Implementation-agnostic
  • Fewer blind spots
  • Less implementation effort
  • Less maintenance effort

Why Not Use a Stack Profiler Only?
• In practice, this doesn’t happen.
• Maybe because most stack profilers:
  • Are built for average latency
  • Have high CPU usage
  • Slow down Ruby applications due to the GVL

An Instrumentation Replacement
• Sampling interval: 10ms → 1ms (or less).
• CPU usage: less than 5% or 3%.
• No GVL when scanning.

The Design of a Stack Profiler

A General Ruby Stack Profiler

Analyzer
• Aggregates data.
• Converts data to other formats.
• Does the heavy work.
• Solution: perform lazy and offline analysis.
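
A minimal sketch of what lazy, offline aggregation could look like: the scanner only records raw stacks, and the counting happens later, outside the request path. The sample data and the function names aggregate and estimated_ms are made up for illustration and are not SDB's analyzer.

```ruby
# Raw samples as a scanner might record them: one array of frames per tick,
# innermost frame last. The method names are invented for this example.
SAMPLES = [
  %w[Rack#call TopicsController#index Topic.recent],
  %w[Rack#call TopicsController#index Topic.recent],
  %w[Rack#call TopicsController#index Serializer#to_json],
  %w[Rack#call]
].freeze

# Offline aggregation: count how many samples each method appears in.
# `uniq` keeps a recursive method from being counted twice per sample.
def aggregate(samples)
  totals = Hash.new(0)
  samples.each do |stack|
    stack.uniq.each { |frame| totals[frame] += 1 }
  end
  totals
end

# Convert sample counts into an estimated duration for a sampling interval.
def estimated_ms(totals, interval_ms)
  totals.transform_values { |count| count * interval_ms }
end

estimated_ms(aggregate(SAMPLES), 1).each do |name, ms|
  puts "#{name}: ~#{ms}ms"
end
```

Because nothing here runs while the application is being sampled, the cost of aggregation does not show up in the profiler's runtime overhead.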

Symbolizer
• Converts memory addresses into readable symbols.
• The cost is manageable if an address isn't converted repeatedly.
• Solution: cache the results.
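
The caching idea can be sketched as a small memoizing wrapper. CachingSymbolizer and its resolver block are hypothetical; the block stands in for whatever expensive lookup maps an address to a name, and the lookup table below is invented data.

```ruby
# Hypothetical symbolizer: `resolver` stands in for the expensive lookup
# that maps an address to a readable name.
class CachingSymbolizer
  attr_reader :lookups

  def initialize(&resolver)
    @resolver = resolver
    @lookups = 0      # how many times the expensive path actually ran
    @cache = {}
  end

  # Resolve each distinct address at most once.
  def symbolize(address)
    @cache.fetch(address) do
      @lookups += 1
      @cache[address] = @resolver.call(address)
    end
  end
end

# Invented address table used only for the demonstration.
table = { 0x7f00_1000 => "Topic.recent", 0x7f00_2000 => "TopicsController#index" }
symbolizer = CachingSymbolizer.new { |address| table.fetch(address, "unknown") }

names = [0x7f00_1000, 0x7f00_2000, 0x7f00_1000].map { |a| symbolizer.symbolize(a) }
puts names.inspect
puts "expensive lookups: #{symbolizer.lookups}"
```

A real symbolizer would also need to drop or refresh entries when the memory behind an address is reused; that is left out of this sketch.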

Stack Scanner
• Traverses and collects stack data regularly.
• A normal live stack can have 100 to 200 frames.
• Low impact at a low scanning rate, compared to the other components.
• But due to the GVL, it:
  • slows down applications
  • limits scanning frequency and accuracy

Scanning the Stack Without the GVL

The Stack Scanner
• The scanner reads frames while the Ruby VM updates them.
• The frame is a Ruby internal struct.
• The GVL seems to be necessary.

The GVL/Lock
• Without a lock, there is no data integrity guarantee.

The GVL/Lock
• Locks serialize critical sections.

The GVL/Lock
• A lock makes a group of instructions atomic.
• What if there is only a single instruction?

Single Field Reading
• For example: a counter* without a lock.
  • A reader may see an outdated value, but never a corrupted one.
• Aligned 64-bit memory reads and writes are atomic[1][2]**
  • A read or write either happens or it doesn't.
  • No intermediate state.
  • Guaranteed by hardware.
* It only uses aligned 64-bit memory; it is not an atomic counter (no CAS).
** Atomicity needs to be guaranteed by both the hardware and the compiler; some languages, such as Rust, may compile one 64-bit operation into 2 instructions[3].

The Stack Scanner
• Only needs one field: *iseq (address of the ISeq).
• The *iseq field is 64-bit aligned memory.
• It’s safe to read it without the GVL.

The Stack Lifecycle
• The stack is an array of frames, allocated when a thread is created and freed when it ends.
• While a thread is alive, its frames array is valid and safe to read.

The Stack Lifecycle
• Wrap code before and after a thread block.
• To synchronize the Ruby VM and the stack scanner.
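
One way to picture "wrap code before and after a thread block" is the sketch below. ThreadRegistry and ThreadLifecycleHook are hypothetical names, and this is an illustration of the idea in plain Ruby rather than SDB's actual hook. It assumes threads are created with Thread.new and a block; Thread.start and Thread.fork do not go through initialize and would need separate handling.

```ruby
# Hypothetical registry shared by application threads and the scanner.
module ThreadRegistry
  LOCK = Mutex.new
  LIVE = {}

  def self.register(thread)
    LOCK.synchronize { LIVE[thread] = true }
  end

  def self.unregister(thread)
    LOCK.synchronize { LIVE.delete(thread) }
  end

  # The scanner holds the lock while it walks the live threads, so a thread
  # cannot unregister (and then finish) in the middle of a scan.
  def self.each_live_thread(&block)
    LOCK.synchronize { LIVE.keys.each(&block) }
  end
end

# Wraps every Thread.new block with register/unregister calls.
module ThreadLifecycleHook
  def initialize(*args, &block)
    super(*args) do |*block_args|
      ThreadRegistry.register(Thread.current)
      begin
        block.call(*block_args)
      ensure
        ThreadRegistry.unregister(Thread.current)
      end
    end
  end
end

Thread.prepend(ThreadLifecycleHook)

worker = Thread.new { sleep 0.01 }
worker.join
ThreadRegistry.each_live_thread { |t| puts "still alive: #{t.inspect}" }
```

The lock is taken only at thread start and end, so it stays off the hot path of the application while still giving the scanner a consistent view of which stacks exist.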

Work with Ruby GC
• GC frees memory of inaccessible objects.
• The scanner only scans live threads.

Work with Ruby GC
• Memory compaction moves objects to different locations.
• The stack scanner only reads non-movable data:
  • the execution context and its stack
• For movable objects:
  • Option 1: read immutable data outside of GC
  • Option 2: insert fine-grained locks for them

A Stack Scanner Without the GVL
• Reading *iseq is atomic.
• Syncs thread state with the Ruby VM.
• Only reads fixed objects.
Result: SDB stack scanning is safe.

Data Races
• The Ruby VM updates the stack while SDB reads it.
• Not a major issue; remains unsolved.
• Similar tools:
  • rbspy[4] and py-spy[5] offer a non-blocking mode.
  • Java async-profiler[6] uses the non-atomic function AsyncGetCallTrace.

Data Races
• Actionability[7] > strict correctness.
• LDB[8] uses generation numbers for concurrency control.
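
For readers unfamiliar with generation numbers, the sketch below shows one common, seqlock-style way to use them: the writer bumps a counter before and after each update, and the reader discards any value whose read overlapped an update. This is an assumption about the general technique, not a description of how LDB[8] implements it, and VersionedSlot is a hypothetical name. Plain Ruby cannot express the memory ordering a native implementation would need, so the overlapping write is simulated with a block.

```ruby
# Hypothetical frame slot guarded by a generation number (seqlock-style).
class VersionedSlot
  def initialize(value)
    @generation = 0
    @value = value
  end

  # Writer: an odd generation means "update in progress".
  def write(value)
    @generation += 1
    yield if block_given?   # demo hook: runs in the middle of the update
    @value = value
    @generation += 1
  end

  # Reader: returns [value, true] when no write overlapped the read and
  # [value, false] when the sample should be thrown away.
  def read
    before = @generation
    value = @value
    after = @generation
    [value, before.even? && before == after]
  end
end

slot = VersionedSlot.new(:frame_a)
puts slot.read.inspect                        # quiet period

torn = nil
slot.write(:frame_b) { torn = slot.read }     # read lands mid-update
puts torn.inspect
puts slot.read.inspect                        # after the update
```

The reader never blocks the writer; it only pays for occasionally discarded samples, which fits the "actionability over strict correctness" trade-off above.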

Evaluation

Evaluation
• The overhead:
  • SDB vs GVL solution.
  • SDB vs different profilers on a Rails application.
• Number of samples SDB can gather for a single request.

SDB vs GVL Solution
• Setup:
  • On AWS m5.4xlarge EC2 using a simple script.
  • SDB vs a GVL-based solution.
  • Different sampling intervals.
  • Measure execution time impact.

SDB vs GVL Solution
• sdb-gvl*: a signal-based solution
  • A separate thread (no GVL) triggers signals.
  • A minimal signal handler scans with rb_profile_frames.
  • Stores method entry addresses in a ring buffer.
  • No symbolization, logging, or aggregation → minimal overhead.
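
The ring buffer mentioned above can be sketched as follows. AddressRingBuffer is a hypothetical name and this Ruby version only illustrates the data structure; the baseline described on this slide runs inside a signal handler, where the buffer would be preallocated native memory.

```ruby
# Hypothetical fixed-size ring buffer for sampled method-entry addresses.
class AddressRingBuffer
  def initialize(capacity)
    @slots = Array.new(capacity, 0)
    @capacity = capacity
    @written = 0   # total number of pushes so far
  end

  # Constant-time write into a preallocated slot: once the buffer is full,
  # the oldest entry is overwritten.
  def push(address)
    @slots[@written % @capacity] = address
    @written += 1
  end

  # Oldest-to-newest view of what is still in the buffer.
  def to_a
    count = [@written, @capacity].min
    start = @written - count
    Array.new(count) { |i| @slots[(start + i) % @capacity] }
  end
end

buffer = AddressRingBuffer.new(3)
[0x1000, 0x2000, 0x3000, 0x4000, 0x5000].each { |address| buffer.push(address) }
puts buffer.to_a.map { |address| format("0x%x", address) }.inspect
```

Writing into fixed slots keeps the per-sample work small and predictable, which is the point of deferring symbolization and aggregation.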

Evaluation on a Rails Application
Target: Homeland (RubyChina Forum), api/v3/topics API.
Testbed: AWS m5.4xlarge, t2.nano, db.t4g.micro.
Setup: production env, non-cluster mode, 2 worker threads
Sampling Interval: 1 ms
Method:
  (1) Homeland without any stack profilers
  (2) Homeland with one, such as SDB
  The overhead = (2) - (1)

Evaluation on a Rails Application
• SDB - version: 09e0ffb, target threads: puma workers.
• Edited Vernier 1.0[15] - version: 1.0, target threads: puma workers, the busy poll has been replaced by usleep(1000), no memory allocation.
• rbspy[4] - version: 2e7fd57, target threads: all threads, mode: non-blocking.

Samples of One Request
• Request duration: 27.99ms
• Sample count: 27
• Method count: 564
• Methods that have a duration: 237

Does the Effort Pay Off?
• SDB Stack Scanner: < 300 lines of Rust.
• SDB Symbolizer: < 300 lines of C, 100 lines of Python.
• It requires an understanding of:
  • The Ruby GVL, GC and ISeq
  • Low-level concurrency
  • Design
• Working without the GVL introduces additional effort.

Current State and Future Work
• SDB is an experimental project
• Support more Ruby versions
• Rewrite the symbolizer without eBPF
• Integrate with more Ruby internal events

Related Work

Related Works in Ruby
• People may use an instrumentation library as the default solution, such as NewRelic Ruby Agent[9], Datadog Ruby Client[10], or OpenTelemetry[11].
• As instrumentation always has blind spots, people may use stack profilers (Vernier[12], rbspy[4], Pf2[13]) or dynamic instrumentation (such as Datadog Dynamic Instrumentation[14]) as a complement.
• Ruby stack profilers may choose a low sampling rate to reduce impact due to the GVL.

Related Works
• Observability solutions aim to:
  • Reduce human effort by minimizing manual instrumentation and operations.
  • Collect more data for easier and more effective analysis.
  • Maintain low performance overhead and resource usage.
Many tools focus on service-level, end-to-end latency, as it is easier to achieve and provides the most benefits. Dapper[16] employs RPC libraries for tracing service latency, with similar tools including Jaeger[17], Pinpoint[18], and X-Trace[19]. DeepFlow[20] reduces manual effort by instrumenting kernel network functions via eBPF. However, these tools lack visibility inside services and rely on complementary tools like perf[21], stack profilers[22], and bcc[23]. Performance and cost remain challenges: Dapper, for instance, uses sampling, but inconsistent cross-service sampling makes full trace reconstruction difficult. Tail Sampling[24] and Consistent Probability Sampling[25] help mitigate this issue. Canopy[26] introduces flexible sampling policies to balance data collection and cost.

Related Works
In observability, more data reduces analysis effort. NSight[27] leverages Intel PT to capture all instructions, simplifying root cause detection but generating massive data (up to 1GB/s). SHIM[28] and LDB[8] use busy-polling with lower overhead, yet LDB still consumes one CPU per process. However, in performance debugging, we do not need to track the latency of all functions, only those that dominate a request or transaction. Based on this, SDB is designed for Rails applications, using a default 1ms sampling interval, which can detect functions with 2ms or more latency, sufficient for most Rails applications. For example, since a typical Rails request exceeds 20ms, identifying functions contributing at least 10% of the request time is often enough. At this sampling rate, SDB requires minimal CPU usage.
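
The arithmetic behind the 1ms choice can be checked with a short calculation. The helper names are hypothetical, and the model is a simplification that assumes evenly spaced samples and one uninterrupted call; it uses integer microseconds to avoid floating-point rounding.

```ruby
# Minimum number of evenly spaced samples guaranteed to land inside one
# uninterrupted call of the given length (integer division rounds down).
def guaranteed_samples(duration_us, interval_us)
  duration_us / interval_us
end

# Shortest call that accounts for `percent` of a request.
def dominant_threshold_us(request_us, percent)
  request_us * percent / 100
end

threshold = dominant_threshold_us(20_000, 10)   # 10% of a 20ms request
puts "threshold: #{threshold}us"
puts "guaranteed samples at 1ms:  #{guaranteed_samples(threshold, 1_000)}"
puts "guaranteed samples at 10ms: #{guaranteed_samples(threshold, 10_000)}"
```

Under these assumptions a 2ms call is always sampled at a 1ms interval, while at a 10ms interval no sample is guaranteed to land in it. The same rounding matches the 27 samples reported for the 27.99ms request in the evaluation.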

Related Works
Many tools automate trace collection. Lprof[29] analyzes bytecode to aid runtime log-based flow construction. Domino[30] interposes JavaScript callback registration for event chain tracking. Stitch[31] builds system stack graphs using log object identifiers. Minder[32] applies unsupervised learning and similarity-based checks for fault detection. SDB focuses on visibility inside Ruby applications, providing richer insights with minimal manual effort.

Conclusions
• A stack profiler can release the GVL, and doing so is beneficial.
• SDB provides a high sampling rate with low performance impact and CPU usage.
• SDB has the potential to be a default observability solution.

Thank You! SDB is available at github.com/yfractal/sdb

References
1. MIT 6.1810 2024 L22: Multi-Core scalability and RCU https://pdos.csail.mit.edu/6.1810/2024/lec/l-rcu.txt
2. Ruby Memory Model https://docs.google.com/document/d/1pVzU8w_QF44YzUCCab990Q_WZOdhpKolCIHaiXG-sPw/edit?tab=t.0
3. Rust Atomics and Locks: Low-Level Concurrency in Practice
4. rbspy: Sampling CPU profiler for Ruby https://github.com/rbspy/rbspy
5. py-spy: Sampling profiler for Python programs https://github.com/benfred/py-spy
6. async-profiler: Sampling CPU and HEAP profiler for Java https://github.com/async-profiler/async-profiler
7. Evaluating the accuracy of Java profilers. PLDI, 2010.
8. LDB: An Efficient Latency Profiling Tool for Multithreaded Applications. NSDI, 2024.
9. New Relic RPM Ruby Agent https://github.com/newrelic/newrelic-ruby-agent
10. Datadog Tracing Ruby Client https://github.com/DataDog/dd-trace-rb
11. OpenTelemetry Ruby https://github.com/open-telemetry/opentelemetry-ruby
12. Vernier: next generation CRuby profiler https://github.com/jhawthorn/vernier
13. Pf2: A sampling-based profiler for Ruby https://github.com/osyoyu/pf2
14. Datadog Dynamic Instrumentation https://docs.datadoghq.com/dynamic_instrumentation/
15. Edited Vernier 1.0 https://github.com/yfractal/vernier/tree/v1.0.0-patch
16. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, Inc, 2010.
17. Jaeger: open source, end-to-end distributed tracing, 2022. https://www.jaegertracing.io/
18. Pinpoint: problem determination in large, dynamic internet services. DSN, 2002.
19. X-Trace: A Pervasive Network Tracing Framework. NSDI, 2007.
20. Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. SIGCOMM, 2023.
21. perf: Linux profiling with performance counters, 2022. https://perf.wiki.kernel.org/index.php/Main_Page
22. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers. IEEE Micro, 2010.
23. BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more. https://github.com/iovisor/bcc
24. Tail Sampling with OpenTelemetry: Why it’s useful, how to do it, and what to consider https://opentelemetry.io/blog/2022/tail-sampling/
25. TraceState: Probability Sampling https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling-experimental/
26. Canopy: An End-to-End Performance Tracing And Analysis System. SOSP, 2017.
27. How to diagnose nanosecond network latencies in rich endhost stacks. NSDI, 2022.
28. Computer performance microscopy with shim. ISCA, 2015.
29. Lprof: A Non-Intrusive Request Flow Profiler for Distributed Systems. OSDI, 2014.
30. Domino: Understanding Wide-Area, Asynchronous Event Causality in Web Applications. SoCC, 2015.
31. Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle. OSDI, 2016.
32. Minder: Faulty Machine Detection for Large-scale Distributed Model Training. NSDI, 2025.

Synthetic Ruby as a Small Gift
