Slide 1

Slide 1 text

A ScyllaDB Community
DTrace at 21: Reflections on Fully-grown Software
Bryan Cantrill
CTO, Oxide Computer Company

Slide 2

Slide 2 text

OXIDE
DTrace in adulthood
• On September 3, 2003, we integrated DTrace into Solaris
• DTrace became open source in 2005, and has found its way into quite a few systems (e.g., Mac and Windows) – and has influenced many more
• DTrace is broadly done and firmly in adulthood – and we’ve been lucky enough to be using it more or less continuously for its entire life
• The benefit of hindsight allows us to reflect on the stuff we got right, the things we figured out – and the ways in which we got lucky

Slide 3

Slide 3 text

We got right: Focus on production systems
• From the outset, our focus for DTrace was on production systems: we knew that production systems have performance and other pathologies that cannot be readily reproduced elsewhere!
• To be acceptable for production, instrumentation has to be absolutely safe above all else: misuse cannot result in system failure!
• Facility has to always be available – can’t rely on recompiling anything, downloading binaries, downloading symbol tables, etc.
• These constraints led to deep integration with the operating system

Slide 4

Slide 4 text

We got right: Dynamic instrumentation
• We believed from the outset that instrumentation should be dynamic
• Part of this comes from the production constraint: it was essential that DTrace have zero probe effect when disabled
• We also felt strongly that we should be able to instrument software that had had no modifications to support it – and in arbitrary contexts
• And because we also wanted to replace several existing tools, we separated the methodologies of instrumentation from the framework that consumed them

Slide 5

Slide 5 text

We got right: Organizational approach
• We had been thinking about DTrace long before we started – and over the years, integrated the foundation that we knew we would need
• But it couldn’t be done as a side-project – we needed to focus
• Instead of attempting to timeline an entire project, we made the case in late 2001 to allow for two of us to focus for six months
• After six months we had the proof points to add a third engineer – and to allow us to remain focussed on it for several more years
• The team was always very small (three people!) and not in an office

Slide 6

Slide 6 text

We figured out: A domain specific language
• We knew we wanted to have expressive power in the actions taken on instrumentation, but something we figured out early was the need to have our own domain specific language
• We were heavily inspired in syntax by AWK, a little language that continues to have an outsized influence
• We dubbed our language “D” (and its intermediate form “DIF”), not knowing that Walter Bright was concurrently working on a language of the same name!
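The AWK influence is easiest to see in the shape of a program: a list of clauses, each pairing a pattern with an action that runs only when the pattern matches. The following is a minimal Python sketch of that shape, not the D language itself; `Clause`, `run`, and the event fields (`probe`, `execname`, `bytes`) are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical: one firing of a probe, e.g. {"probe": "read", "execname": "cat", "bytes": 10}
Event = dict


@dataclass
class Clause:
    probe: str                          # which probe this clause is attached to
    predicate: Callable[[Event], bool]  # AWK-style pattern: gate on the event's data
    action: Callable[[Event], None]     # what to do when the clause matches


def run(clauses, events):
    """Feed each event to every clause whose probe and predicate both match."""
    for ev in events:
        for clause in clauses:
            if clause.probe == ev["probe"] and clause.predicate(ev):
                clause.action(ev)


# An aggregation keyed by process name, updated by the action.
counts = Counter()


def count_by_execname(ev):
    counts[ev["execname"]] += 1


clauses = [Clause("read", lambda ev: ev["bytes"] > 0, count_by_execname)]

events = [
    {"probe": "read", "execname": "cat", "bytes": 10},
    {"probe": "read", "execname": "cat", "bytes": 0},    # predicate is false
    {"probe": "write", "execname": "cat", "bytes": 5},   # different probe
    {"probe": "read", "execname": "sh", "bytes": 3},
]
run(clauses, events)
```

After the run, `counts` holds one matching read each for `cat` and `sh`; the zero-byte read and the write are filtered out by the predicate and the probe name respectively.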

Slide 7

Slide 7 text

We figured out: A domain specific virtual machine
• We had assumed that we would transpile DIF to native instructions for execution, but it quickly became clear that executing DIF in a domain specific virtual machine in the kernel would be a huge win
• Not only did this allow us to move quickly by adding powerful concepts to DIF (e.g., thread-local variables), it allowed us to achieve the safety constraint with extensive run-time checks
• We assure completion of DIF by making it Turing incomplete – DIF has no backwards branches!
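To make the "no backwards branches" idea concrete, here is a minimal Python sketch of a toy register machine whose verifier rejects any jump that does not move strictly forward, so every accepted program must terminate. The instruction set, `verify`, `execute`, and the choice to make a divide by zero yield 0 are all assumptions of this sketch; none of it is DIF's actual encoding or fault behavior.

```python
NREGS = 4

# Toy instruction set (hypothetical, not DIF). Operand kinds per opcode:
# r = register index, v = constant value, t = jump target.
OPS = {
    "set": "rv",   # ("set", dst, value)  dst <- value
    "add": "rrr",  # ("add", dst, a, b)   dst <- a + b
    "div": "rrr",  # ("div", dst, a, b)   dst <- a // b, checked at run time
    "jz": "rt",    # ("jz", reg, target)  jump to target if reg == 0
    "ret": "r",    # ("ret", reg)         return the register's value
}


class VerifyError(Exception):
    pass


def verify(program):
    """Reject programs that could loop, run off the end, or touch bad registers."""
    if not program or program[-1][0] != "ret":
        raise VerifyError("program must end in ret")
    for pc, ins in enumerate(program):
        kinds = OPS.get(ins[0])
        if kinds is None or len(ins) != len(kinds) + 1:
            raise VerifyError(f"bad instruction at {pc}")
        for kind, operand in zip(kinds, ins[1:]):
            if kind == "r" and not (isinstance(operand, int) and 0 <= operand < NREGS):
                raise VerifyError(f"bad register at {pc}")
            # Forward-only branches: the program counter can only increase.
            if kind == "t" and not (isinstance(operand, int) and pc < operand < len(program)):
                raise VerifyError(f"backward or out-of-range branch at {pc}")


def execute(program):
    """Run a verified program; pc strictly increases until the final ret."""
    verify(program)
    regs = [0] * NREGS
    pc = 0
    while True:
        ins = program[pc]
        op = ins[0]
        if op == "set":
            regs[ins[1]] = ins[2]
        elif op == "add":
            regs[ins[1]] = regs[ins[2]] + regs[ins[3]]
        elif op == "div":
            divisor = regs[ins[3]]
            # Run-time check: this toy yields 0 instead of crashing the host.
            regs[ins[1]] = regs[ins[2]] // divisor if divisor != 0 else 0
        elif op == "jz":
            if regs[ins[1]] == 0:
                pc = ins[2]
                continue
        elif op == "ret":
            return regs[ins[1]]
        pc += 1


prog = [
    ("set", 0, 10),
    ("set", 1, 0),
    ("jz", 1, 4),       # r1 is 0, so skip the add
    ("add", 0, 0, 0),
    ("div", 2, 0, 1),   # divide by zero is caught by the run-time check
    ("ret", 0),
]
result = execute(prog)

looping = [("set", 0, 0), ("jz", 0, 0), ("ret", 0)]  # jumps back to index 0
try:
    verify(looping)
    rejected = False
except VerifyError:
    rejected = True
```

The design point is that termination is established once, at load time, by a simple structural rule, while hazards that depend on data (such as the divisor) are left to checks at run time.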

Slide 8

Slide 8 text

We got right: We used it ourselves
• We used DTrace heavily ourselves – and used it as early as possible in its development to find issues in the operating system and beyond
• This led to a more robust system – and one whose emphasis is utility
• Every feature in DTrace is a direct consequence of concrete need!
• This has led to many features that may seem arcane – but they are only arcane until you need them: thread-local variables, anonymous tracing, speculative tracing, postmortem tracing, arbitrary instruction tracing, etc.

Slide 9

Slide 9 text

We figured out: Statically-defined tracing
• While the origin of DTrace is dynamic (our early tagline was “Concise answers to arbitrary questions”), it is overwhelming to have to deal with the implementation of the system just to ask a question of it
• We saw the need for statically-defined tracing (SDT) at points of semantic interest (CPU scheduling, performing I/O, etc.)
• SDT probes coupled with type information via CTF and structure translators allowed for true interface stability, allowing users to instrument in terms of semantics and not implementation
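The idea behind structure translators can be sketched in a few lines of Python: consumers are written against a stable view, and a per-implementation translator maps whatever the internal structure currently looks like onto that view. Everything named here (`ProcV1`, `ProcV2`, `StableProc`, `translate`, `on_proc_exec`) is hypothetical and stands in for the general technique, not for DTrace's translator syntax or any real kernel structure.

```python
from dataclasses import dataclass


@dataclass
class ProcV1:
    """Hypothetical internal process structure in one OS release."""
    p_pid: int
    p_comm: str


@dataclass
class ProcV2:
    """Hypothetical: the same data after an internal refactor."""
    ident: dict  # {"pid": ..., "name": ...}


@dataclass(frozen=True)
class StableProc:
    """The stable view that instrumentation is written against."""
    pid: int
    execname: str


# One translator per implementation type; only these know the internals.
TRANSLATORS = {
    ProcV1: lambda p: StableProc(p.p_pid, p.p_comm),
    ProcV2: lambda p: StableProc(p.ident["pid"], p.ident["name"]),
}


def translate(raw):
    """Map an implementation-specific structure onto the stable view."""
    return TRANSLATORS[type(raw)](raw)


def on_proc_exec(raw):
    """A consumer at a point of semantic interest: uses only stable fields."""
    view = translate(raw)
    return f"{view.execname}[{view.pid}]"


old = on_proc_exec(ProcV1(p_pid=42, p_comm="cat"))
new = on_proc_exec(ProcV2(ident={"pid": 42, "name": "cat"}))
```

The consumer produces the same output against both layouts, which is the sense in which the user instruments semantics rather than implementation.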

Slide 10

Slide 10 text

We figured out: Application-level instrumentation
• When we set out, it was with a focus on kernel-level instrumentation
• Kernel-level instrumentation is necessary not just for kernel-level issues, but also to observe system-wide effects of application-level issues…
• …but also insufficient: we needed application-level instrumentation
• This is especially valuable with statically-defined tracing; user-level SDT (USDT) allows for programs themselves to be instrumented in a semantically stable fashion by their own users
• USDT is essential at Oxide! See https://github.com/oxidecomputer/usdt

Slide 11

Slide 11 text

We got right: Writing our own documentation
• In 2003, documentation was really the only way to learn how to use a system – especially one that was sophisticated and proprietary
• We made the deliberate decision to write all of our own documentation
• The DTrace documentation (https://illumos.org/books/dtrace) was all written by the three DTrace engineers: the documentation was authoritative, canonical – and we found many bugs in writing the docs!

Slide 12

Slide 12 text

We figured out: Writing a paper
• In 2003 (right after the integration!), we attended an academic conference (AADEBUG – RIP!) which inspired us to write a canonical academic paper on DTrace
• The resulting paper, Dynamic Instrumentation of Production Systems, was presented at the USENIX Annual Technical Conference in 2004
• We wrote a broader paper for practitioners in ACM Queue in 2006, Hidden in Plain Sight, that made the case for software observability
• There is tremendous value in rigorously describing your ideas!

Slide 13

Slide 13 text

We got lucky: Open source
• DTrace was born an entirely proprietary system, but there had been conversations internally about open sourcing the operating system as early as 1997 – and by 2003, there was urgency around it
• Open sourcing big, proprietary software is not easy – and if DTrace had remained proprietary, it would have died
• We got very lucky that Sun not only prioritized open sourcing the operating system, but led that with DTrace first (in January 2005)

Slide 14

Slide 14 text

We got lucky: Ports to other systems
• Porting DTrace is not easy – it is very tightly integrated with the operating system, and depends on many OS facilities
• Initially ported to FreeBSD by the late John Birrell – and then to MacOS, QNX, Linux, and Windows
• These ports required significant effort by veteran technologists!
• The different ports have taken different liberties (e.g., the DTrace port to Linux now uses eBPF as a backend), but all share the goal of dynamic instrumentation in production

Slide 15

Slide 15 text

We got lucky: DTrace endures
• We feel lucky to still be using DTrace every day – and especially to be using it on our thorniest problems!
• We feel lucky to be bringing new people into DTrace (older than DTrace itself – but not by much!), new languages (Rust!) and new systems
• We feel lucky that it’s open source, which granted DTrace eternal life
• We feel lucky to still be working together – join Adam Leventhal and me on our podcast Oxide and Friends!

Slide 16

Slide 16 text

Bryan Cantrill
[email protected]
@bcantrill{.bsky.social,.mastodon.social}
https://oxide.computer