Under the Hood: Inside the DTrace pid Provider

Adam Leventhal
Solaris Kernel Development

The pid provider
The DTrace instrumentation provider for tracing user-level instructions
Can instrument any running process at any* function entry or return, and at any offset
* As long as there are symbols, so don't strip that binary!

Previous trace tools
Some tools serialize (truss -u)
Others are lossy (DProbes)
Neither is acceptable in DTrace
  Serialization would mean that the pid provider could “scare away” the problems you're trying to observe
  Lossiness would lead to inconsistent data and invalid conclusions

The basic idea
Can't stop the process, and we can't allow a window of lossiness
Don't execute the traced instruction at its original address
Instead, execute it at some other, thread-specific address: displaced execution

Displaced execution (1/5)
How does displaced execution work?
Let's say we want to instrument the first instruction in this sequence:
    add %o1, 2, %o1
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]

Displaced execution (2/5)
Record the traced instruction in an in-kernel hash table keyed by address
    add %o1, 2, %o1
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
  Hash table entry: 0x104c -> add %o1, 2, %o1

Displaced execution (3/5)
Replace the traced instruction with a trapping instruction
    ta 0x38 ! trap
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
  Hash table entry: 0x104c -> add %o1, 2, %o1

Displaced execution (4/5)
When the trap is executed, copy the instruction to a per-thread address
    ta 0x38
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
  Hash table entry: 0x104c -> add %o1, 2, %o1
  Per-thread scratch:
    add %o1, 2, %o1

Displaced execution (5/5)
Arrange to return to the original instruction stream (more on that later)
    ta 0x38
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
  Hash table entry: 0x104c -> add %o1, 2, %o1
  Per-thread scratch:
    add %o1, 2, %o1
    <go to next instr>
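The five steps above can be sketched as a toy simulation. This is a minimal Python model under stated assumptions, not the kernel implementation: instructions are plain strings, the addresses after 0x104c are made up, and every name (`Tracer`, `instrument`, `on_trap`, the `"goto"` marker) is hypothetical.

```python
TRAP = "ta 0x38"  # trapping instruction shown on the slides
WIDTH = 4         # SPARC instructions are 4 bytes wide


class Tracer:
    """Toy model of displaced execution (all names are illustrative)."""

    def __init__(self, text):
        self.text = text        # address -> instruction (the process text)
        self.tracepoints = {}   # stands in for the in-kernel hash table
        self.scratch = {}       # thread id -> displaced instruction sequence
        self.fired = []         # (thread id, address) for each probe firing

    def instrument(self, addr):
        # Steps 2 and 3: record the original instruction, then install the trap.
        self.tracepoints[addr] = self.text[addr]
        self.text[addr] = TRAP

    def on_trap(self, tid, addr):
        # Step 4: fire the probe and copy the original into per-thread scratch.
        self.fired.append((tid, addr))
        # Step 5: the scratch sequence ends by resuming at the next instruction.
        self.scratch[tid] = [self.tracepoints[addr], ("goto", addr + WIDTH)]
        return self.scratch[tid]


text = {
    0x104c: "add %o1, 2, %o1",
    0x1050: "srl %o3, 8, %o4",   # assumed address
    0x1054: "st %o4, [%o0]",     # assumed address
    0x1058: "st %o3, [%o0 + 4]", # assumed address
}
tracer = Tracer(text)
tracer.instrument(0x104c)
displaced = tracer.on_trap(tid=1, addr=0x104c)
print(displaced)
```

Because the scratch sequence is stored per thread, two threads hitting the same tracepoint never share state, which is why neither serialization nor a lossy window is needed.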

Displaced execution caveat
Not all instructions can be displaced
Some depend on their location, e.g. relative branches, call and link
There are relatively few such instructions, so we emulate them in the kernel
Much simpler than emulating the entire instruction set

Implementation: SPARC
SPARC came first (2002)
Pros:
  Fixed-width instructions
  Easy to disassemble
  Delayed control transfers allow for a cool trick
  Only a handful of instructions to emulate
Cons:
  Turns out the cache architecture isn't designed with the pid provider in mind

Using the %npc (1/3)
On SPARC, the pid provider can use an analogue to instruction picking to implement displaced execution
Consider the following:
    ta 0x38            ! <- %pc
    srl %o3, 8, %o4    ! <- %npc
    st %o4, [%o0]
    st %o3, [%o0 + 4]

Using the %npc (2/3)
When the user-level thread hits the trap, %pc is set to the thread-specific region; %npc stays the same
    ta 0x38
    srl %o3, 8, %o4    ! <- %npc
    st %o4, [%o0]
    st %o3, [%o0 + 4]
  Thread-specific region:
    add %o1, 2, %o1    ! <- %pc

Using the %npc (3/3)
After the original instruction is executed, control flow continues as usual
    ta 0x38
    srl %o3, 8, %o4    ! <- %pc
    st %o4, [%o0]      ! <- %npc
    st %o3, [%o0 + 4]
  Thread-specific region:
    add %o1, 2, %o1
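The %pc/%npc trick can be traced through with a few lines of arithmetic. This is a minimal sketch assuming a non-branching displaced instruction; the scratch address and the function names `step` and `redirect_to_scratch` are hypothetical.

```python
WIDTH = 4  # SPARC instruction width in bytes


def step(pc, npc):
    """Execute one non-branching instruction: %pc <- %npc, %npc <- %npc + 4."""
    return npc, npc + WIDTH


def redirect_to_scratch(pc, npc, scratch_addr):
    """On the trap, move %pc into the scratch region and leave %npc alone."""
    return scratch_addr, npc


TRACED = 0x104c        # address of the trap (from the slides)
SCRATCH = 0xff3e0000   # hypothetical per-thread scratch address

pc, npc = TRACED, TRACED + WIDTH                  # thread hits the trap
pc, npc = redirect_to_scratch(pc, npc, SCRATCH)   # kernel handles the trap
pc, npc = step(pc, npc)                           # displaced add executes
print(hex(pc), hex(npc))
```

After the single step, %pc is back in the original stream at the instruction following the tracepoint, so no explicit jump is needed in the scratch region.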

Cache trouble
Need to synchronize the split D$/I$ when the kernel copies the instruction
Normally do this with flush, but there's no flusha
Use a block store that ensures cache synchrony (ASI_BLK_COMMIT_S)
Requires the thread-specific scratch space to be 64 bytes and 64-byte aligned

Implementation: x86
Pros:
  Fast, cheap, our future, etc.
  Simple cache architecture (no SPARC-like cache synchronization problems)
Cons:
  Variable-width instructions
  Huge instruction set
  Difficult to disassemble

x86 instruction sequence
Copy out the original instruction and jump to the subsequent instruction
    int $0x3
    leal (,%ecx,4), %edx
    movl %edx, (%ebx)
    movl %ecx, 0x4(%ebx)
  Scratch region:
    add $0x2, %eax        # <- %eip
    jmp <next instr>
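Building that scratch sequence comes down to appending a relative jump whose displacement is computed from the scratch location. The sketch below is an illustration, not the kernel code: `build_scratch` and both addresses are hypothetical, and it assumes the target is within a signed 32-bit displacement of the scratch region.

```python
import struct


def build_scratch(orig_bytes, orig_addr, scratch_addr):
    """Return the displaced instruction followed by a jmp to the next one."""
    next_instr = orig_addr + len(orig_bytes)
    jmp_addr = scratch_addr + len(orig_bytes)
    # jmp rel32 is opcode 0xE9 plus a signed 32-bit displacement measured
    # from the end of the 5-byte jmp itself.
    rel32 = next_instr - (jmp_addr + 5)
    return orig_bytes + b"\xe9" + struct.pack("<i", rel32)


ADD_EAX_2 = bytes([0x83, 0xC0, 0x02])   # add $0x2, %eax
ORIG_ADDR = 0x08050a10                  # hypothetical tracepoint address
SCRATCH_ADDR = 0x08047f00               # hypothetical scratch address

seq = build_scratch(ADD_EAX_2, ORIG_ADDR, SCRATCH_ADDR)
print(seq.hex())
```

Variable-width instructions are why the jump is needed here: unlike SPARC, there is no %npc to carry the thread back, so the scratch sequence has to say where the next instruction lives.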

Implementation: amd64
Like x86, but with one big exception: %rip-relative addressing
Recall we need to emulate position-dependent instructions
With %rip-relative addressing, almost every instruction can potentially be position-dependent

%rip-relative addressing
Say the original instruction is
    movq 0xed8(%rip), %rcx
Move the address of the original instruction into %rax
Modify the instruction to be %rax-relative
Copy and execute
    movq 0xed8(%rax), %rcx
    movq $<old_rax>, %rax
    jmp 0(%rip)
    <address of next instr>
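The rewrite works because the operand address depends only on a base plus the displacement. The sketch below checks that arithmetic; it is an illustration with hypothetical addresses and function names. One assumption is worth stating: the hardware measures a %rip-relative displacement from the end of the instruction, so the sketch loads the address just past the original instruction into %rax.

```python
def rip_relative_target(instr_addr, instr_len, disp):
    """Address referenced by a %rip-relative operand at a given location."""
    # %rip-relative displacements are measured from the next instruction.
    return instr_addr + instr_len + disp


def rax_relative_target(rax_value, disp):
    """Address referenced once the operand is rewritten to be %rax-relative."""
    return rax_value + disp


ORIG_ADDR = 0x400b20       # hypothetical address of movq 0xed8(%rip), %rcx
SCRATCH_ADDR = 0x7fff0000  # hypothetical scratch address
INSTR_LEN = 7              # REX.W prefix + opcode + ModRM + 4-byte displacement
DISP = 0xed8

wanted = rip_relative_target(ORIG_ADDR, INSTR_LEN, DISP)
naive = rip_relative_target(SCRATCH_ADDR, INSTR_LEN, DISP)   # unmodified copy
fixed = rax_relative_target(ORIG_ADDR + INSTR_LEN, DISP)     # rewritten copy
print(hex(wanted), hex(naive), hex(fixed))
```

Copying the instruction unmodified would read from an address near the scratch region, while the %rax-relative form reads the same location as the original.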

Problem with signals
Having thread-specific scratch space seems sufficient: a thread can only be in one place at a time
However, signals can come at any time, even while a thread is executing instructions out of the scratch region
We need special handling for synchronous and asynchronous signals

Synchronous signals
A synchronous signal is due to an error when executing an instruction, e.g. divide-by-zero, touching unmapped memory
If the user-level thread is in its scratch region, reset the PC to the address of the traced instruction
When the signal handler returns, it will hit the tracepoint again
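The PC reset for synchronous signals can be sketched as a simple range check. This is a hedged illustration: `pc_for_sync_signal`, its parameters, and the addresses are hypothetical, and the real scratch layout is ISA-specific.

```python
def pc_for_sync_signal(pc, scratch_base, scratch_size, traced_addr):
    """Return the PC to report when a synchronous signal is delivered."""
    if scratch_base <= pc < scratch_base + scratch_size:
        # The fault happened on the displaced copy, so report the address
        # of the traced instruction instead of the scratch address.
        return traced_addr
    return pc


SCRATCH_BASE = 0xff3e0000   # hypothetical per-thread scratch address
SCRATCH_SIZE = 64
TRACED = 0x104c

print(hex(pc_for_sync_signal(SCRATCH_BASE, SCRATCH_BASE, SCRATCH_SIZE, TRACED)))
print(hex(pc_for_sync_signal(0x2000, SCRATCH_BASE, SCRATCH_SIZE, TRACED)))
```

With the PC rewritten, the signal handler sees the fault at the address the program actually contains, and returning from the handler simply re-executes the tracepoint.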

Asynchronous signals
An asynchronous signal is caused by a software call, e.g. a timer expiring, kill(2), pkill(1)
If the user-level thread is in its scratch region, we need to defer delivery
Need to alert the kernel to deliver the signal before leaving the scratch region
Implementation is ISA-specific

Asynchronous signals SPARC (1/2)
If the thread is in the scratch region and it gets an asynchronous signal...
    ta 0x38
    srl %o3, 8, %o4    ! <- %npc
  Scratch region:
    add %o1, 2, %o1    ! <- %pc
    ...
    ta 0x3a

Asynchronous signals SPARC (2/2)
... point %npc to a trap in the scratch region to re-enter the kernel later
    ta 0x38
    srl %o3, 8, %o4
  Scratch region:
    add %o1, 2, %o1    ! <- %pc
    ...
    ta 0x3a            ! <- %npc

Asynchronous signals x86/amd64 (1/2)
Actually copy an alternate instruction sequence ending in a trap
    int $0x3
    leal (,%ecx,4), %edx
  First instruction sequence:
    add $0x2, %eax        # <- %eip
    jmp <next instr>
  Second instruction sequence:
    add $0x2, %eax
    int $0x7f
NB: %eip can be anywhere in the first instruction sequence

Asynchronous signals x86/amd64 (2/2)
Move %eip to the relative location in the second instruction sequence
    int $0x3
    leal (,%ecx,4), %edx
  First instruction sequence:
    add $0x2, %eax
    jmp <next instr>
  Second instruction sequence:
    add $0x2, %eax        # <- %eip
    int $0x7f
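Moving %eip between the two sequences is an offset-preserving translation, since both start with the same copy of the original instruction. A minimal sketch, with hypothetical names and addresses:

```python
def relocate_eip(eip, first_base, first_len, second_base):
    """Map %eip from the first scratch sequence onto the second one."""
    if not (first_base <= eip < first_base + first_len):
        return eip                      # not in the first sequence: leave it
    return second_base + (eip - first_base)


FIRST = 0x08047f00    # hypothetical: add $0x2, %eax (3 bytes) ; jmp (5 bytes)
SECOND = 0x08047f10   # hypothetical: add $0x2, %eax (3 bytes) ; int $0x7f
FIRST_LEN = 8

# Signal arrives before the displaced add has run.
print(hex(relocate_eip(FIRST, FIRST, FIRST_LEN, SECOND)))
# Signal arrives after the add, with the jmp about to execute.
print(hex(relocate_eip(FIRST + 3, FIRST, FIRST_LEN, SECOND)))
```

In the second case the relocated %eip lands on the trap rather than the jmp, so the thread re-enters the kernel, which can then deliver the deferred signal.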

Instrumenting a function
The DTrace library disassembles traced functions and creates probes at the desired places
This is dicey!
Overwriting data with a trap will bring the process to a grinding halt, if you're lucky
Need to be very conservative about what we let users trace

Disassembly pitfalls
Compilers sometimes put data inside functions' symbol bounds
A favorite type of data is offsets for jump tables: indirect branches, usually the result of a switch statement

Disassembly pitfalls, cont.
On SPARC, we're lucky: jump table offsets are always illegal instructions
  Words in the range [-0x8000 .. 0x3fffff] are illegal instructions
On x86, interpreting one offset as an instruction invalidates all subsequent disassembly
  DTrace gives up at the first sign of jump tables
  Stay tuned for a better answer...
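The SPARC check amounts to a signed range test on each 32-bit word. The sketch below applies the range quoted on the slide and nothing more; the function names are hypothetical and the sample words are illustrative.

```python
def to_signed32(word):
    """Interpret a 32-bit word as a two's-complement signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def looks_like_jump_table_offset(word):
    """True if the word falls in the range the slide calls illegal."""
    return -0x8000 <= to_signed32(word) <= 0x3fffff


ADD_O1_2_O1 = 0x92026002   # encoding of add %o1, 2, %o1

for word in (0x00000040, 0xffffff80, ADD_O1_2_O1):
    print(hex(word), looks_like_jump_table_offset(word))
```

A disassembler walking a function can skip any word that passes this test, knowing it cannot be a valid instruction to place a probe on.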

Early tracing
This thread-specific scratch region is implemented as a member of libc's per-thread ulwp_t structure
Not there until libc is loaded
The runtime linker has some special communication with the kernel so that the thread register is set up by the time ld.so.1 executes its first instruction

Performance considerations
Installing that trap is not cheap: mapin, fault, mapout
What was a single instruction becomes multiple context switches, hundreds of instructions, and cache misses
We try to optimize the common cases: entry and return probes
Emulation is faster than displaced execution

Optimizing SPARC
Functions usually start with a save or a sethi; end with a restore
We emulate sethi in the kernel
Partially emulate save and restore
  Stash helpful instructions in the thread structure
  Set %pc to one of those instructions
  This saves us the E$ miss
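Emulating sethi is cheap because the instruction only decodes two fields and writes one register. This is a sketch of that decode step, not the kernel's code: `emulate_sethi` and the list-of-registers model are hypothetical.

```python
def emulate_sethi(word, regs):
    """Apply a SPARC sethi instruction word to a 32-entry register list."""
    op = word >> 30
    op2 = (word >> 22) & 0x7
    if op != 0 or op2 != 0x4:
        raise ValueError("not a sethi instruction")
    rd = (word >> 25) & 0x1f       # destination register number
    imm22 = word & 0x3fffff        # 22-bit immediate
    if rd != 0:                    # %g0 always reads as zero
        regs[rd] = imm22 << 10     # immediate lands in bits 31..10
    return regs


O0 = 8                                            # %o0 is register 8
word = (O0 << 25) | (0x4 << 22) | (0x12345400 >> 10)
regs = emulate_sethi(word, [0] * 32)
print(hex(word), hex(regs[O0]))
```

No memory access or scratch copy is involved, which is the point of emulating it rather than displacing it.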

Optimizing x86/amd64
Functions start with a pushl %ebp; end with a ret
Emulate both in the kernel: easy
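Both instructions only touch the stack pointer, one stack slot, and %eip, which is why emulating them is easy. A minimal 32-bit sketch, assuming a dictionary-based register and memory model; all names are hypothetical.

```python
MASK32 = 0xffffffff


def emulate_pushl_ebp(regs, mem):
    """pushl %ebp: decrement %esp, store %ebp, step past the 1-byte opcode."""
    regs["esp"] = (regs["esp"] - 4) & MASK32
    mem[regs["esp"]] = regs["ebp"]
    regs["eip"] = (regs["eip"] + 1) & MASK32


def emulate_ret(regs, mem):
    """ret: pop the return address into %eip."""
    regs["eip"] = mem[regs["esp"]]
    regs["esp"] = (regs["esp"] + 4) & MASK32


# Hypothetical state at function entry: return address on top of the stack.
regs = {"eip": 0x08050a10, "esp": 0x08047ff0, "ebp": 0x08047ff8}
mem = {0x08047ff0: 0x08050b44}

emulate_pushl_ebp(regs, mem)
after_push = dict(regs)

regs["esp"] += 4            # stand-in for the function body popping %ebp
emulate_ret(regs, mem)
print(hex(after_push["esp"]), hex(regs["eip"]), hex(regs["esp"]))
```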

Beyond the pid provider
The pid provider offers extensive coverage
As with fbt for the kernel, it can be hard to grok all those probes
For the kernel we invented statically defined tracing providers (SDT)
For user-land we have the equivalent

User-land statically defined tracing (USDT)
Developers can embed stable, well-documented probes in applications
Users, sys-admins, and service engineers can plug into those probes to understand what the application is doing and how it relates to the system
Instrumentation uses the same techniques as the pid provider

Adding a USDT provider
Add a call to DTRACE_PROBE*()
Put a provider description in a .d file (a double underscore in the declaration becomes a hyphen in the probe name):
    provider oracle {
        probe transaction__start(id_t id);
        probe transaction__finish(id_t id);
    };
Run dtrace -G with object files and provider description
Link resulting object file into the binary

USDT providers
The plockstat provider offers probes in libc's synchronization primitives
Active work on a hotspot provider for the JVM
Work ongoing to make USDT more robust and to add providers to more applications from Sun and ISVs

USDT providers, cont.
Lots of possibilities for USDT: nscd(1M), fmd(1M), daemons, complex libraries
Try it out: it's very easy
Send some mail to the interest list on opensolaris.org
Ask a question on the DTrace forum

Adam Leventhal
http://blogs.sun.com/ahl
