Under the Hood: Inside the DTrace pid Provider

by Adam Leventhal

Slide 1

Slide 1 text

Under the Hood: Inside the DTrace pid Provider Adam Leventhal Solaris Kernel Development

Slide 2

Slide 2 text

The pid provider
The DTrace instrumentation provider for tracing user-level instructions
Can instrument any running process at any* function entry or return, and at any offset
* As long as there are symbols – don't strip that binary!

Slide 3

Slide 3 text

Previous trace tools
Some tools serialize (truss -u)
Others are lossy (DProbes)
Neither is acceptable in DTrace
Serialization would mean that the pid provider could “scare away” the problems you're trying to observe
Lossiness would lead to inconsistent data and invalid conclusions

Slide 4

Slide 4 text

The basic idea
Can't stop the process and we can't allow a window of lossiness
Don't execute the traced instruction at its original address
Instead, execute it at some other, thread-specific address: displaced execution

Slide 5

Slide 5 text

Displaced execution (1/5)
How does displaced execution work?
Let's say we want to instrument the first instruction in this sequence:
    add %o1, 2, %o1
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]

Slide 6

Slide 6 text

Displaced execution (2/5)
Record the traced instruction in an in-kernel hash table keyed by address
    add %o1, 2, %o1
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
Hash table entry:
    0x104c -> add %o1, 2, %o1

Slide 7

Slide 7 text

Displaced execution (3/5)
Replace the traced instruction with a trapping instruction.
    ta 0x38 ! trap
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
Hash table entry:
    0x104c -> add %o1, 2, %o1

Slide 8

Slide 8 text

Displaced execution (4/5)
When the trap is executed, copy the instruction to a per-thread address
    ta 0x38
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
Hash table entry:
    0x104c -> add %o1, 2, %o1
Per-thread address:
    add %o1, 2, %o1

Slide 9

Slide 9 text

Displaced execution (5/5)
Arrange to return to the original instruction stream (more on that later)
    ta 0x38
    srl %o3, 8, %o4
    st %o4, [%o0]
    st %o3, [%o0 + 4]
Hash table entry:
    0x104c -> add %o1, 2, %o1
Per-thread address:
    add %o1, 2, %o1
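The five steps above can be modelled in a few lines of Python. This is only a toy sketch, not the kernel implementation: instructions are plain strings, and the names `instrument`, `on_trap`, `run`, `tracepoints`, and `scratch` are hypothetical. The address 0x104c comes from the slides; the assumption that the following instructions sit at 4-byte steps after it is mine.

```python
# Toy model of displaced execution. Instructions are strings, and all
# names here are illustrative, not taken from the DTrace source.

TRAP = "ta 0x38"

def instrument(text, addr, tracepoints):
    """Record the original instruction keyed by address, then plant a trap."""
    tracepoints[addr] = text[addr]
    text[addr] = TRAP

def on_trap(addr, tracepoints, scratch, thread_id, fired):
    """Fire the probe and copy the original instruction into this
    thread's scratch slot. Returns the instruction to execute there."""
    fired.append(addr)
    scratch[thread_id] = tracepoints[addr]
    return scratch[thread_id]

def run(text, start, count, tracepoints, scratch, thread_id, fired):
    """Execute `count` instructions; return the ones actually executed."""
    executed = []
    pc = start
    for _ in range(count):
        insn = text[pc]
        if insn == TRAP:
            insn = on_trap(pc, tracepoints, scratch, thread_id, fired)
        executed.append(insn)
        pc += 4  # fixed-width instructions, as on SPARC
    return executed

text = {
    0x104c: "add %o1, 2, %o1",
    0x1050: "srl %o3, 8, %o4",
    0x1054: "st %o4, [%o0]",
    0x1058: "st %o3, [%o0 + 4]",
}
original = list(text.values())
tracepoints, scratch, fired = {}, {}, []
instrument(text, 0x104c, tracepoints)
executed = run(text, 0x104c, 4, tracepoints, scratch, thread_id=1, fired=fired)
```

The thread still executes the same four instructions in the same order; the only visible differences are that the probe fired once and that the `add` ran from the thread's own scratch slot.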

Slide 10

Slide 10 text

Displaced execution caveat
Not all instructions can be displaced
Some depend on their location
  e.g. relative branches, call and link
There are relatively few such instructions, so we emulate them in the kernel
Much simpler than emulating the entire instruction set

Slide 11

Slide 11 text

Implementation: SPARC
SPARC came first (2002)
Pros:
  Fixed-width instructions
  Easy to disassemble
  Delayed control transfers allow for a cool trick
  Only a handful of instructions to emulate
Cons:
  Turns out the cache architecture isn't designed with the pid provider in mind

Slide 12

Slide 12 text

Using the %npc (1/3)
On SPARC, the pid provider can use an analogue to instruction picking to implement displaced execution
Consider the following:
    ta 0x38              <- %pc
    srl %o3, 8, %o4      <- %npc
    st %o4, [%o0]
    st %o3, [%o0 + 4]

Slide 13

Slide 13 text

Using the %npc (2/3)
When the user-level thread hits the trap, %pc is set to the thread-specific region; %npc stays the same
    ta 0x38
    srl %o3, 8, %o4      <- %npc
    st %o4, [%o0]
    st %o3, [%o0 + 4]
Thread-specific region:
    add %o1, 2, %o1      <- %pc

Slide 14

Slide 14 text

Using the %npc (3/3)
After the original instruction is executed, control flow continues as usual
    ta 0x38
    srl %o3, 8, %o4      <- %pc
    st %o4, [%o0]        <- %npc
    st %o3, [%o0 + 4]
Thread-specific region:
    add %o1, 2, %o1
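The %pc/%npc trick can be sketched as a tiny state machine. This is a toy model under stated assumptions: only non-branching instructions are handled, the function names are hypothetical, and the scratch address is made up. The rule it relies on is standard SPARC behaviour: after an ordinary instruction, %pc takes the old %npc and %npc advances by 4.

```python
def step(pc, npc):
    """Advance past one non-branching instruction: %pc takes %npc,
    and %npc moves on by one 4-byte instruction."""
    return npc, npc + 4

def redirect_to_scratch(npc, scratch_addr):
    """What the trap handler arranges: %pc points at the scratch copy,
    %npc is left pointing at the instruction after the tracepoint."""
    return scratch_addr, npc

TRACEPOINT = 0x104c
SCRATCH = 0xff3e0a40  # hypothetical per-thread scratch address

pc, npc = TRACEPOINT, TRACEPOINT + 4         # thread is about to hit the trap
pc, npc = redirect_to_scratch(npc, SCRATCH)  # now executing the displaced add
after_pc, after_npc = step(pc, npc)          # back in the original stream
```

No explicit jump back is needed: once the displaced instruction completes, the hardware's own %npc handling returns the thread to the instruction after the tracepoint.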

Slide 15

Slide 15 text

Cache trouble
Need to synchronize the split D$/I$ when the kernel copies the instruction
Do this with flush normally, but there's no flusha
Use a block store that ensures cache synchrony (ASI_BLK_COMMIT_S)
Requires that the thread-specific scratch space be 64 bytes long and 64-byte aligned
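The size and alignment requirement amounts to simple address arithmetic. The sketch below only illustrates that arithmetic; the helper names and the sample address are hypothetical, and nothing here touches a real cache.

```python
BLOCK = 64  # block size in bytes, per the slide

def align_up(addr, size=BLOCK):
    """Round addr up to the next multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)

def usable_for_block_store(addr, length):
    """True if [addr, addr + length) is exactly one aligned 64-byte block."""
    return length == BLOCK and addr % BLOCK == 0

raw = 0xff3e0a14          # hypothetical unaligned address
scratch = align_up(raw)   # first address at or after raw that is 64-byte aligned
```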

Slide 16

Slide 16 text

Implementation: x86
Pros:
  Fast, cheap, our future, etc.
  Simple cache architecture (no SPARC-like cache synchronization problems)
Cons:
  Variable-width instructions
  Huge instruction set
  Difficult to disassemble

Slide 17

Slide 17 text

x86 instruction sequence
Copy out the original instruction and jump to the subsequent instruction
    int $0x3
    leal (,%ecx,4), %edx
    movl %edx, (%ebx)
    movl %ecx, 0x4(%ebx)
Thread-specific scratch space:
    add $0x2, %eax       <- %eip
    jmp

Slide 18

Slide 18 text

Implementation: amd64
Like x86, but with one big exception: %rip-relative addressing
Recall we need to emulate position-dependent instructions
With %rip-relative addressing, almost every instruction can potentially be position-dependent

Slide 19

Slide 19 text

%rip-relative addressing
Say the original instruction is
    movq 0xed8(%rip), %rcx
Move the address of the original instruction into %rax
Modify the instruction to be %rax-relative
Copy and execute
    movq $, %rax
    movq 0xed8(%rax), %rcx
    jmp 0(%rip)
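A little address arithmetic shows why the rewrite is needed and why it works. This is a sketch under assumptions: the addresses are made up, the function names are hypothetical, and the value loaded into %rax is taken to be what %rip would have held at the original location (on amd64, %rip-relative operands are computed from the address of the following instruction).

```python
MASK64 = (1 << 64) - 1

def rip_relative_target(next_insn_addr, disp):
    """Effective address of a %rip-relative operand."""
    return (next_insn_addr + disp) & MASK64

def base_relative_target(base_value, disp):
    """Effective address of disp(%reg) when %reg holds base_value."""
    return (base_value + disp) & MASK64

ORIG_ADDR = 0x401000           # hypothetical address of the traced movq
INSN_LEN = 7                   # encoded length of movq 0xed8(%rip), %rcx
SCRATCH_ADDR = 0x7fff00001000  # hypothetical scratch address

# What the program meant to load from.
intended = rip_relative_target(ORIG_ADDR + INSN_LEN, 0xed8)
# What an unmodified copy in the scratch space would load from.
naive = rip_relative_target(SCRATCH_ADDR + INSN_LEN, 0xed8)
# What the %rax-relative rewrite loads from.
rewritten = base_relative_target(ORIG_ADDR + INSN_LEN, 0xed8)
```

Copying the instruction unchanged would silently read from the wrong place, while the register-relative form reaches the intended address from anywhere.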

Slide 20

Slide 20 text

Problem with signals
Having thread-specific scratch space seems sufficient – a thread can only be in one place at a time
However, signals can come at any time, even while a thread is executing instructions out of the scratch region
We need special handling for synchronous and asynchronous signals

Slide 21

Slide 21 text

Synchronous signals
A synchronous signal is due to an error when executing an instruction
  e.g. divide-by-zero, touching unmapped memory
If the user-level thread is in its scratch region, reset the PC to the address of the traced instruction
When the signal handler returns, it will hit the tracepoint again
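The PC fix-up for synchronous signals is a range check. A minimal sketch, assuming the scratch region is a contiguous range of addresses; the function name, parameters, and sample addresses are hypothetical.

```python
def pc_for_sync_signal(pc, scratch_base, scratch_size, tracepoint_addr):
    """PC to report when a synchronous signal is delivered.

    If the faulting PC lies inside the thread's scratch region, report
    the address of the traced instruction instead, so that returning
    from the handler re-executes the tracepoint.
    """
    if scratch_base <= pc < scratch_base + scratch_size:
        return tracepoint_addr
    return pc

SCRATCH_BASE = 0xff3e0a40  # hypothetical scratch region
SCRATCH_SIZE = 64
TRACEPOINT = 0x104c

in_scratch = pc_for_sync_signal(SCRATCH_BASE + 4, SCRATCH_BASE, SCRATCH_SIZE, TRACEPOINT)
elsewhere = pc_for_sync_signal(0x2000, SCRATCH_BASE, SCRATCH_SIZE, TRACEPOINT)
```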

Slide 22

Slide 22 text

Asynchronous signals
An asynchronous signal is caused by a software call
  e.g. a timer expiring, kill(2), pkill(1)
If the user-level thread is in its scratch region, we need to defer delivery
Need to alert the kernel to deliver the signal before leaving the scratch region
Implementation is ISA-specific

Slide 23

Slide 23 text

Asynchronous signals SPARC (1/2)
If the thread is in the scratch region and it gets an asynchronous signal...
    ta 0x38
    srl %o3, 8, %o4      <- %npc
Scratch region:
    add %o1, 2, %o1      <- %pc
    ...
    ta 0x3a

Slide 24

Slide 24 text

Asynchronous signals SPARC (2/2)
... point %npc to a trap in the scratch region to re-enter the kernel later
    ta 0x38
    srl %o3, 8, %o4
Scratch region:
    add %o1, 2, %o1      <- %pc
    ...
    ta 0x3a              <- %npc

Slide 25

Slide 25 text

Asynchronous signals x86/amd64 (1/2)
Actually copy an alternate instruction sequence ending in a trap
    int $0x3
    leal (,%ecx,4), %edx
First instruction sequence (%eip is somewhere in here):
    add $0x2, %eax
    jmp
Second instruction sequence:
    add $0x2, %eax
    int $0x7f
NB: %eip can be anywhere in the first instruction sequence

Slide 26

Slide 26 text

Asynchronous signals x86/amd64 (2/2)
Move %eip to the relative location in the second instruction sequence
    int $0x3
    leal (,%ecx,4), %edx
First instruction sequence:
    add $0x2, %eax
    jmp
Second instruction sequence (%eip now points in here):
    add $0x2, %eax
    int $0x7f
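Moving %eip to "the relative location" means keeping the same byte offset while switching sequences. A minimal sketch, assuming both sequences start with the same copied instruction so offsets line up; the function name, base addresses, and lengths are hypothetical.

```python
def relocate_eip(eip, first_seq_base, second_seq_base, first_seq_len):
    """Move %eip from the first scratch sequence to the same offset in
    the second one. Addresses outside the first sequence are unchanged."""
    offset = eip - first_seq_base
    if not 0 <= offset < first_seq_len:
        return eip
    return second_seq_base + offset

FIRST = 0x1000     # hypothetical: add $0x2, %eax (3 bytes) then jmp (5 bytes)
SECOND = 0x1040    # hypothetical: add $0x2, %eax (3 bytes) then int $0x7f
FIRST_LEN = 8

at_add = relocate_eip(FIRST, FIRST, SECOND, FIRST_LEN)      # add not yet run
at_jmp = relocate_eip(FIRST + 3, FIRST, SECOND, FIRST_LEN)  # add already run
```

With this layout, a thread that had already executed the `add` lands on the `int $0x7f`, so it traps back into the kernel instead of jumping away with the signal still pending.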

Slide 27

Slide 27 text

Instrumenting a function
The DTrace library disassembles traced functions and creates probes at the desired places
This is dicey!
Overwriting data with a trap will bring the process to a grinding halt – if you're lucky
Need to be very conservative about what we let users trace

Slide 28

Slide 28 text

Disassembly pitfalls
Compilers sometimes put data inside functions' symbol bounds
A favorite type of data is offsets for jump tables – indirect branches usually as the result of a switch statement

Slide 29

Slide 29 text

Disassembly pitfalls, cont.
On SPARC, we're lucky: jump table offsets are always illegal instructions
  [-0x8000 .. 0x3fffff] are illegal instructions
On x86, interpreting one offset as an instruction invalidates all subsequent disassembly
DTrace gives up at the first sign of jump tables
Stay tuned for a better answer...
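The SPARC check can be sketched as a range test on each 32-bit word. The sketch takes the range on the slide at face value and is not the library's actual code; the function names are hypothetical. The sample encoding 0x92026002 is `add %o1, 2, %o1` and 0x01000000 is `nop`.

```python
def to_signed32(word):
    """Interpret a 32-bit word as a signed two's-complement integer."""
    word &= 0xffffffff
    return word - (1 << 32) if word & 0x80000000 else word

def looks_like_jump_table_offset(word):
    """True if the word falls in the range the slide calls illegal
    instructions, so it is safer to treat it as data than as code."""
    return -0x8000 <= to_signed32(word) <= 0x3fffff

small_positive = looks_like_jump_table_offset(0x00000040)
small_negative = looks_like_jump_table_offset(0xfffffff0)  # -16
real_add = looks_like_jump_table_offset(0x92026002)        # add %o1, 2, %o1
real_nop = looks_like_jump_table_offset(0x01000000)        # nop
```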

Slide 30

Slide 30 text

Early tracing
This thread-specific scratch region is implemented as a member of libc's per-thread ulwp_t structure
Not there until libc is loaded
The runtime linker has some special communication with the kernel so that the thread register is set up by the time ld.so.1 executes its first instruction

Slide 31

Slide 31 text

Performance considerations
Installing that trap is not cheap
  mapin, fault, mapout
What was a single instruction becomes multiple context switches, hundreds of instructions, and cache misses
We try to optimize the common cases: entry and return probes
Emulation is faster than displaced execution

Slide 32

Slide 32 text

Optimizing SPARC
Functions usually start with a save or a sethi; end with a restore
We emulate sethi in the kernel
Partially emulate save and restore
  Stash helpful instructions in the thread structure
  Set %pc to one of those instructions
  This saves us the E$ miss

Slide 33

Slide 33 text

Optimizing x86/amd64
Functions start with a pushl %ebp; end with a ret
Emulate both in the kernel – easy
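To see why emulating these two instructions is easy, here is a toy model of what each one does to a 32-bit register set. It is a sketch, not the kernel code: registers are a dictionary, memory is a dictionary of 32-bit words keyed by address, and the names and sample addresses are hypothetical.

```python
MASK32 = 0xffffffff

def emulate_pushl_ebp(regs, mem):
    """pushl %ebp: decrement %esp by 4, store %ebp there, step past
    the one-byte instruction."""
    regs["esp"] = (regs["esp"] - 4) & MASK32
    mem[regs["esp"]] = regs["ebp"]
    regs["eip"] = (regs["eip"] + 1) & MASK32

def emulate_ret(regs, mem):
    """ret: pop the return address from the stack into %eip."""
    regs["eip"] = mem[regs["esp"]]
    regs["esp"] = (regs["esp"] + 4) & MASK32

# Entry probe: the thread trapped on the function's first instruction.
entry = {"eip": 0x8050a00, "esp": 0x8047000, "ebp": 0x8047100}
entry_mem = {}
emulate_pushl_ebp(entry, entry_mem)

# Return probe: the return address is on top of the stack.
exit_regs = {"eip": 0x8050a40, "esp": 0x8047000, "ebp": 0x8047100}
exit_mem = {0x8047000: 0x8050123}
emulate_ret(exit_regs, exit_mem)
```

Each emulation is a couple of register updates and one memory access, with no copy to the scratch space and no second trip through user code.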

Slide 34

Slide 34 text

Beyond the pid provider
The pid provider offers extensive coverage
As with fbt for the kernel, it can be hard to grok all those probes
For the kernel we invented statically defined tracing providers (SDT)
For user-land we have the equivalent

Slide 35

Slide 35 text

User-land statically defined tracing (USDT)
Developers can embed stable, well-documented probes in applications
Users, sys-admins, and service engineers can plug into those probes to understand what the application is doing and how it relates to the system
Instrumentation uses the same techniques as the pid provider

Slide 36

Slide 36 text

Adding a USDT provider
Add a call to DTRACE_PROBE*( )
Put a provider description in a .d file:
    provider oracle {
        probe transaction-start(id_t id);
        probe transaction-finish(id_t id);
    };
Run dtrace -G with object files and provider description
Link resulting object file into the binary

Slide 37

Slide 37 text

USDT providers
The plockstat provider offers probes in libc's synchronization primitives
Active work on a hotspot provider for the JVM
Work ongoing to make USDT more robust and to add providers to more applications from Sun and ISVs

Slide 38

Slide 38 text

USDT providers, cont.
Lots of possibilities for USDT
  nscd(1M), fmd(1M), daemons, complex libraries
Try it out – it's very easy
Send some mail to the interest list on opensolaris.org
Ask a question on the DTrace forum

Slide 39

Slide 39 text

Under the Hood: Inside the DTrace pid Provider Adam Leventhal http://blogs.sun.com/ahl