Under the Hood: Inside the DTrace pid Provider

Adam Leventhal
March 01, 2005

Transcript

  1. The pid provider

    - The DTrace instrumentation provider for tracing user-level instructions
    - Can instrument any running process at any function entry or return, and at any offset, as long as there are symbols – don't strip that binary!
  2. Previous trace tools

    - Some tools serialize (truss -u)
    - Others are lossy (DProbes)
    - Neither is acceptable in DTrace
    - Serialization would mean that the pid provider could “scare away” the problems you're trying to observe
    - Lossiness would lead to inconsistent data and invalid conclusions
  3. The basic idea

    - Can't stop the process and we can't allow a window of lossiness
    - Don't execute the traced instruction at its original address
    - Instead, execute it at some other, thread-specific address: displaced execution
  4. Displaced execution (1/5)

    - How does displaced execution work?
    - Let's say we want to instrument the first instruction in this sequence:

        add %o1, 2, %o1
        srl %o3, 8, %o4
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]
  5. Displaced execution (2/5)

    - Record the traced instruction in an in-kernel hash table keyed by address

    Instruction stream:

        add %o1, 2, %o1
        srl %o3, 8, %o4
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]

    Hash table:

        0x104c: add %o1, 2, %o1
  6. Displaced execution (3/5)

    - Replace the traced instruction with a trapping instruction

    Instruction stream:

        ta  0x38            ! trap
        srl %o3, 8, %o4
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]

    Hash table:

        0x104c: add %o1, 2, %o1
  7. Displaced execution (4/5)

    - When the trap is executed, copy the instruction to a per-thread address

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]

    Hash table:

        0x104c: add %o1, 2, %o1

    Per-thread scratch space:

        add %o1, 2, %o1
  8. Displaced execution (5/5)

    - Arrange to return to the original instruction stream (more on that later)

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]

    Hash table:

        0x104c: add %o1, 2, %o1

    Per-thread scratch space:

        add %o1, 2, %o1
        <go to next instr>
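The five steps above can be sketched as a toy model. This is only an illustration in Python, not the kernel implementation: the `Process` class, the function names, the instruction addresses after 0x104c, and the `("goto", addr)` marker are all assumptions made for the sketch.

```python
# Toy model of displaced execution. All names here are hypothetical;
# only the instruction text and the address 0x104c come from the slides.
TRAP = "ta 0x38"

class Process:
    def __init__(self, text):
        self.text = dict(text)    # address -> instruction (the process image)
        self.tracepoints = {}     # stands in for the in-kernel hash table
        self.scratch = {}         # thread id -> displaced instruction sequence

def install_tracepoint(proc, addr):
    """Record the original instruction, then overwrite it with a trap."""
    proc.tracepoints[addr] = proc.text[addr]
    proc.text[addr] = TRAP

def on_trap(proc, tid, addr, insn_size=4):
    """Copy the saved instruction into the thread's scratch space and
    arrange to continue at the instruction after the tracepoint."""
    original = proc.tracepoints[addr]
    proc.scratch[tid] = [original, ("goto", addr + insn_size)]
    return proc.scratch[tid]

proc = Process({
    0x104c: "add %o1, 2, %o1",
    0x1050: "srl %o3, 8, %o4",      # assumed address (4-byte instructions)
    0x1054: "st %o4, [%o0]",
    0x1058: "st %o3, [%o0 + 4]",
})
install_tracepoint(proc, 0x104c)
displaced = on_trap(proc, tid=1, addr=0x104c)
```

Because the scratch space is per thread, another thread hitting the same trap gets its own copy and nothing has to be serialized.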
  9. Displaced execution caveat

    - Not all instructions can be displaced
    - Some depend on their location, e.g. relative branches, call and link
    - There are relatively few such instructions, so emulate them in the kernel
    - Much simpler than emulating the entire instruction set
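To see why a position-dependent instruction cannot simply be copied elsewhere, consider a PC-relative branch. The sketch below is a simplified model: the scratch address and the displacement are made-up values, and `emulate_relative_branch` is a hypothetical name for what an emulation routine would compute.

```python
def branch_target(pc, displacement):
    """A PC-relative branch jumps to (address it executes at) + displacement."""
    return pc + displacement

ORIGINAL_ADDR = 0x104c       # address of the traced instruction
SCRATCH_ADDR = 0x7f000040    # hypothetical per-thread scratch address
DISP = 0x20                  # hypothetical branch displacement

intended = branch_target(ORIGINAL_ADDR, DISP)      # where the program meant to go
if_displaced = branch_target(SCRATCH_ADDR, DISP)   # where a naive copy would go

def emulate_relative_branch(traced_addr, displacement):
    """Emulation computes the target from the original address instead of
    executing the instruction at the scratch address."""
    return traced_addr + displacement

emulated = emulate_relative_branch(ORIGINAL_ADDR, DISP)
```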
  10. Implementation: SPARC

    - SPARC came first (2002)
    - Pros:
        - Fixed-width instructions
        - Easy to disassemble
        - Delayed control transfers allow for a cool trick
        - Only a handful of instructions to emulate
    - Cons:
        - Turns out the cache architecture isn't designed with the pid provider in mind
  11. Using the %npc (1/3)

    - On SPARC, the pid provider can use an analogue to instruction picking to implement displaced execution
    - Consider the following:

        ta  0x38            ! %pc
        srl %o3, 8, %o4     ! %npc
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]
  12. Using the %npc (2/3)

    - When the user-level thread hits the trap, %pc is set to the thread-specific region; %npc stays the same

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4     ! %npc
        st  %o4, [%o0]
        st  %o3, [%o0 + 4]

    Per-thread scratch space:

        add %o1, 2, %o1     ! %pc
  13. Using the %npc (3/3)

    - After the original instruction is executed, control flow continues as usual

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4     ! %pc
        st  %o4, [%o0]      ! %npc
        st  %o3, [%o0 + 4]

    Per-thread scratch space:

        add %o1, 2, %o1
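The %pc/%npc trick can be traced with a few lines of Python. This is a simplified model that assumes only the usual SPARC rule for a non-branching instruction (%pc takes the value of %npc, and %npc advances by 4); the scratch address is a made-up value.

```python
INSN_SIZE = 4

def step(pc, npc):
    """Execute one non-branching instruction: %pc <- %npc, %npc <- %npc + 4."""
    return npc, npc + INSN_SIZE

TRACED = 0x104c      # address of the trap that replaced the traced instruction
SCRATCH = 0x20000    # hypothetical address of the per-thread scratch space

# State when the thread hits the trap.
pc, npc = TRACED, TRACED + INSN_SIZE

# The kernel redirects only %pc; %npc still points into the original stream.
pc = SCRATCH
visited = [pc]

# Executing the displaced instruction falls straight back into the original code.
pc, npc = step(pc, npc)
visited.append(pc)
pc, npc = step(pc, npc)
visited.append(pc)
```

In this model no explicit jump back is needed: the displaced instruction runs once from the scratch space and the next fetch comes from the original stream.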
  14. Cache trouble

    - Need to synchronize the split D$/I$ when the kernel copies the instruction
    - Do this with flush normally, but there's no flusha
    - Use a block store that ensures cache synchrony (ASI_BLK_COMMIT_S)
    - Requires the thread-specific scratch space to be 64 bytes long and 64-byte aligned
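The size and alignment requirement amounts to simple address arithmetic. The helper names below are hypothetical and the addresses are arbitrary; this is a sketch of the constraint, not of how the scratch space is actually allocated.

```python
BLOCK = 64  # block-store size in bytes, per the slide

def align_up(addr, alignment=BLOCK):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def is_valid_scratch(addr, size):
    """The scratch region must be exactly one aligned 64-byte block."""
    return size == BLOCK and addr % BLOCK == 0

candidate = 0x1001                 # arbitrary unaligned address
scratch = align_up(candidate)
```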
  15. Implementation: x86

    - Pros:
        - Fast, cheap, our future, etc.
        - Simple cache architecture (no SPARC-like cache synchronization problems)
    - Cons:
        - Variable width instructions
        - Huge instruction set
        - Difficult to disassemble
  16. x86 instruction sequence

    - Copy out the original instruction and jump to the subsequent instruction

    Instruction stream:

        int  $0x3
        leal (,%ecx,4), %edx
        movl %edx, (%ebx)
        movl %ecx, 0x4(%ebx)

    Per-thread scratch space (%eip points here):

        add  $0x2, %eax
        jmp  <next instr>
  17. Implementation: amd64

    - Like x86, but with one big exception: %rip-relative addressing
    - Recall we need to emulate position-dependent instructions
    - With %rip-relative addressing, almost every instruction can potentially be position-dependent
  18. %rip-relative addressing

    - Say the original instruction is movq 0xed8(%rip), %rcx
    - Move the address of the original instruction into %rax
    - Modify the instruction to be %rax-relative
    - Copy and execute:

        movq 0xed8(%rax), %rcx
        movq $<old_rax>, %rax
        jmp  0(%rip)
        <address of next instr>
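The rewrite works because the rewritten instruction computes the same effective address as the original. The sketch below models only that arithmetic. The instruction address and the saved %rax value are made up, and `run_rewritten` is a hypothetical name. Note that on amd64 the %rip used in a memory operand is the address of the following instruction, so that is the value loaded into %rax here.

```python
DISP = 0xed8
ORIG_ADDR = 0x401000   # hypothetical address of movq 0xed8(%rip), %rcx
INSN_LEN = 7           # assumed encoded length of that instruction
SCRATCH = 0x7f0000     # hypothetical scratch address

rip_seen = ORIG_ADDR + INSN_LEN        # %rip value the original would have used
original_ea = rip_seen + DISP          # address the program meant to load from
naive_ea = SCRATCH + INSN_LEN + DISP   # what an unmodified copy would load from

def run_rewritten(regs, rip_value, disp):
    """Model the rewritten sequence: %rax stands in for %rip, then is restored."""
    old_rax = regs["rax"]
    regs["rax"] = rip_value        # loaded before the thread resumes
    ea = regs["rax"] + disp        # movq 0xed8(%rax), %rcx
    regs["rax"] = old_rax          # movq $<old_rax>, %rax
    return ea

regs = {"rax": 0x1234}
rewritten_ea = run_rewritten(regs, rip_seen, DISP)
```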
  19. Problem with signals

    - Having thread-specific scratch space seems sufficient – a thread can only be in one place at a time
    - However, signals can come at any time, even while a thread is executing instructions out of the scratch region
    - We need special handling for synchronous and asynchronous signals
  20. Synchronous signals

    - A synchronous signal is due to an error when executing an instruction, e.g. divide-by-zero, touching unmapped memory
    - If the user-level thread is in its scratch region, reset the PC to the address of the traced instruction
    - When the signal handler returns it will hit the tracepoint again
  21. Asynchronous signals

    - An asynchronous signal is caused by a software call, e.g. a timer expiring, kill(2), pkill(1)
    - If the user-level thread is in its scratch region we need to defer delivery
    - Need to alert the kernel to deliver the signal before leaving the scratch region
    - Implementation is ISA-specific
  22. Asynchronous signals SPARC (1/2)

    - If the thread is in the scratch region and it gets an asynchronous signal...

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4     ! %npc

    Per-thread scratch space:

        add %o1, 2, %o1     ! %pc
        ...
        ta  0x3a
  23. Asynchronous signals SPARC (2/2)

    - ... point %npc to a trap in the scratch region to re-enter the kernel later

    Instruction stream:

        ta  0x38
        srl %o3, 8, %o4

    Per-thread scratch space:

        add %o1, 2, %o1     ! %pc
        ...
        ta  0x3a            ! %npc
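A small model of the SPARC deferral, reusing the %pc/%npc stepping rule. The scratch addresses are made up, and what happens once the second trap is taken (deliver the signal, then resume at the original next instruction) is an assumption of this sketch rather than something the slides spell out.

```python
INSN_SIZE = 4

def step(pc, npc):
    """Execute one non-branching instruction: %pc <- %npc, %npc <- %npc + 4."""
    return npc, npc + INSN_SIZE

NEXT_INSTR = 0x1050      # srl in the original stream (assumed address)
SCRATCH_INSN = 0x20000   # hypothetical: the displaced add
SCRATCH_TRAP = 0x20004   # hypothetical: the ta 0x3a in the scratch region

def run_displaced(signal_pending):
    """Return the events that follow the displaced instruction."""
    pc, npc = SCRATCH_INSN, NEXT_INSTR
    if signal_pending:
        npc = SCRATCH_TRAP           # the kernel redirects %npc and defers delivery
    pc, npc = step(pc, npc)          # the displaced instruction executes
    if pc == SCRATCH_TRAP:
        # Assumed behavior: the trap re-enters the kernel, which delivers the
        # signal and then resumes at the original next instruction.
        return ["deliver signal", NEXT_INSTR]
    return [pc]

without_signal = run_displaced(False)
with_signal = run_displaced(True)
```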
  24. Asynchronous signals x86/amd64 (1/2)

    - Actually copy an alternate instruction sequence ending in a trap
    - NB: %eip can be anywhere in the first instruction sequence

    Instruction stream:

        int  $0x3
        leal (,%ecx,4), %edx

    First instruction sequence (%eip is somewhere in here):

        add  $0x2, %eax
        jmp  <next instr>

    Second instruction sequence:

        add  $0x2, %eax
        int  $0x7f
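  25. Asynchronous signals x86/amd64 (2/2)

    - Move %eip to the relative location in the second instruction sequence

    Instruction stream:

        int  $0x3
        leal (,%ecx,4), %edx

    First instruction sequence:

        add  $0x2, %eax
        jmp  <next instr>

    Second instruction sequence (%eip is moved here, at the same relative offset):

        add  $0x2, %eax
        int  $0x7f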
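Why moving %eip to the same relative location works can be seen by modeling the two sequences as lists. Both begin with the same copy of the traced instruction, so an index into the first is also a valid index into the second. The sketch uses instruction indexes rather than byte offsets for simplicity, and `defer_signal` is a hypothetical name.

```python
FIRST = ["add $0x2, %eax", "jmp <next instr>"]   # normal displaced sequence
SECOND = ["add $0x2, %eax", "int $0x7f"]         # alternate sequence ending in a trap

def defer_signal(index_in_first):
    """Return what the thread still executes after %eip is moved to the same
    position in the second sequence."""
    return SECOND[index_in_first:]

not_yet_executed = defer_signal(0)   # signal arrived before the add ran
already_executed = defer_signal(1)   # signal arrived after the add, before the jmp
```

In both cases the traced instruction runs exactly once and the thread then traps, giving the kernel the chance to deliver the deferred signal.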
  26. Instrumenting a function

    - The DTrace library disassembles traced functions and creates probes at the desired places
    - This is dicey!
    - Overwriting data with a trap will bring the process to a grinding halt – if you're lucky
    - Need to be very conservative about what we let users trace
  27. Disassembly pitfalls

    - Compilers sometimes put data inside functions' symbol bounds
    - A favorite type of data is offsets for jump tables – indirect branches, usually as the result of a switch statement
  28. Disassembly pitfalls, cont.

    - On SPARC, we're lucky: jump table offsets are always illegal instructions
        - [-0x8000 .. 0x3fffff] are illegal instructions
    - On x86, interpreting one offset as an instruction invalidates all subsequent disassembly
    - DTrace gives up at the first sign of jump tables
    - Stay tuned for a better answer...
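The non-negative half of that range can be checked by decoding the SPARC instruction fields. In SPARC V9, a word with op = 0 (bits 31:30) and op2 = 0 (bits 24:22) is the ILLTRAP encoding, and every value from 0 to 0x3fffff has both fields zero. The sketch below covers only that half; negative offsets are not modeled, and the function names are hypothetical.

```python
def sparc_fields(word):
    """Extract the op (bits 31:30) and op2 (bits 24:22) fields of a 32-bit word."""
    word &= 0xFFFFFFFF
    op = word >> 30
    op2 = (word >> 22) & 0x7
    return op, op2

def decodes_as_illtrap(word):
    """op == 0 and op2 == 0 is the ILLTRAP encoding."""
    return sparc_fields(word) == (0, 0)

SAMPLE_OFFSETS = [0x0, 0x4, 0x1f0, 0x3fffff]   # plausible jump table entries
NOP = 0x01000000                                # sethi 0, %g0: a real instruction

offsets_illegal = all(decodes_as_illtrap(w) for w in SAMPLE_OFFSETS)
```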
  29. Early tracing

    - This thread-specific scratch region is implemented as a member of libc's per-thread ulwp_t structure
    - Not there until libc is loaded
    - The runtime linker has some special communication with the kernel so that the thread register is set up by the time ld.so.1 executes its first instruction
  30. Performance considerations

    - Installing that trap is not cheap: mapin, fault, mapout
    - What was a single instruction becomes multiple context switches, hundreds of instructions, and cache misses
    - We try to optimize the common cases: entry and return probes
    - Emulation is faster than displaced execution
  31. Optimizing SPARC

    - Functions usually start with a save or a sethi; end with a restore
    - We emulate sethi in the kernel
    - Partially emulate save and restore
    - Stash helpful instructions in the thread structure
    - Set %pc to one of those instructions
    - This saves us the E$ miss
  32. Optimizing x86/amd64

    - Functions start with a pushl %ebp; end with a ret
    - Emulate both in the kernel – easy
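Emulating these two instructions only requires updating the saved register state and the user stack. A minimal sketch for 32-bit x86 follows, with the stack modeled as a dictionary from address to value. The function names and all addresses are hypothetical, and error handling (for example, an unmapped stack page) is omitted.

```python
def emulate_pushl_ebp(regs, stack):
    """pushl %ebp: decrement %esp by 4, store %ebp there, move past the instruction."""
    regs["esp"] -= 4
    stack[regs["esp"]] = regs["ebp"]
    regs["eip"] += 1                 # pushl %ebp is a one-byte instruction (0x55)

def emulate_ret(regs, stack):
    """ret: pop the return address into %eip."""
    regs["eip"] = stack[regs["esp"]]
    regs["esp"] += 4

# Entry probe: the function's first instruction is pushl %ebp.
entry_regs = {"eip": 0x8048000, "esp": 0x1000, "ebp": 0xbeef}
entry_stack = {}
emulate_pushl_ebp(entry_regs, entry_stack)

# Return probe: %esp points at the caller's return address.
ret_regs = {"eip": 0x8048040, "esp": 0xffc, "ebp": 0xbeef}
ret_stack = {0xffc: 0x8049123}
emulate_ret(ret_regs, ret_stack)
```

No scratch space is touched in this path, which is why emulation avoids the cost of displaced execution.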
  33. Beyond the pid provider

    - The pid provider offers extensive coverage
    - As with fbt for the kernel, it can be hard to grok all those probes
    - For the kernel we invented statically defined tracing providers (SDT)
    - For user-land we have the equivalent
  34. User-land statically defined tracing (USDT)

    - Developers can embed stable, well-documented probes in applications
    - Users, sys-admins, and service engineers can plug into those probes to understand what the application is doing and how it relates to the system
    - Instrumentation uses the same techniques as the pid provider
  35. Adding a USDT provider

    - Add a call to DTRACE_PROBE*()
    - Put a provider description in a .d file:

        provider oracle {
            probe transaction-start(id_t id);
            probe transaction-finish(id_t id);
        };

    - Run dtrace -G with object files and provider description
    - Link resulting object file into the binary
  36. USDT providers

    - The plockstat provider offers probes in libc's synchronization primitives
    - Active work on a hotspot provider for the JVM
    - Work ongoing to make USDT more robust and to add providers to more applications from Sun and ISVs
  37. USDT providers, cont.

    - Lots of possibilities for USDT: nscd(1M), fmd(1M), daemons, complex libraries
    - Try it out – it's very easy
    - Send some mail to the interest list on opensolaris.org
    - Ask a question on the DTrace forum