Under the Hood:
Inside the DTrace pid Provider
Adam Leventhal
Solaris Kernel Development
Slide 2
The pid provider
The DTrace instrumentation provider for
tracing user-level instructions
Can instrument any running process at
any* function entry or return, and at
any offset
* As long as there are symbols – don't strip that binary!
Slide 3
Previous trace tools
Some tools serialize (truss -u)
Others are lossy (DProbes)
Neither is acceptable in DTrace
Serialization would mean that the pid provider
could “scare away” the problems you're trying
to observe
Lossiness would lead to inconsistent data and
invalid conclusions
Slide 4
The basic idea
Can't stop the process and we can't
allow a window of lossiness
Don't execute the traced instruction at
its original address
Instead, execute it at some other,
thread-specific address: displaced
execution
Slide 5
Displaced execution (1/5)
How does displaced execution work?
Let's say we want to instrument the first
instruction in this sequence:
add %o1, 2, %o1
srl %o3, 8, %o4
st %o4, [%o0]
st %o3, [%o0 + 4]
Slide 6
Displaced execution (2/5)
Record the traced instruction in an in-
kernel hash table keyed by address
add %o1, 2, %o1
srl %o3, 8, %o4
st %o4, [%o0]
st %o3, [%o0 + 4]
hash table: 0x104c -> add %o1, 2, %o1
Slide 7
Displaced execution (3/5)
Replace the traced instruction with a
trapping instruction.
ta 0x38 ! trap
srl %o3, 8, %o4
st %o4, [%o0]
st %o3, [%o0 + 4]
hash table: 0x104c -> add %o1, 2, %o1
Slide 8
Displaced execution (4/5)
When the trap is executed, copy the
instruction to a per-thread address
ta 0x38
srl %o3, 8, %o4
st %o4, [%o0]
st %o3, [%o0 + 4]
hash table: 0x104c -> add %o1, 2, %o1
per-thread copy: add %o1, 2, %o1
Slide 9
Displaced execution (5/5)
Arrange to return to the original
instruction stream (more on that later)
ta 0x38
srl %o3, 8, %o4
st %o4, [%o0]
st %o3, [%o0 + 4]
hash table: 0x104c -> add %o1, 2, %o1
per-thread copy: add %o1, 2, %o1
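The five steps above can be sketched as a toy model. This is only an illustration: the Process class, its method names, and the addresses after 0x104c are assumptions made for the sketch, and the real provider does this work in the kernel on machine code rather than on strings.

```python
# Toy model of displaced execution. All names are hypothetical.
TRAP = "ta 0x38"


class Process:
    def __init__(self, text):
        self.text = dict(text)   # address -> instruction (program text)
        self.tracepoints = {}    # stand-in for the in-kernel hash table
        self.scratch = {}        # thread id -> displaced instruction

    def enable(self, addr):
        """Record the original instruction, then overwrite it with a trap."""
        self.tracepoints[addr] = self.text[addr]
        self.text[addr] = TRAP

    def on_trap(self, tid, addr, fired):
        """Fire the probe and copy the original instruction to scratch space."""
        fired.append(addr)
        self.scratch[tid] = self.tracepoints[addr]
        return self.scratch[tid]

    def run(self, tid, start, count):
        """Execute `count` instructions; return (executed, fired probes)."""
        executed, fired = [], []
        pc = start
        for _ in range(count):
            insn = self.text[pc]
            if insn == TRAP:
                insn = self.on_trap(tid, pc, fired)  # runs from scratch
            executed.append(insn)
            pc += 4  # fixed-width instructions, as on SPARC
        return executed, fired


ORIGINAL_TEXT = {
    0x104c: "add %o1, 2, %o1",
    0x1050: "srl %o3, 8, %o4",      # addresses past 0x104c are assumed
    0x1054: "st %o4, [%o0]",
    0x1058: "st %o3, [%o0 + 4]",
}
```

The thread still executes the same four instructions in order, but the probe fires once and the program text holds a trap at 0x104c.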
Slide 10
Displaced execution caveat
Not all instructions can be displaced
Some depend on their location
e.g. relative branches, call and link
There are relatively few such instructions,
so emulate them in the kernel
Much simpler than emulating the entire
instruction set
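A small sketch of why location matters. The mnemonic set below is illustrative rather than the provider's actual list, and the scratch address is made up: a PC-relative branch computes its target from wherever it executes, so running it from scratch space would send the thread to the wrong place.

```python
# Illustrative, not exhaustive: SPARC instructions whose effect depends on %pc.
POSITION_DEPENDENT = {"call", "ba", "be", "bne", "jmpl"}


def plan(insn):
    """Decide how a traced instruction would be handled in this toy model."""
    mnemonic = insn.split()[0].split(",")[0]   # "bne,a" -> "bne"
    return "emulate" if mnemonic in POSITION_DEPENDENT else "displace"


def relative_target(pc, displacement):
    """Target of a PC-relative control transfer executed at address `pc`."""
    return pc + displacement


TRACED_PC = 0x104c     # address used in the earlier slides
SCRATCH_PC = 0x7f000   # hypothetical per-thread scratch address

intended = relative_target(TRACED_PC, 0x40)    # what the program meant
displaced = relative_target(SCRATCH_PC, 0x40)  # what displacement would do
```

Emulating the branch in the kernel amounts to computing `relative_target` with the traced address instead of the scratch address.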
Slide 11
Implementation: SPARC
SPARC came first (2002)
Pros:
Fixed-width instructions
Easy to disassemble
Delayed control transfers allow for a cool trick
Only a handful of instructions to emulate
Cons:
Turns out the cache architecture isn't designed
with the pid provider in mind
Slide 12
Using the %npc (1/3)
On SPARC, the pid provider can use an
analogue to instruction picking to
implement displaced execution
Consider the following:
ta 0x38 ! <- %pc
srl %o3, 8, %o4 ! <- %npc
st %o4, [%o0]
st %o3, [%o0 + 4]
Slide 13
Using the %npc (2/3)
When the user-level thread hits the trap,
%pc is set to the thread-specific
region; %npc stays the same
ta 0x38
srl %o3, 8, %o4 ! <- %npc
st %o4, [%o0]
st %o3, [%o0 + 4]
add %o1, 2, %o1 ! <- %pc (thread-specific region)
Slide 14
Using the %npc (3/3)
After the original instruction is executed,
control flow continues as usual
ta 0x38
srl %o3, 8, %o4 ! <- %pc
st %o4, [%o0] ! <- %npc
st %o3, [%o0 + 4]
add %o1, 2, %o1 ! (thread-specific region)
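The %npc trick can be sketched with the SPARC rule that a non-branching instruction sets %pc to %npc and %npc to %npc + 4. The scratch address and the addresses after 0x104c are assumptions for illustration.

```python
def step(pc, npc):
    """One non-branching SPARC instruction: %pc <- %npc, %npc <- %npc + 4."""
    return npc, npc + 4


def trace(pc, npc, steps):
    """Return the addresses executed over `steps` instructions."""
    visited = []
    for _ in range(steps):
        visited.append(pc)
        pc, npc = step(pc, npc)
    return visited


TRACED = 0x104c     # address of the trap that replaced the add
SCRATCH = 0x7f000   # hypothetical thread-specific scratch address

# After the trap, the kernel points %pc at the scratch copy and leaves
# %npc at the instruction following the tracepoint.
visited = trace(pc=SCRATCH, npc=TRACED + 4, steps=3)
```

The displaced instruction runs once from scratch space and the thread then falls back into the original stream at 0x1050 with no explicit jump.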
Slide 15
Cache trouble
Need to synchronize the split D$/I$
when the kernel copies the instruction
Do this with flush normally, but there's
no flusha
Use a block store that ensures cache
synchrony (ASI_BLK_COMMIT_S)
Requires that the thread-specific scratch
space be 64 bytes long and 64-byte aligned
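The size and alignment constraint can be sketched as below. The helper names are hypothetical, and zero padding is an assumption of the sketch; the point is only that the instruction is written as part of a full, aligned 64-byte block.

```python
BLOCK = 64  # bytes written by one block store


def align_up(addr, alignment=BLOCK):
    """Round `addr` up to the next multiple of `alignment` (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def is_block_aligned(addr):
    return addr % BLOCK == 0


def scratch_block(insn_word):
    """A 64-byte block holding one 4-byte big-endian instruction, zero padded."""
    return insn_word.to_bytes(4, "big") + bytes(BLOCK - 4)


ADD_O1_2_O1 = 0x92026002  # encoding of add %o1, 2, %o1
```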
Slide 16
Implementation: x86
Pros:
Fast, cheap, our future, etc.
Simple cache architecture (no SPARC-like cache
synchronization problems)
Cons:
Variable width instructions
Huge instruction set
Difficult to disassemble
Slide 17
x86 instruction sequence
Copy out the original instruction and
jump to the subsequent instruction
int $0x3
leal (,%ecx,4), %edx
movl %edx, (%ebx)
movl %ecx, 0x4(%ebx)
add $0x2, %eax
jmp
%eip
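A minimal sketch of building that scratch sequence for 32-bit x86, assuming the jump is encoded as a near jmp with a 32-bit relative displacement (opcode 0xE9). The function names and both addresses are hypothetical.

```python
import struct

JMP_REL32 = 0xE9  # x86 near jump, 32-bit displacement relative to next insn
JMP_LEN = 5       # one opcode byte plus four displacement bytes
MASK32 = 0xFFFFFFFF


def build_scratch(insn_bytes, traced_addr, scratch_addr):
    """Original instruction followed by a jmp to the instruction after it."""
    next_insn = traced_addr + len(insn_bytes)
    jmp_addr = scratch_addr + len(insn_bytes)
    rel32 = (next_insn - (jmp_addr + JMP_LEN)) & MASK32  # wraps like hardware
    return insn_bytes + bytes([JMP_REL32]) + struct.pack("<I", rel32)


def jmp_destination(scratch, scratch_addr, insn_len):
    """Decode where the trailing jmp in `scratch` lands."""
    (rel32,) = struct.unpack_from("<I", scratch, insn_len + 1)
    return (scratch_addr + insn_len + JMP_LEN + rel32) & MASK32


ADD_EAX_2 = bytes([0x83, 0xC0, 0x02])  # add $0x2, %eax
```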
Slide 18
Implementation: amd64
Like x86, but with one big exception:
%rip-relative addressing
Recall we need to emulate position-
dependent instructions
With %rip-relative addressing, almost
every instruction can potentially be
position-dependent
Slide 19
%rip-relative addressing
Say the original instruction is
movq 0xed8(%rip), %rcx
Modify the instruction to be %rax-relative
Move the address of the original instruction into %rax
Copy and execute
movq $, %rax
movq 0xed8(%rax), %rcx
jmp 0(%rip)
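The rewrite can be sketched at the byte level. In 64-bit mode a ModRM byte with mod = 00 and r/m = 101 selects %rip + disp32; changing it to mod = 10 and r/m = 000 selects disp32(%rax) with the same displacement bytes. The function names and the traced address are hypothetical, and the sketch assumes the value loaded into %rax is the base the hardware would have used, which is the address following the instruction.

```python
import struct

ORIGINAL = bytes.fromhex("488b0dd80e0000")  # movq 0xed8(%rip), %rcx
MODRM_INDEX = 2                             # after the REX prefix and opcode


def is_rip_relative(modrm):
    """mod == 00 and r/m == 101 means %rip + disp32 in 64-bit mode."""
    return (modrm >> 6) == 0b00 and (modrm & 0b111) == 0b101


def make_rax_relative(insn, modrm_index):
    """Rewrite the ModRM byte to mod == 10, r/m == 000: disp32(%rax)."""
    modrm = insn[modrm_index]
    if not is_rip_relative(modrm):
        raise ValueError("instruction is not %rip-relative")
    reg = (modrm >> 3) & 0b111               # destination register is kept
    new_modrm = (0b10 << 6) | (reg << 3) | 0b000
    return insn[:modrm_index] + bytes([new_modrm]) + insn[modrm_index + 1:]


def effective_address(base, insn, modrm_index):
    """Base register value plus the signed 32-bit displacement."""
    (disp,) = struct.unpack_from("<i", insn, modrm_index + 1)
    return base + disp


TRACED_PC = 0x401000                     # hypothetical address
RIP_BASE = TRACED_PC + len(ORIGINAL)     # %rip as seen by the instruction
REWRITTEN = make_rax_relative(ORIGINAL, MODRM_INDEX)
```

With %rax set to `RIP_BASE`, the rewritten instruction touches the same memory from any scratch address.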
Slide 20
Problem with signals
Having thread-specific scratch space
seems sufficient – a thread can only be
one place at a time
However, signals can come at any time
even while a thread is executing
instructions out of the scratch region
We need special handling for
synchronous and asynchronous signals
Slide 21
Synchronous signals
A synchronous signal is due to an error
when executing an instruction
e.g. divide-by-zero, touching unmapped memory
If the user-level thread is in its scratch
region, reset the PC to the address of
the traced instruction
When the signal handler returns it will
hit the tracepoint again
Slide 22
Asynchronous signals
An asynchronous signal is caused by a
software call
e.g. a timer expiring, kill(2), pkill(1)
If the user-level thread is in its scratch
region we need to defer delivery
Need to alert the kernel to deliver the
signal before leaving the scratch region
Implementation is ISA-specific
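The two policies can be sketched as one decision function. The Thread class, the field names, and the idea of classifying signals by name are simplifications for illustration (a signal such as SIGSEGV can also be sent asynchronously with kill(2)).

```python
# Simplified: treat these signal names as synchronous (caused by the
# instruction itself) and everything else as asynchronous.
SYNCHRONOUS = {"SIGFPE", "SIGSEGV", "SIGBUS", "SIGILL"}


class Thread:
    def __init__(self, pc, traced_pc, scratch_base, scratch_size=64):
        self.pc = pc
        self.traced_pc = traced_pc        # address of the tracepoint
        self.scratch_base = scratch_base
        self.scratch_size = scratch_size
        self.deferred = []                # signals held until scratch is left

    def in_scratch(self):
        end = self.scratch_base + self.scratch_size
        return self.scratch_base <= self.pc < end


def on_signal(thread, signal):
    """Return "deliver" or "defer" for a signal arriving at `thread`."""
    if not thread.in_scratch():
        return "deliver"
    if signal in SYNCHRONOUS:
        # The handler must see the traced address, not the scratch address.
        thread.pc = thread.traced_pc
        return "deliver"
    thread.deferred.append(signal)
    return "defer"
```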
Slide 23
Asynchronous signals
SPARC (1/2)
If the thread is in the scratch region and
it gets an asynchronous signal...
ta 0x38
srl %o3, 8, %o4 ! <- %npc
add %o1, 2, %o1 ! <- %pc (scratch region)
...
ta 0x3a
Slide 24
Asynchronous signals
SPARC (2/2)
... point %npc to a trap in the scratch
region to re-enter the kernel later
ta 0x38
srl %o3, 8, %o4
add %o1, 2, %o1 ! <- %pc (scratch region)
...
ta 0x3a ! <- %npc (scratch region)
Slide 25
Asynchronous signals
x86/amd64 (1/2)
Actually copy an alternate instruction
sequence ending in a trap
int $0x3
leal (,%ecx,4), %edx
add $0x2, %eax
jmp
add $0x2, %eax
int $0x7f
%eip
NB: %eip can be
anywhere in the first
instruction sequence
Slide 26
Asynchronous signals
x86/amd64 (2/2)
Move %eip to the relative location in the
second instruction sequence
int $0x3
leal (,%ecx,4), %edx
add $0x2, %eax
jmp
add $0x2, %eax
int $0x7f
%eip
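Moving %eip between the two sequences is an offset-preserving translation, sketched here. The base addresses and the function name are hypothetical; the byte encodings are the standard ones for add $0x2, %eax and int $0x7f.

```python
ADD_EAX_2 = bytes([0x83, 0xC0, 0x02])  # add $0x2, %eax
INT_7F = bytes([0xCD, 0x7F])           # int $0x7f
JMP_LEN = 5                            # length of a jmp with a 32-bit offset

FIRST_LEN = len(ADD_EAX_2) + JMP_LEN   # add ; jmp
SECOND = ADD_EAX_2 + INT_7F            # add ; int $0x7f


def relocate_eip(eip, first_base, second_base, first_len=FIRST_LEN):
    """Map %eip from the first sequence to the same offset in the second.

    %eip can only sit on an instruction boundary (offset 0 or 3 here),
    and both sequences share the same leading instruction.
    """
    offset = eip - first_base
    if 0 <= offset < first_len:
        return second_base + offset
    return eip  # not in the first sequence; leave it alone
```

A thread stopped before the add resumes at the copy of the add; a thread stopped at the jmp resumes at the int $0x7f and re-enters the kernel immediately.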
Slide 27
Instrumenting a function
The DTrace library disassembles traced
functions and creates probes at the
desired places
This is dicey!
Overwriting data with a trap will bring the
process to a grinding halt – if you're lucky
Need to be very conservative about what
we let users trace
Slide 28
Disassembly pitfalls
Compilers sometimes put data inside
functions' symbol bounds
A favorite type of data is offsets for jump
tables – indirect branches usually as
the result of a switch statement
Slide 29
Disassembly pitfalls, cont.
On SPARC, we're lucky: jump table
offsets are always illegal instructions
[-0x8000 .. 0x3fffff] are illegal instructions
On x86, interpreting one offset as an
instruction invalidates all subsequent
disassembly
DTrace gives up at the first sign of jump tables
Stay tuned for a better answer...
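The SPARC check can be sketched as a range test on each 32-bit word, using the range quoted above. The function names are hypothetical, and this is only a sketch of the idea, not the library's actual disassembly logic.

```python
def to_signed32(word):
    """Interpret a 32-bit word as a two's-complement signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def looks_like_jump_table_offset(word):
    """True if the word falls in the range the slide calls illegal."""
    return -0x8000 <= to_signed32(word) <= 0x3fffff


def first_suspect_offset(words):
    """Byte offset of the first word that looks like data, or None."""
    for index, word in enumerate(words):
        if looks_like_jump_table_offset(word):
            return index * 4
    return None


ADD_O1_2_O1 = 0x92026002  # add %o1, 2, %o1: a real instruction
```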
Slide 30
Early tracing
This thread-specific scratch region is
implemented as a member of libc's
per-thread ulwp_t structure
Not there until libc is loaded
The runtime linker has some special
communication with the kernel so that
the thread register is set up by the time
ld.so.1 executes its first instruction
Slide 31
Performance considerations
Installing that trap is not cheap
mapin, fault, mapout
What was a single instruction becomes
multiple context switches, hundreds of
instructions, and cache misses
We try to optimize the common cases
entry and return probes
Emulation is faster than displaced
execution
Slide 32
Optimizing SPARC
Functions usually start with a save or a
sethi; end with a restore
We emulate sethi in the kernel
Partially emulate save and restore
Stash helpful instructions in the thread structure
Set %pc to one of those instructions
This saves us the E$ miss
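Emulating sethi is cheap because the instruction only writes a constant. A minimal sketch, assuming a plain list stands in for the register file and using the SPARC instruction layout (op in bits 31:30, rd in bits 29:25, op2 in bits 24:22, imm22 in bits 21:0); the function names are hypothetical.

```python
def is_sethi(insn):
    """op (bits 31:30) == 0 and op2 (bits 24:22) == 4 identifies sethi."""
    return (insn >> 30) == 0 and ((insn >> 22) & 0x7) == 0x4


def emulate_sethi(insn, regs, pc, npc):
    """Apply sethi to `regs` and return the updated (pc, npc)."""
    rd = (insn >> 25) & 0x1f
    imm22 = insn & 0x3fffff
    if rd != 0:                 # %g0 is hardwired to zero
        regs[rd] = imm22 << 10  # imm22 lands in bits 31:10, low bits zeroed
    return npc, npc + 4         # no displaced execution needed


def encode_sethi(value, rd):
    """Build sethi %hi(value), rd. Helper for the example only."""
    return (rd << 25) | (0x4 << 22) | ((value >> 10) & 0x3fffff)
```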
Slide 33
Optimizing x86/amd64
Functions start with a pushl %ebp; end
with a ret
Emulate both in the kernel – easy
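Both instructions only touch the stack pointer, one stack slot, and %eip, which is why emulation is easy. A sketch for 32-bit x86, with a dict standing in for user memory; the function name, the register dict, and the addresses are assumptions for illustration.

```python
PUSHL_EBP = 0x55  # pushl %ebp, one byte
RET = 0xC3        # near ret, one byte
MASK32 = 0xFFFFFFFF


def emulate(opcode, regs, mem):
    """Apply the effect of pushl %ebp or ret to `regs` and `mem`."""
    if opcode == PUSHL_EBP:
        regs["esp"] = (regs["esp"] - 4) & MASK32
        mem[regs["esp"]] = regs["ebp"]               # store old frame pointer
        regs["eip"] = (regs["eip"] + 1) & MASK32     # step past the 1-byte insn
    elif opcode == RET:
        regs["eip"] = mem[regs["esp"]]               # pop the return address
        regs["esp"] = (regs["esp"] + 4) & MASK32
    else:
        raise ValueError("not emulated in this sketch")
```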
Slide 34
Beyond the pid provider
The pid provider offers extensive
coverage
As with fbt for the kernel, it can be hard
to grok all those probes
For the kernel we invented statically
defined tracing providers (SDT)
For user-land we have the equivalent
Slide 35
User-land statically defined
tracing (USDT)
Developers can embed stable, well-
documented probes in applications
Users, sys-admins, and service engineers
can plug into those probes to
understand what the application is
doing and how it relates to the system
Instrumentation uses the same
techniques as the pid provider
Slide 36
Adding a USDT provider
Add a call to DTRACE_PROBE*( )
Put a provider description in a .d file:
provider oracle {
    probe transaction__start(id_t id);
    probe transaction__finish(id_t id);
};
Run dtrace -G with object files and
provider description
Link resulting object file into the binary
Slide 37
USDT providers
The plockstat provider offers probes in
libc's synchronization primitives
Active work on a hotspot provider for the
JVM
Work ongoing to make USDT more
robust and to add providers to more
applications from Sun and ISVs
Slide 38
USDT providers, cont.
Lots of possibilities for USDT
nscd(1M), fmd(1M), daemons, complex libraries
Try it out – it's very easy
Send some mail to the interest list on
opensolaris.org
Ask a question on the DTrace forum
Slide 39
Under the Hood:
Inside the DTrace pid
Provider
Adam Leventhal
http://blogs.sun.com/ahl