DTrace in the Non-global Zone

My presentation at the BayLISA SmartOS meetup on August 16th, 2012. This talk was videoed, but has since been made unavailable. If it helps anyone recover it, the video was here: https://www.youtube.com/watch?v=atyvcYbY6Ic

Bryan Cantrill

August 17, 2022

Transcript

  1. DTrace in the
    Non-global Zone
    Bryan Cantrill
    SVP Engineering, Joyent
    @bcantrill
    [email protected]

  2. DTrace and zones: Fraternal twins
    •DTrace and zones were developed in parallel during
    development of Solaris 10
    •DTrace integrated (September 2003) before zones
    (early 2004)
    •When zones integrated, the priority was making
    DTrace in the global zone be able to meaningfully
    instrument non-global zones
    •DTrace in the non-global zone was hard — and a
    lower priority than other work on both technologies

  3. DTrace and zones: Basic functionality
    •In 2006, Dan Price (with help from Adam Leventhal
    and Jonathan Adams) added initial support for
    DTrace in the non-global zone
    •Allowed use of syscall provider, pid provider and (in
    a deranged, broken way) the profile provider
    •This was significant work: required modifications to
    both the zones privilege model and the DTrace
    privilege model
    •For example, required an implicit predicate on
    syscall and profile probes
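
One way to picture the implicit predicate: every enabling made from a non-global zone carries an extra, invisible condition that the firing thread belongs to the consumer's own zone. The sketch below is a user-space model of that idea only. The names `enabling_t`, `probe_should_fire` and the `GLOBAL_ZONEID` value are assumptions made for illustration, not the illumos implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GLOBAL_ZONEID 0   /* assumed ID for the global zone in this model */

/* Hypothetical record of who created an enabling. */
typedef struct enabling {
    int consumer_zoneid;  /* zone in which the DTrace consumer runs */
} enabling_t;

/*
 * Model of the implicit predicate: a consumer in the global zone sees
 * every firing; a consumer in a non-global zone sees only firings that
 * occur in its own zone.
 */
bool
probe_should_fire(const enabling_t *e, int firing_zoneid)
{
    assert(e != NULL);
    if (e->consumer_zoneid == GLOBAL_ZONEID)
        return (true);
    return (e->consumer_zoneid == firing_zoneid);
}
```

In this model a system call made in zone 8 is simply invisible to a consumer in zone 7, with no predicate written by the user.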

  4. DTrace and zones in SmartOS
    •As the world's heaviest user of zones, we at Joyent
    ran into (and fixed) a number of annoying bugs:
    •USDT probes from the non-global zone were not
    properly being enabled in the global zone
    (illumos#908)
    •Tick and profile probes did not properly fire when
    used in the non-global zone (illumos#1456)
    •Fixing the latter required an extension of the DTrace
    privilege model: introduced a notion of restricted
    operation in which args could not be referenced
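
Restricted operation can be modelled as a flag on the probe context that turns any argument read into a reported fault rather than a value. This is a sketch under assumptions: `probe_ctx_t`, `read_arg` and the fault flag are hypothetical names, and returning zero on a refused read is a choice made for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NARGS 5

/* Hypothetical per-firing state. */
typedef struct probe_ctx {
    bool restricted;      /* true: enabling runs with limited privilege */
    bool fault;           /* set when a forbidden access is attempted */
    uint64_t args[NARGS]; /* probe arguments as the provider supplied them */
} probe_ctx_t;

/*
 * Read probe argument ndx. In restricted operation the probe still
 * fires, but the argument is never exposed: the read yields 0 and the
 * fault flag records that the access was refused.
 */
uint64_t
read_arg(probe_ctx_t *ctx, int ndx)
{
    assert(ctx != NULL && ndx >= 0 && ndx < NARGS);
    if (ctx->restricted) {
        ctx->fault = true;
        return (0);
    }
    return (ctx->args[ndx]);
}
```

The point of the split is that the probe remains usable as a trigger (for counting, timestamps or stack traces) even when its arguments would leak state from outside the zone.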

  5. DTrace and zones in SmartOS
    •Other (very) annoying issues still lurked:
    •Inability to read “cpu” in the non-global zone
    •Inability to read any fields from “curlwpsinfo”
    and “curpsinfo”, especially “pr_dmodel”
    •Inability to read the “fds[]” array
    •Failure mode highly obnoxious:
    [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
    dtrace: description 'BEGIN' matched 1 probe
    dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

  6. Divide and conquer
    •curlwpsinfo and curpsinfo are translators over the
    current thread (“kthread_t”) and the current
    process (“proc_t”), respectively
    •Importantly, the state contained in one's own
    kthread_t and proc_t:
    •Is safe to read while executing (threads cannot
    disappear out from under themselves)
    •Does not represent potential privilege escalation
    •This can be fixed by simply allowing the loads where
    one has privileges to the current process!
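
The fix amounts to a range check on each kernel load: permit it if the bytes being read fall entirely inside the current thread's or current process's own structure. The following is a minimal user-space sketch of such a check. `toy_thread_t`, `toy_proc_t`, `in_range` and `can_load` are stand-ins invented for this example, not the kernel's types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's proc_t and kthread_t. */
typedef struct toy_proc { int p_model; char p_psargs[80]; } toy_proc_t;
typedef struct toy_thread { int t_id; toy_proc_t *t_procp; } toy_thread_t;

/* True if [addr, addr + sz) lies entirely within [base, base + len). */
static bool
in_range(uintptr_t addr, size_t sz, uintptr_t base, size_t len)
{
    return (sz <= len && addr >= base && addr - base <= len - sz);
}

/*
 * Permit a load only when it stays inside the current thread structure
 * or inside the structure of the process that thread belongs to.
 */
bool
can_load(const toy_thread_t *curthread, const void *addr, size_t sz)
{
    assert(curthread != NULL && curthread->t_procp != NULL);
    uintptr_t a = (uintptr_t)addr;

    if (in_range(a, sz, (uintptr_t)curthread, sizeof (*curthread)))
        return (true);
    return (in_range(a, sz, (uintptr_t)curthread->t_procp,
        sizeof (toy_proc_t)));
}
```

A load that starts inside the structure but runs past its end is refused, as is any load from another process's structure.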

  7. fds[]: A magic bullet?
    •Somehow, I convinced myself that the problem with
    fds[] was the translator that translates the member
    accesses into kernel accesses:
    inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
        fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
        curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);
    •If the problem was the static translators, the solution
    must be dynamic translators — a(n in)famously
    unimplemented feature of DTrace!
    •After dtrace.conf(12), I realized that the expression
    was orthogonal to the fact that the in-kernel
    implementation must not allow privilege escalation

  8. fds[]: No magic bullets
    •Focusing on the implementation allows one to
    consider the specifics of the fds[] case
    •Helped by the fact that the fi_list implementation
    uses memory retiring for scalability of file descriptor
    lookups: the array is only freed upon process exit
    •Assures that one's own fi_list is always pointing
    to memory that is (or was) an array of uf_entry_t
    •Leaves the file_t itself, which can be freed during
    probe context (specifically, by another thread in the
    same process)
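
Memory retiring can be sketched as a table that, when it grows, parks the old array on a list instead of freeing it, so a reader holding a stale pointer still sees memory laid out as an array of entries. This is a toy user-space model: `toy_finfo_t`, `finfo_grow` and `finfo_destroy` are names invented here, and the real kernel structures and locking are not represented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct toy_entry { void *uf_file; } toy_entry_t;

/* An array that has been replaced but is kept allocated. */
typedef struct retired {
    toy_entry_t *r_list;
    struct retired *r_next;
} retired_t;

typedef struct toy_finfo {
    toy_entry_t *fi_list;   /* current array */
    int fi_nfiles;          /* number of entries in fi_list */
    retired_t *fi_retired;  /* old arrays, released only at exit */
} toy_finfo_t;

/* Grow the table; the old array is retired, not freed. 0 on success. */
int
finfo_grow(toy_finfo_t *fi, int nfiles)
{
    assert(fi != NULL && nfiles > fi->fi_nfiles);
    toy_entry_t *list = calloc((size_t)nfiles, sizeof (*list));
    retired_t *r = malloc(sizeof (*r));

    if (list == NULL || r == NULL) {
        free(list);
        free(r);
        return (-1);
    }
    if (fi->fi_nfiles > 0)
        memcpy(list, fi->fi_list, (size_t)fi->fi_nfiles * sizeof (*list));
    r->r_list = fi->fi_list;    /* may be NULL on the first grow */
    r->r_next = fi->fi_retired;
    fi->fi_retired = r;
    fi->fi_list = list;
    fi->fi_nfiles = nfiles;
    return (0);
}

/* Models process exit: the only point at which any array is freed. */
void
finfo_destroy(toy_finfo_t *fi)
{
    retired_t *r, *next;

    for (r = fi->fi_retired; r != NULL; r = next) {
        next = r->r_next;
        free(r->r_list);
        free(r);
    }
    free(fi->fi_list);
    memset(fi, 0, sizeof (*fi));
}
```

The trade is memory for safety: a lookup racing with a grow may read a stale entry, but never unmapped or reused memory.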

  9. Dealing with file_t
    •We can deal with this by forcing everyone out of
    probe context after a file_t has been removed
    from the uf_entry_t, but before being freed
    •This is done by issuing a dtrace_sync() — a
    synchronous (empty) cross-call to all CPUs
    •This is expensive, and required answering an
    important question: just how hot is the closef()
    path, anyway?
    •By instrumenting our production cloud,
    we could answer this concisely: closef() is pretty
    damned hot (> 5,000/second on some machines!)
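
The ordering described above (unpublish, wait for probe context to drain, then free) can be written out as a short sketch. Here `toy_sync()` is only a counting stub standing in for dtrace_sync(); it performs no cross-call. All names in the block are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_file { int f_count; } toy_file_t;
typedef struct toy_slot { toy_file_t *uf_file; } toy_slot_t;

/* Number of barriers issued, so the cost of the close path is visible. */
unsigned long sync_calls;

/*
 * Stand-in for dtrace_sync(). In the kernel the barrier returns only
 * after every CPU has left probe context; this stub merely counts.
 */
void
toy_sync(void)
{
    sync_calls++;
}

/* Close a slot in three ordered steps. */
void
toy_closef(toy_slot_t *slot)
{
    assert(slot != NULL);
    toy_file_t *fp = slot->uf_file;

    if (fp == NULL)
        return;
    slot->uf_file = NULL;  /* 1. later probe firings cannot find fp */
    toy_sync();            /* 2. firings that already loaded fp drain */
    free(fp);              /* 3. no probe context can still hold fp */
}
```

Every close pays for one barrier in this form, which is why the rate of closef() calls matters so much.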

  10. Adding getf()
    •To track when fds[] was active in the non-global
    zone, we added a getf() subroutine (ht: ken)
    •Allows us to issue the sync only when we have a
    closef() from a non-global zone using fds[]
    •Had to take the final step of cleaning up the path
    output to strip off the zone path from the file name
    (as a cleanliness issue, not a security issue)
    •De-mo, de-mo, de-mo!
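
Two of the ideas on this slide are small enough to sketch directly: remembering that a zone has used getf() so that only its closes pay for the sync, and trimming the zone root from a path before showing it. The per-zone flag, the function names and the example root path below are all assumptions made for this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-zone state. */
typedef struct toy_zone {
    const char *zone_rootpath;  /* e.g. "/zones/web01/root" (made up) */
    bool zone_used_getf;        /* set once getf() is used for this zone */
} toy_zone_t;

/* Called when getf() is used: remember that this zone reads file_t. */
void
note_getf(toy_zone_t *z)
{
    assert(z != NULL);
    z->zone_used_getf = true;
}

/* Called on the close path: is the expensive sync needed at all? */
bool
close_needs_sync(const toy_zone_t *z)
{
    assert(z != NULL);
    return (z->zone_used_getf);
}

/* Show a path as the zone sees it by dropping the zone root prefix. */
const char *
zone_relative_path(const toy_zone_t *z, const char *path)
{
    assert(z != NULL && z->zone_rootpath != NULL && path != NULL);
    size_t len = strlen(z->zone_rootpath);

    if (strncmp(path, z->zone_rootpath, len) != 0)
        return (path);          /* not under the zone root */
    if (path[len] == '\0')
        return ("/");           /* the zone root itself */
    /* Require a component boundary so "/root" does not match "/rootkit". */
    return (path[len] == '/' ? path + len : path);
}
```

With the flag clear, the close path skips the barrier entirely, so zones that never touch fds[] see no added cost.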

  11. sched and proc providers
    •With fds[] done, focus turned to the only meaningful
    impediment to DTrace in the non-global zone:
    enabling the sched and proc providers
    •Recall the restricted operation introduced for the
    profile provider in the non-global zone...
    •Used this to have limited (non-global) DTrace
    privileges imply restricted operation for some SDT
    providers
    •Thanks to the curlwpsinfo/curpsinfo work,
    these providers can be meaningfully used without
    access to arguments

  13. Thank you.
    FOR MORE INFORMATION VISIT
    www.joyent.com
    OR
    www.smartos.org
