DTrace in the Non-global Zone

My presentation at the BayLISA SmartOS meetup on August 16th, 2012. The talk was recorded, but the video has since been made unavailable. In case it helps anyone recover it, the video was here: https://www.youtube.com/watch?v=atyvcYbY6Ic

Bryan Cantrill

August 17, 2022

Transcript

  1. DTrace in the Non-global Zone

    Bryan Cantrill, SVP Engineering, Joyent
    @bcantrill bryan@joyent.com
  2. DTrace and zones: Fraternal twins

    • DTrace and zones were developed in parallel during development of Solaris 10
    • DTrace integrated (September 2003) before zones (early 2004)
    • When zones integrated, the priority was making DTrace in the global zone able to meaningfully instrument non-global zones
    • DTrace in the non-global zone was hard, and a lower priority than other work on both technologies
  3. DTrace and zones: Basic functionality

    • In 2006, Dan Price (with help from Adam Leventhal and Jonathan Adams) added initial support for DTrace in the non-global zone
    • Allowed use of the syscall provider, the pid provider and (in a deranged, broken way) the profile provider
    • This was significant work: it required modifications to both the zones privilege model and the DTrace privilege model
    • For example, it required an implicit predicate on syscall and profile probes
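
The implicit predicate mentioned above can be pictured as a check that runs before any user-supplied predicate and lets a probe fire only for activity in the consumer's own zone. What follows is a minimal user-space sketch of that idea, not illumos code: `enabling_t`, `firing_ctx_t`, `GLOBAL_ZONE` and `implicit_zone_predicate()` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the illumos definitions. */
typedef int zone_id;
#define GLOBAL_ZONE 0

typedef struct {
	zone_id owner_zone;	/* zone of the consumer that enabled the probe */
} enabling_t;

typedef struct {
	zone_id zone;		/* zone of the process that hit the probe */
} firing_ctx_t;

/*
 * Implicit predicate: evaluated before any predicate the user wrote.
 * A consumer in the global zone sees every zone; a consumer in a
 * non-global zone sees only activity in its own zone.
 */
bool
implicit_zone_predicate(const enabling_t *e, const firing_ctx_t *ctx)
{
	assert(e != NULL && ctx != NULL);

	if (e->owner_zone == GLOBAL_ZONE)
		return (true);

	return (e->owner_zone == ctx->zone);
}
```

In this model, a syscall made in zone 5 never reaches the actions of an enabling created in zone 3, even though both share one kernel and one set of probes.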
  4. DTrace and zones in SmartOS

    • As the world's heaviest user of zones, we at Joyent ran into (and fixed) a number of annoying bugs:
    • USDT probes from the non-global zone were not properly being enabled in the global zone (illumos#908)
    • Tick and profile probes did not properly fire when used in the non-global zone (illumos#1456)
    • Fixing the latter required an extension of the DTrace privilege model: introduced a notion of restricted operation in which args could not be referenced
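
One way to picture restricted operation is as a probe that still fires but whose arguments are off limits, so that any attempt to reference them raises a fault instead of returning data. The sketch below is a simplified user-space model under that assumption; `ACCESS_ARGS`, `probe_state_t` and `read_arg()` are hypothetical names, not the DTrace implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access flags for one probe firing. */
enum {
	ACCESS_KERNEL = 1 << 0,	/* may load arbitrary kernel memory */
	ACCESS_ARGS = 1 << 1	/* may reference the probe's arguments */
};

typedef struct {
	unsigned access;	/* what this firing is allowed to touch */
	bool fault;		/* set when an action oversteps */
} probe_state_t;

/*
 * Read probe argument n. Under restricted operation (ACCESS_ARGS not
 * set) the read is refused: a fault is recorded and 0 is returned.
 */
uint64_t
read_arg(probe_state_t *st, const uint64_t *args, int n)
{
	assert(st != NULL && args != NULL && n >= 0);

	if (!(st->access & ACCESS_ARGS)) {
		st->fault = true;
		return (0);
	}

	return (args[n]);
}
```

The point of the design is that the probe site itself stays usable (for counting or timestamps, say) while the data most likely to leak across zones is withheld.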
  5. DTrace and zones in SmartOS

    • Other (very) annoying issues still lurked:
    • Inability to read “cpu” in the non-global zone
    • Inability to read any fields from “curlwpsinfo” and “curpsinfo”, especially “pr_dmodel”
    • Inability to read the “fds[]” array
    • Failure mode highly obnoxious:

        [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
        dtrace: description 'BEGIN' matched 1 probe
        dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44
  6. Divide and conquer

    • curlwpsinfo and curpsinfo are both translators, over the current thread (“kthread_t”) and the current process (“proc_t”) respectively
    • Importantly, the state contained in one's own kthread_t and proc_t:
    • Is safe to read while executing (threads cannot disappear out from under themselves)
    • Does not represent potential privilege escalation
    • This can be fixed by simply allowing the loads where one has privileges to the current process!
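
The fix described above amounts to widening the "may this load proceed?" check: a consumer without kernel-read privilege may still load from the memory backing the current thread and current process, provided it has privileges to that process. Here is a minimal user-space sketch of such a check. All names (`region_t`, `load_ctx_t`, `can_load()`) are hypothetical, and the real kernel check covers more cases than this model does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
	uintptr_t base;
	size_t len;
} region_t;

/* True if [addr, addr + sz) lies inside r; written to avoid overflow. */
static bool
in_region(uintptr_t addr, size_t sz, region_t r)
{
	return (sz <= r.len && addr >= r.base &&
	    addr - r.base <= r.len - sz);
}

typedef struct {
	bool can_read_kernel;	/* full kernel-read privilege */
	bool owns_current_proc;	/* has privileges to the current process */
	region_t cur_thread;	/* memory backing the current kthread_t */
	region_t cur_proc;	/* memory backing the current proc_t */
} load_ctx_t;

bool
can_load(const load_ctx_t *c, uintptr_t addr, size_t sz)
{
	assert(c != NULL);

	if (c->can_read_kernel)
		return (true);

	if (!c->owns_current_proc)
		return (false);

	/* One's own thread and process state is safe to expose. */
	return (in_region(addr, sz, c->cur_thread) ||
	    in_region(addr, sz, c->cur_proc));
}
```

A load that starts inside the current proc_t but runs past its end is refused, which is why the check takes the size of the load and not just its address.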
  7. fds[]: A magic bullet?

    • Somehow, I convinced myself that the problem with fds[] was the translator that translates the member accesses into kernel accesses:

        inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
            fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
            curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

    • If the problem was the static translators, the solution must be dynamic translators, a(n in)famously unimplemented feature of DTrace!
    • After dtrace.conf(12), I realized that the expression was orthogonal to the fact that the in-kernel implementation must not allow privilege escalation
  8. fds[]: No magic bullets

    • Focusing on the implementation allows one to consider the specifics of the fds[] case
    • Helped by the fact that the fi_list implementation uses memory retiring for scalability of file descriptor lookups: the array is only freed upon process exit
    • Assures that one's own fi_list is always pointing to memory that is (or was) an array of uf_entry_t
    • Leaves the file_t itself, which can be freed during probe context (specifically, by another thread in the same process)
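
Memory retiring, as used here, means that when the descriptor array grows, the old array is put aside and not freed, so a reader holding a stale pointer still dereferences valid memory. The sketch below models that in user space. The field names echo those on the slides, but `finfo_t`, `retired_t`, `finfo_grow()` and `finfo_exit()` are simplified, hypothetical stand-ins, and the locking that real code needs is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct file file_t;	/* opaque in this model */

typedef struct {
	file_t *uf_file;
} uf_entry_t;

typedef struct retired {
	uf_entry_t *list;	/* an array that has been replaced */
	struct retired *next;
} retired_t;

typedef struct {
	uf_entry_t *fi_list;	/* current array */
	int fi_nfiles;		/* its length */
	retired_t *fi_retired;	/* replaced arrays, kept until exit */
} finfo_t;

/* Grow the array; the old one is retired, never freed here. */
int
finfo_grow(finfo_t *fi, int nfiles)
{
	assert(fi != NULL && nfiles > fi->fi_nfiles);

	uf_entry_t *nl = calloc((size_t)nfiles, sizeof (*nl));
	retired_t *r = malloc(sizeof (*r));

	if (nl == NULL || r == NULL) {
		free(nl);
		free(r);
		return (-1);
	}

	if (fi->fi_nfiles > 0)
		memcpy(nl, fi->fi_list, (size_t)fi->fi_nfiles * sizeof (*nl));

	r->list = fi->fi_list;
	r->next = fi->fi_retired;
	fi->fi_retired = r;
	fi->fi_list = nl;
	fi->fi_nfiles = nfiles;
	return (0);
}

/* Only at process exit are the retired arrays actually freed. */
void
finfo_exit(finfo_t *fi)
{
	assert(fi != NULL);

	while (fi->fi_retired != NULL) {
		retired_t *r = fi->fi_retired;

		fi->fi_retired = r->next;
		free(r->list);
		free(r);
	}

	free(fi->fi_list);
	fi->fi_list = NULL;
	fi->fi_nfiles = 0;
}
```

Because a retired array stays allocated, a pointer read from it is at worst out of date, never a pointer into freed memory. That is what leaves the file_t itself as the only object whose lifetime still needs handling.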
  9. Dealing with file_t

    • We can deal with this by forcing everyone out of probe context after a file_t has been removed from the uf_entry_t, but before it is freed
    • This is done by issuing a dtrace_sync(): a synchronous (empty) cross-call to all CPUs
    • This is expensive, and required answering an important question: just how hot is the closef() path, anyway?
    • By instrumenting our guinea pigs production cloud, we could answer this concisely: closef() is pretty damned hot (> 5,000/second on some machines!)
  10. Adding getf()

    • To track when fds[] was active in the non-global zone, we added a getf() subroutine (ht: ken)
    • Allows us to issue the sync only when we have a closef() from a non-global zone using fds[]
    • Had to take the final step of cleaning up the path output to strip off the zone path from the file name (as a cleanliness issue, not a security issue)
    • De-mo, de-mo, de-mo!
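
The ordering on the close path (unlink the file_t, drain probe context, then free) and the getf()-based gating can be sketched together. The slides do not spell out how the tracking is done, so this user-space model assumes a per-zone count of enablings that use getf(). Every name here (`zone_model_t`, `model_sync()`, `model_getf()`, `model_closef()`) is hypothetical, and `model_sync()` only counts calls where the real code would issue a cross-call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
	int f_placeholder;	/* stand-in for real file state */
} file_t;

typedef struct {
	file_t *uf_file;
} uf_entry_t;

typedef struct {
	int getf_enablings;	/* assumed: enablings in this zone using getf() */
} zone_model_t;

typedef struct {
	zone_model_t *zone;
	uf_entry_t *list;
	int nfiles;
} proc_model_t;

static int sync_count;		/* how many modelled syncs were issued */

/* Stand-in for the expensive cross-call to all CPUs. */
static void
model_sync(void)
{
	sync_count++;
}

/* Called when an enabling that uses getf() is created in zone z. */
void
enabling_uses_getf(zone_model_t *z)
{
	assert(z != NULL);
	z->getf_enablings++;
}

/* Probe-context lookup: bounds-checked, returns NULL if not open. */
file_t *
model_getf(const proc_model_t *p, int fd)
{
	assert(p != NULL);

	if (fd < 0 || fd >= p->nfiles)
		return (NULL);

	return (p->list[fd].uf_file);
}

int
model_closef(proc_model_t *p, int fd)
{
	assert(p != NULL && p->zone != NULL);

	if (fd < 0 || fd >= p->nfiles || p->list[fd].uf_file == NULL)
		return (-1);

	file_t *fp = p->list[fd].uf_file;

	p->list[fd].uf_file = NULL;		/* 1. unlink: no new lookups */

	if (p->zone->getf_enablings > 0)
		model_sync();			/* 2. drain probe context */

	free(fp);				/* 3. now safe to free */
	return (0);
}
```

With no getf() enablings in the zone, the hot close path pays nothing beyond one integer comparison; the sync is issued only where a stale file_t could actually be observed.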
  11. sched and proc providers

    • With fds[] done, focus turned to the only meaningful impediment to DTrace in the non-global zone: enabling the sched and proc providers
    • Recall the restricted operation introduced for the profile provider in the non-global zone...
    • Used this to have limited (non-global) DTrace privileges imply restricted operation for some SDT providers
    • Thanks to the curlwpsinfo/curpsinfo work, these providers can be meaningfully used without access to arguments
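
The decision described above can be thought of as a per-provider policy: when the consumer has only limited (non-global) privileges, a provider either opts into restricted operation or its probes do not fire for that consumer at all. This sketch models that choice. The enum names and `probe_outcome()` are hypothetical, and the "drop" alternative is an assumption about providers that do not opt in.

```c
#include <assert.h>

typedef enum { PRIV_NONE, PRIV_LIMITED, PRIV_FULL } priv_t;

/* What a provider asks for when the consumer has limited privileges. */
typedef enum { ON_LIMITED_DROP, ON_LIMITED_RESTRICT } limited_policy_t;

typedef enum { FIRE_DROP, FIRE_RESTRICTED, FIRE_FULL } outcome_t;

outcome_t
probe_outcome(priv_t priv, limited_policy_t policy)
{
	switch (priv) {
	case PRIV_FULL:
		return (FIRE_FULL);	/* arguments available */
	case PRIV_LIMITED:
		/* Fire without arguments only if the provider opted in. */
		return (policy == ON_LIMITED_RESTRICT ?
		    FIRE_RESTRICTED : FIRE_DROP);
	default:
		assert(priv == PRIV_NONE);
		return (FIRE_DROP);
	}
}
```

A provider such as sched or proc would correspond to `ON_LIMITED_RESTRICT` in this model: its probes fire in the non-global zone, and scripts rely on curlwpsinfo and curpsinfo in place of the withheld arguments.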
  12. None
  13. Thank you. FOR MORE INFORMATION VISIT www.joyent.com OR www.smartos.org