Slide 1

DTrace in the Non-global Zone
Bryan Cantrill
SVP Engineering, Joyent
@bcantrill
[email protected]

Slide 2

DTrace and zones: Fraternal twins
• DTrace and zones were developed in parallel during development of Solaris 10
• DTrace integrated (September 2003) before zones (early 2004)
• When zones integrated, the priority was making DTrace in the global zone be able to meaningfully instrument non-global zones
• DTrace in the non-global zone was hard, and a lower priority than other work on both technologies

Slide 3

DTrace and zones: Basic functionality
• In 2006, Dan Price (with help from Adam Leventhal and Jonathan Adams) added initial support for DTrace in the non-global zone
• Allowed use of syscall provider, pid provider and (in a deranged, broken way) the profile provider
• This was significant work: required modifications to both the zones privilege model and the DTrace privilege model
• For example, required an implicit predicate on syscall and profile probes
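The implicit predicate can be pictured as a zone check applied before a syscall or profile probe is allowed to fire on behalf of a non-global consumer. The following is a minimal sketch in C, not the actual illumos code; `consumer_t`, `thread_ctx_t`, `GLOBAL_ZONE_ID` and `probe_may_fire()` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical zone id for the global zone in this sketch. */
#define GLOBAL_ZONE_ID 0

/* Hypothetical: records the zone the DTrace consumer runs in. */
typedef struct consumer {
    int c_zoneid;
} consumer_t;

/* Hypothetical: records the zone of the thread that hit the probe. */
typedef struct thread_ctx {
    int t_zoneid;
} thread_ctx_t;

/*
 * Implicit predicate: a consumer in the global zone sees every firing;
 * a consumer in a non-global zone only sees firings from its own zone.
 */
static bool
probe_may_fire(const consumer_t *c, const thread_ctx_t *t)
{
    if (c->c_zoneid == GLOBAL_ZONE_ID)
        return (true);
    return (c->c_zoneid == t->t_zoneid);
}
```

The check is implicit in the sense that the user never writes it: under this model it is evaluated ahead of whatever predicate the D program supplies.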

Slide 4

DTrace and zones in SmartOS
• As the world's heaviest user of zones, we at Joyent ran into (and fixed) a number of annoying bugs:
  • USDT probes from the non-global zone were not properly being enabled in the global zone (illumos#908)
  • Tick and profile probes did not properly fire when used in the non-global zone (illumos#1456)
• Fixing the latter required an extension of the DTrace privilege model: introduced a notion of restricted operation in which args could not be referenced
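Restricted operation can be sketched as a flag on the probe firing that turns any reference to a probe argument into a fault rather than a value. This is an illustrative C model under stated assumptions: `probe_state_t`, its fields, `NARGS` and `probe_arg()` are hypothetical and do not mirror the real DTrace data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of probe arguments tracked by this model. */
#define NARGS 5

/* Hypothetical per-firing state. */
typedef struct probe_state {
    bool     ps_restricted;   /* enabling runs in restricted operation */
    bool     ps_fault;        /* set when an access is denied */
    uint64_t ps_args[NARGS];  /* probe arguments (arg0, arg1, ...) */
} probe_state_t;

/*
 * Fetch argument n. In restricted operation (or for an out-of-range
 * index) the access is denied: a fault is flagged and zero is returned,
 * so the argument value is never exposed to the consumer.
 */
static uint64_t
probe_arg(probe_state_t *ps, int n)
{
    if (n < 0 || n >= NARGS || ps->ps_restricted) {
        ps->ps_fault = true;
        return (0);
    }
    return (ps->ps_args[n]);
}
```

In this model the probe still fires for the restricted consumer; only the argument loads are refused.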

Slide 5

DTrace and zones in SmartOS
• Other (very) annoying issues still lurked:
  • Inability to read “cpu” in the non-global zone
  • Inability to read any fields from “curlwpsinfo” and “curpsinfo”, especially “pr_dmodel”
  • Inability to read the “fds[]” array
• Failure mode highly obnoxious:

  [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
  dtrace: description 'BEGIN' matched 1 probe
  dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

Slide 6

Divide and conquer
• curlwpsinfo and curpsinfo are translators over the current thread (“kthread_t”) and the current process (“proc_t”), respectively
• Importantly, the state contained in one's own kthread_t and proc_t:
  • Is safe to read while executing (threads cannot disappear out from under themselves)
  • Does not represent potential privilege escalation
• This can be fixed by simply allowing the loads where one has privileges to the current process!
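The fix described above amounts to a range check on each load: if the consumer has privileges to the current process and the address falls entirely inside the current thread or process structure, the load is permitted. A minimal sketch in C, assuming simplified stand-in structures; `thread_model_t`, `proc_model_t`, `in_range()` and `can_load()` are hypothetical names, not the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for proc_t and kthread_t. */
typedef struct proc_model {
    char p_data[256];
} proc_model_t;

typedef struct thread_model {
    proc_model_t *t_procp;
    char          t_data[128];
} thread_model_t;

/* True if [addr, addr + sz) lies entirely within [base, base + len). */
static bool
in_range(uintptr_t addr, size_t sz, uintptr_t base, size_t len)
{
    return (addr >= base && sz <= len && addr - base <= len - sz);
}

/*
 * Permit a load only when the consumer has privileges to the current
 * process and the load targets the current thread or process itself.
 */
static bool
can_load(uintptr_t addr, size_t sz, const thread_model_t *cur,
    bool has_proc_priv)
{
    if (!has_proc_priv)
        return (false);
    if (in_range(addr, sz, (uintptr_t)cur, sizeof (*cur)))
        return (true);
    return (in_range(addr, sz, (uintptr_t)cur->t_procp,
        sizeof (*cur->t_procp)));
}
```

The subtraction-based comparison in `in_range()` avoids overflow when `addr + sz` would wrap, which matters because the address comes from an untrusted D program.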

Slide 7

fds[]: A magic bullet?
• Somehow, I convinced myself that the problem with fds[] was the translator that translates the member accesses into kernel accesses:

  inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
      fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
      curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

• If the problem was the static translators, the solution must be dynamic translators, a(n in)famously unimplemented feature of DTrace!
• After dtrace.conf(12), I realized that the expression was orthogonal to the fact that the in-kernel implementation must not allow privilege escalation

Slide 8

fds[]: No magic bullets
• Focusing on the implementation allows one to consider the specifics of the fds[] case
• Helped by the fact that the fi_list implementation uses memory retiring for scalability of file descriptor lookups: the array is only freed upon process exit
• Assures that one's own fi_list is always pointing to memory that is (or was) an array of uf_entry_t
• Leaves the file_t itself, which can be freed during probe context (specifically, by another thread in the same process)
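Memory retiring can be illustrated with a small model: when the descriptor array grows, the old array is parked on a retired list instead of being freed, so a reader holding a stale pointer still sees memory that was an array of entries. This is a sketch in C, not the illumos implementation; `entry_t`, `retired_t`, `fi_retired`, `finfo_grow()` and `finfo_exit()` are hypothetical, while `fi_list`, `fi_nfiles` and `uf_file` are borrowed from the translator shown earlier.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for uf_entry_t. */
typedef struct entry {
    void *uf_file;
} entry_t;

/* Hypothetical node on the list of retired arrays. */
typedef struct retired {
    entry_t        *r_list;
    struct retired *r_next;
} retired_t;

typedef struct finfo {
    entry_t   *fi_list;     /* current descriptor array */
    int        fi_nfiles;   /* entries in fi_list */
    retired_t *fi_retired;  /* old arrays, kept until exit */
} finfo_t;

/* Grow the array; the old array is retired, never freed here. */
static int
finfo_grow(finfo_t *fi, int nfiles)
{
    if (nfiles <= fi->fi_nfiles)
        return (0);
    entry_t *nlist = calloc((size_t)nfiles, sizeof (entry_t));
    if (nlist == NULL)
        return (-1);
    if (fi->fi_list != NULL) {
        retired_t *r = malloc(sizeof (*r));
        if (r == NULL) {
            free(nlist);
            return (-1);
        }
        memcpy(nlist, fi->fi_list,
            (size_t)fi->fi_nfiles * sizeof (entry_t));
        r->r_list = fi->fi_list;
        r->r_next = fi->fi_retired;
        fi->fi_retired = r;
    }
    fi->fi_list = nlist;
    fi->fi_nfiles = nfiles;
    return (0);
}

/* Process exit: the only point at which any array is freed. */
static void
finfo_exit(finfo_t *fi)
{
    while (fi->fi_retired != NULL) {
        retired_t *r = fi->fi_retired;
        fi->fi_retired = r->r_next;
        free(r->r_list);
        free(r);
    }
    free(fi->fi_list);
    fi->fi_list = NULL;
    fi->fi_nfiles = 0;
}
```

The trade-off in this model is memory held until exit in exchange for lookups that never need to worry about the array vanishing underneath them.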

Slide 9

Dealing with file_t
• We can deal with this by forcing everyone out of probe context after a file_t has been removed from the uf_entry_t, but before being freed
• This is done by issuing a dtrace_sync(): a synchronous (empty) cross-call to all CPUs
• This is expensive, and required answering an important question: just how hot is the closef() path, anyway?
• By instrumenting our guinea pigs (that is, our production cloud), we could answer this concisely: closef() is pretty damned hot (> 5,000/second on some machines!)
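The ordering described above (unpublish, synchronize, then free) can be sketched in a single-threaded C model. This is only an illustration of the sequence and its cost: `model_sync()` merely counts calls where the kernel would cross-call every CPU, and `slot_t`, `file_model_t` and `model_closef()` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for file_t and uf_entry_t. */
typedef struct file_model {
    int f_id;
} file_model_t;

typedef struct slot {
    file_model_t *uf_file;
} slot_t;

/* Number of synchronizations issued so far. */
static unsigned long sync_calls;

/*
 * Stand-in for dtrace_sync(). In the kernel this is a synchronous,
 * empty cross-call to all CPUs; here it only counts invocations.
 */
static void
model_sync(void)
{
    sync_calls++;
}

/* Close path: unpublish the file, synchronize, and only then free it. */
static void
model_closef(slot_t *slot)
{
    file_model_t *fp = slot->uf_file;

    if (fp == NULL)
        return;
    slot->uf_file = NULL;   /* 1. no new reader can find the file */
    model_sync();           /* 2. wait out anyone in probe context */
    free(fp);               /* 3. now safe to free */
}
```

In this model every close pays for one synchronization, which is why the frequency of closef() matters so much.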

Slide 10

Adding getf()
• To track when fds[] was active in the non-global zone, we added a getf() subroutine (ht: ken)
• Allows us to issue the sync only when we have a closef() from a non-global zone using fds[]
• Had to take the final step of cleaning up the path output to strip off the zone path from the file name (as a cleanliness issue, not a security issue)
• De-mo, de-mo, de-mo!
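Two of the ideas above lend themselves to a short sketch in C: gating the expensive sync on whether a zone has fds[] (via getf()) in use, and stripping the zone root from a reported path. Everything here is an assumption made for illustration: the per-zone flag array, `MAX_ZONES`, `model_getf_enable()`, `model_closef_hook()` and `strip_zone_root()` are hypothetical names, and the example zone root `/zones/z1/root` is made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ZONES 8       /* hypothetical limit for this model */
#define GLOBAL_ZONE_ID 0  /* hypothetical id of the global zone */

static bool zone_uses_getf[MAX_ZONES];
static unsigned long sync_calls;

/* Record that a consumer in this zone has enabled a use of getf(). */
static void
model_getf_enable(int zoneid)
{
    if (zoneid >= 0 && zoneid < MAX_ZONES)
        zone_uses_getf[zoneid] = true;
}

/* Called on close: sync only for non-global zones that use getf(). */
static void
model_closef_hook(int zoneid)
{
    if (zoneid <= GLOBAL_ZONE_ID || zoneid >= MAX_ZONES)
        return;
    if (zone_uses_getf[zoneid])
        sync_calls++;  /* stands in for the real cross-call */
}

/*
 * Return path with the zone root prefix removed. Paths outside the
 * zone root are returned unchanged; the root itself becomes "/".
 */
static const char *
strip_zone_root(const char *path, const char *zroot)
{
    size_t len = strlen(zroot);

    while (len > 0 && zroot[len - 1] == '/')
        len--;  /* ignore trailing slashes on the root */
    if (len == 0 || strncmp(path, zroot, len) != 0)
        return (path);
    if (path[len] == '\0')
        return ("/");
    if (path[len] != '/')
        return (path);  /* prefix matched only part of a component */
    return (path + len);
}
```

With the flag in place, zones that never touch fds[] pay nothing on close in this model, and the path a zone sees looks the same as it would from inside that zone.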

Slide 11

sched and proc providers
• With fds[] done, focus turned to the only remaining meaningful impediment to DTrace in the non-global zone: enabling the sched and proc providers
• Recall the restricted operation introduced for the profile provider in the non-global zone...
• Used this to have limited (non-global) DTrace privileges imply restricted operation for some SDT providers
• Thanks to the curlwpsinfo/curpsinfo work, these providers can be meaningfully used without access to arguments
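The decision described above can be sketched as a small policy function: full privileges give full operation, while limited privileges give restricted operation for a listed set of providers. This is a hypothetical C model; `probe_mode_t`, `restricted_ok[]` and `enabling_mode()` are invented for illustration, and treating every unlisted provider as unavailable is an assumption of the sketch, not a statement about the real policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum probe_mode {
    MODE_NONE,        /* provider not available to this consumer */
    MODE_RESTRICTED,  /* probes fire, but args cannot be referenced */
    MODE_FULL         /* unrestricted operation */
} probe_mode_t;

/* Providers that this sketch allows in restricted operation. */
static const char *restricted_ok[] = { "sched", "proc" };

/* Choose the mode for an enabling of the named provider. */
static probe_mode_t
enabling_mode(const char *provider, bool full_privs)
{
    size_t i;

    if (full_privs)
        return (MODE_FULL);
    for (i = 0; i < sizeof (restricted_ok) / sizeof (restricted_ok[0]); i++) {
        if (strcmp(provider, restricted_ok[i]) == 0)
            return (MODE_RESTRICTED);
    }
    return (MODE_NONE);
}
```

Restricted operation is useful here because a script can still reach curlwpsinfo and curpsinfo for the current thread and process even though the probe arguments are off limits.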

Slide 12

No content

Slide 13

Thank you. FOR MORE INFORMATION VISIT www.joyent.com OR www.smartos.org