R&D Center Base System Development Department
Copyright 2019 Sony Corporation
Debug application inside Kubernetes using Linux Kernel tools
Kenta Tada
R&D Center
Sony Corporation
Slide 2
About me
⚫System Software Engineer, Sony
⚫OSS Contributor
• runC
• Docker
• containerd
and so on
Slide 3
Agenda
⚫Introduction to oci-ftrace-syscall-analyzer, our system call
analyzer for Kubernetes
⚫How to get process coredumps on Kubernetes
[Figure: Kubernetes and kernel tools. kubectl talks to the Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager). On the Node, user space holds the kubelet, the high-level runtime (containerd), the low-level runtime (runC), and the Pod (sample application); kernel space holds the kernel tools.]
Slide 7
Introduction to oci-ftrace-syscall-analyzer
Slide 8
Background
⚫We developed a lightweight, secure runC-based container
platform for embedded systems
⚫That platform needs to launch secure (restricted and
rootless) containers for third parties
⚫We also developed an ftrace-based system call analyzer to
help generate secure configs
⚫Currently, we are porting those tools to our Kubernetes
environments
Slide 9
Existing debug methods on Kubernetes
⚫Install debug tools
⚫Create a debug image
⚫Prepare a debug sidecar
Slide 10
Use kernel tools to trace applications transparently
⚫Existing methods are very useful, but they need additional
packages and additional capabilities
⚫On the other hand, sometimes we just want to investigate
system calls to work out:
• Needed capabilities
• Correct file permissions
• seccomp settings for security
⚫Let’s use kernel tools to trace applications transparently
Slide 11
Kernel technologies for tracing
http://mmi.hatenablog.com/entry/2018/03/04/052249
Slide 12
Kernel technologies our syscall analyzer used
[Figure from http://mmi.hatenablog.com/entry/2018/03/04/052249, annotated to mark the technologies used by the syscall analyzer]
Slide 13
Kernel technologies our syscall analyzer used
⚫ftrace
• Tracing framework for the Linux kernel
• ftrace can collect many kinds of information, although it is usually
thought of as a function tracer
• Easy to set up (just write settings to tracefs)
– No eBPF compiler needed (no LLVM)
⚫Tracepoints
• Static trace points inside the kernel
Slide 14
What is needed to integrate
1. Divide the ftrace ring buffer per container using ftrace instances
• https://speakerdeck.com/kentatada/container-debug-using-ftrace
2. Set up ftrace during container startup (today’s topic)
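A minimal sketch of step 1, assuming tracefs is mounted at /sys/kernel/debug/tracing as on the later slides. Using the container ID as the instance name is an assumption made for illustration, not necessarily what oci-ftrace-syscall-analyzer does. Creating a directory under instances/ is the standard way to obtain a separate ftrace ring buffer.

```python
import os

# Assumed tracefs mount point (the path used elsewhere in these slides).
TRACEFS = "/sys/kernel/debug/tracing"


def instance_dir(container_id, tracefs=TRACEFS):
    """Return the path of the ftrace instance for one container."""
    return os.path.join(tracefs, "instances", container_id)


def create_instance(container_id, tracefs=TRACEFS):
    """Create a per-container ftrace instance and return its path.

    On a real tracefs, creating a directory under instances/ makes the
    kernel populate it with its own control files and ring buffer.
    Needs root on a real system.
    """
    path = instance_dir(container_id, tracefs)
    os.makedirs(path, exist_ok=True)
    return path
```

Each container then writes its settings and reads its trace under its own instance directory instead of the shared top-level buffer.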
Slide 15
Set up ftrace inside container startup
⚫How to insert the ftrace setting tool before container
startup
⚫How to get the PID of the container’s init process (PID 1
inside the container)
⚫What ftrace settings are needed to trace the container’s
processes
Slide 16
How to insert the ftrace setting tool before container startup
⚫Container lifecycle and related hooks
⚫Our ftrace-based tracer should run at prestart because we
want to trace from process start, as strace does
[Figure: container lifecycle timeline. prestart hook (set up ftrace), process start, poststart hook, process lifetime, process stop, poststop hook (collect logs).]
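The hook placement above can be sketched as the "hooks" section of an OCI config.json. The hook binary path and its "setup" / "collect" arguments are hypothetical names chosen for illustration; only the prestart and poststop keys and the path/args fields come from the OCI runtime spec.

```python
import json

# Hypothetical hook binary, for illustration only.
HOOK_PATH = "/usr/local/bin/example-ftrace-hook"


def build_hooks(hook_path=HOOK_PATH):
    """Build the "hooks" section of an OCI config.json.

    prestart sets up ftrace before the user process starts;
    poststop collects the logs after the container has stopped.
    args follows execv conventions, so args[0] is the program name.
    """
    return {
        "prestart": [{"path": hook_path, "args": [hook_path, "setup"]}],
        "poststop": [{"path": hook_path, "args": [hook_path, "collect"]}],
    }


if __name__ == "__main__":
    # Print a fragment that could be merged into config.json.
    print(json.dumps({"hooks": build_hooks()}, indent=2))
```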
Slide 17
How to get the PID of the container’s init process
⚫Per the OCI runtime spec, the container state, which includes
the PID of the container’s initial process, must be passed to
hooks over stdin
• https://github.com/opencontainers/runtime-spec/blob/master/config.md
⚫So we read the PID of the container’s init process (PID 1
inside the container) from stdin
⚫This approach works with any low-level runtime that
complies with the OCI runtime spec
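A minimal sketch of a hook reading that state, assuming the runtime writes the OCI state JSON (with fields such as id, status and pid) to the hook's stdin. The function name is hypothetical.

```python
import json
import sys


def read_container_pid(stream=sys.stdin):
    """Parse the OCI state JSON passed to a hook and return the init PID.

    The PID is as seen from the host, which is what tracefs expects.
    """
    state = json.load(stream)
    pid = state.get("pid")
    if not isinstance(pid, int) or pid <= 0:
        raise ValueError("container state has no usable pid: %r" % (state,))
    return pid
```

A prestart hook would call read_container_pid() once and then use the result for the ftrace PID filter.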
Slide 18
What ftrace settings are needed to trace container’s processes
⚫Enable the system call events you want to trace
(e.g. under /sys/kernel/debug/tracing/events/syscalls)
⚫Trace only the specified PID
(e.g. # echo [PID] > /sys/kernel/debug/tracing/set_event_pid)
⚫Also trace processes forked by the PIDs listed in “set_event_pid”
(e.g. # echo 1 > /sys/kernel/debug/tracing/options/event-fork)
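The three settings above can be sketched as plain file writes. This is an illustration under assumptions: the helper names are hypothetical, the base directory defaults to the path on the slide (an ftrace instance directory could be passed instead), and the code enables every syscall event rather than a chosen subset.

```python
import os

# Assumed tracefs location, as written on the slide.
TRACEFS = "/sys/kernel/debug/tracing"


def write_setting(base, relpath, value):
    """Write one value to a tracefs control file, like `echo value > file`."""
    with open(os.path.join(base, relpath), "w") as f:
        f.write(str(value) + "\n")


def trace_syscalls_of(pid, base=TRACEFS):
    """Restrict syscall events to `pid` and the processes it forks.

    The filter is written before the events are enabled so that other
    processes are not recorded in between. Needs root on a real system.
    """
    write_setting(base, "set_event_pid", pid)          # trace only this PID
    write_setting(base, "options/event-fork", 1)       # follow forked children
    write_setting(base, "events/syscalls/enable", 1)   # enable all syscall events
```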
Slide 19
Let’s trace the ls command inside a container
Slide 20
[Figure: Kubernetes architecture. Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager) driven by kubectl; Node with kubelet, high-level runtime (containerd), low-level runtime (runC), Pod (sample application) in user space and kernel tools in the kernel. Caption: we were able to integrate runC with the ftrace-based syscall analyzer.]
Slide 21
How to set up prestart hook in Kubernetes
⚫Kubernetes Pod lifecycle and related hooks
⚫Kubernetes does not provide a prestart hook
• https://github.com/kubernetes/kubernetes/issues/140
⚫Next, we look for a prestart hook at the high-level runtime
layer
[Figure: Pod lifecycle timeline. process start, poststart hook, process lifetime, prestop hook, process stop.]
Slide 22
How to set up prestart hook in containerd
⚫ To begin with, CRI currently provides no way to specify hooks
in the container’s config.json
⚫ Each high-level runtime has its own implementation
⚫ containerd’s ongoing work is tracked here:
• https://github.com/containerd/cri/pull/1248
• https://github.com/containerd/cri/issues/405
Slide 23
How to set up prestart hook in CRI-O
⚫ CRI-O already provides its own solution, "oci-hooks"
• podman has the same feature
⚫ oci-hooks provides a way for users to configure the intended
hooks for Open Container Initiative containers so they will only
be executed for containers that need their functionality, and
then only for the stages where they're needed
https://github.com/containers/libpod/blob/master/pkg/hooks/docs/oci-hooks.5.md
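A minimal sketch of an oci-hooks configuration that runs a hook only for annotated containers, following the 1.0.0 schema in the document linked above. The hook path and the annotation key are hypothetical. As far as I understand that document, annotation keys and values are matched as regular expressions, which is why the example key is anchored.

```python
import json


def build_oci_hook_config(hook_path, annotation_key):
    """Build an oci-hooks (schema 1.0.0) config as a dict.

    The hook runs at the prestart and poststop stages, and only for
    containers whose annotations match `annotation_key` = "true".
    """
    return {
        "version": "1.0.0",
        "hook": {"path": hook_path},
        "when": {"annotations": {annotation_key: "true"}},
        "stages": ["prestart", "poststop"],
    }


if __name__ == "__main__":
    # Hypothetical path and annotation, for illustration only.
    cfg = build_oci_hook_config(
        "/usr/local/bin/example-ftrace-hook", "^example\\.io/ftrace$"
    )
    print(json.dumps(cfg, indent=2))
```

The resulting JSON file is what a CRI-O or podman hooks directory would hold; containers without the annotation start without the hook.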
oci-seccomp-bpf-hook
⚫oci-seccomp-bpf-hook generates seccomp profiles by tracing
the syscalls made by the container using eBPF
⚫perf is used to log syscalls
⚫This tool has a few limitations
• Needs CAP_SYS_ADMIN to run
• Compiles C code on the fly using LLVM
• Cannot use podman run --rm together with this feature
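To make "generating a seccomp profile from traced syscalls" concrete, here is a minimal sketch that turns a list of observed syscall names into an allow-list profile using the OCI runtime spec's seccomp fields. It is not oci-seccomp-bpf-hook's implementation; the function name and the choice of SCMP_ACT_ERRNO as the default action are assumptions.

```python
import json


def build_seccomp_profile(syscall_names):
    """Allow only the observed syscalls; everything else returns an error."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                # Deduplicate and sort so the profile is stable across runs.
                "names": sorted(set(syscall_names)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    # Example input; a real list would come from the trace log.
    print(json.dumps(build_seccomp_profile(["read", "openat", "read"]), indent=2))
```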
Slide 30
Kernel technologies oci-seccomp-bpf-hook used
[Figure from http://mmi.hatenablog.com/entry/2018/03/04/052249, annotated to mark the technologies used by oci-seccomp-bpf-hook]
Slide 31
When to use oci-ftrace-syscall-analyzer
⚫You don’t want to grant privileges to users on your
production system
⚫Your production kernel was not built with the eBPF
configuration options
⚫You don’t want LLVM on your production system
• Will GCC support a BPF backend?
– Compiling to BPF with GCC: https://lwn.net/Articles/800606/
Slide 32
Process coredump on Kubernetes
Slide 33
What is the problem?
⚫A process core dump is written to the location given by
/proc/sys/kernel/core_pattern
⚫But containers run in their own Linux namespaces, and the
kernel’s core dump handling does not support Linux namespaces
⚫So Kubernetes users cannot get their process core dumps
⚫Kubernetes issue
• https://github.com/kubernetes/kubernetes/issues/48787
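A small sketch for inspecting the setting in question. It relies on the behaviour documented in core(5): a pattern starting with "|" pipes the dump to a helper program, otherwise it is a file name template. The function names are hypothetical. Because the setting is shared by the whole node, a container sees the host's value.

```python
CORE_PATTERN_PATH = "/proc/sys/kernel/core_pattern"


def describe_core_pattern(pattern):
    """Classify a core_pattern value as ("pipe", program) or ("file", template)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        words = pattern[1:].split()
        return ("pipe", words[0] if words else "")
    return ("file", pattern)


def read_core_pattern(path=CORE_PATTERN_PATH):
    """Return the raw core_pattern string currently set on this kernel."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(describe_core_pattern(read_core_pattern()))
```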
Slide 34
Community approach 1: Modify kernel code
⚫Modify the core dump code in the kernel to support Linux
namespaces
⚫Patch
• https://lkml.org/lkml/2017/8/2/77
⚫Not merged
Slide 35
Community approach 2: Implement an add-on for Kubernetes
Not merged either
https://github.com/kubernetes/kubernetes/issues/48787
Slide 36
Wrap up
⚫oci-ftrace-syscall-analyzer integrates seamlessly with
Kubernetes via CRI-O
• https://github.com/KentaTada/oci-ftrace-syscall-analyzer
⚫We should consider how to get process core dumps on
Kubernetes
Slide 37
Challenges for oci-ftrace-syscall-analyzer
⚫Integrate our tool with containerd
⚫Implement a user-space logging facility based on our
internal container tools
⚫Use kprobes to hook system calls and inspect syscall arguments
⚫Implement a seccomp profile generator
⚫Remove the unnecessary syscall logs recorded between prestart
and runC’s actual exec
• oci-seccomp-bpf-hook uses prctl(2) as the starting point. Is that
actually standard?
Slide 38
We Are Hiring!!
https://www.sony.co.jp/SonyInfo/Jobs/careers/