Slide 1

Slide 1 text

Debug application inside Kubernetes using Linux Kernel tools
Kenta Tada, R&D Center, Sony Corporation
R&D Center Base System Development Department. Copyright 2019 Sony Corporation

Slide 2

Slide 2 text

About me
⚫ System Software Engineer, Sony
⚫ OSS Contributor
  • runC
  • Docker
  • containerd, and so on

Slide 3

Slide 3 text

Agenda
⚫ Introduction of oci-ftrace-syscall-analyzer, our system call analyzer for Kubernetes
⚫ How to get process core dumps on Kubernetes

Slide 4

Slide 4 text

Overview of Kubernetes

Slide 5

Slide 5 text

Kubernetes
[Figure: Kubernetes architecture. Components shown: kubectl; Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager); Node (kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Sample Application)).]

Slide 6

Slide 6 text

Kubernetes and kernel tools
[Figure: the same Kubernetes architecture split into user space and kernel space. User: kubectl; Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager); Node (kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Sample Application)). Kernel: kernel tools.]

Slide 7

Slide 7 text

Introduction of oci-ftrace-syscall-analyzer

Slide 8

Slide 8 text

Background
⚫ We developed a lightweight and secure runC-based container platform for embedded systems
⚫ That platform needs to launch secure (restricted and rootless) containers for third parties
⚫ We also developed an ftrace-based system call analyzer to generate secure configs
⚫ Currently, we are porting those tools to our Kubernetes environments

Slide 9

Slide 9 text

Existing debug methods on Kubernetes
⚫ Install debug tools
⚫ Create a debug image
⚫ Prepare a debug sidecar

Slide 10

Slide 10 text

Use kernel tools to trace applications transparently
⚫ Existing methods are very useful, but debugging with them requires additional packages and additional capabilities
⚫ On the other hand, sometimes we just want to investigate system calls, for example to find out:
  • Needed capabilities
  • Correct file permissions
  • seccomp settings for security
⚫ Let’s use kernel tools to trace applications transparently

Slide 11

Slide 11 text

Kernel technologies for tracing
[Figure: overview of kernel tracing technologies. Source: http://mmi.hatenablog.com/entry/2018/03/04/052249]

Slide 12

Slide 12 text

Kernel technologies used by our syscall analyzer
[Figure: the same overview of kernel tracing technologies, annotated with a "syscall analyzer" label. Source: http://mmi.hatenablog.com/entry/2018/03/04/052249]

Slide 13

Slide 13 text

Kernel technologies used by our syscall analyzer
⚫ ftrace
  • Tracing framework for the Linux kernel
  • ftrace can collect various kinds of information, although it is typically thought of as the function tracer
  • Easy to set up (just write settings to tracefs)
    – No eBPF compiler (no LLVM)
⚫ Tracepoints
  • Static trace points inside the kernel

Slide 14

Slide 14 text

What is needed to integrate
1. Divide the ftrace ring buffer by using one ftrace instance for each container
  • https://speakerdeck.com/kentatada/container-debug-using-ftrace
2. Set up ftrace during container startup (today’s topic)
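A minimal sketch of step 1, assuming tracefs is reachable at /sys/kernel/debug/tracing (the path used in the commands later in this deck). On a real tracefs, creating a directory under instances/ makes the kernel allocate a separate ring buffer with its own control files. The function name and the `tracefs` parameter are hypothetical; the parameter exists only so the sketch can be tried against a scratch directory without root.

```python
import os

TRACEFS = "/sys/kernel/debug/tracing"  # assumed mount point


def create_ftrace_instance(container_id, tracefs=TRACEFS):
    """Create (or reuse) a per-container ftrace instance and return its path."""
    instance_dir = os.path.join(tracefs, "instances", container_id)
    # On a real tracefs this mkdir is what creates the instance; the kernel
    # then populates it with its own trace, set_event_pid and events/ files.
    os.makedirs(instance_dir, exist_ok=True)
    return instance_dir
```

Each container's settings and logs would then live under the returned directory instead of the top-level tracing directory.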

Slide 15

Slide 15 text

Set up ftrace during container startup
⚫ How to insert the ftrace setting tool before container startup
⚫ How to get the PID of the container’s PID 1 process
⚫ What ftrace settings are needed to trace the container’s processes

Slide 16

Slide 16 text

How to insert the ftrace setting tool before container startup
⚫ Container lifecycle and related hooks
⚫ Our ftrace-based tracer should be executed at prestart because we want to trace from process start, as strace does
[Figure: process lifetime from process start to process stop, with the prestart, poststart and poststop hooks. "Setup ftrace" and "Collect logs" are marked on the timeline.]
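For a container started directly by runC, a prestart hook is registered in the `hooks.prestart` array of the bundle's config.json, where each entry has a `path` and optional `args`. The sketch below adds such an entry to a config held as a Python dict. The helper name is hypothetical, and the hook path is the one used in the CRI-O example later in this deck.

```python
import copy


def add_prestart_hook(config, hook_path, args=None):
    """Return a copy of an OCI config dict with one more prestart hook."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    entry = {"path": hook_path}
    if args:
        entry["args"] = list(args)
    # hooks.prestart is a list; append so existing hooks are preserved.
    updated.setdefault("hooks", {}).setdefault("prestart", []).append(entry)
    return updated
```

Serializing the result with `json.dump` and writing it back to config.json is left out, because how the bundle is produced depends on the runtime in use.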

Slide 17

Slide 17 text

How to get the PID of the container’s PID 1 process
⚫ According to the OCI runtime spec, the state of the container, which includes the container’s initial PID, must be passed to hooks over stdin
  • https://github.com/opencontainers/runtime-spec/blob/master/config.md
⚫ So, we get the PID of the container’s PID 1 process from stdin
⚫ This approach works with any low-level runtime that complies with the OCI runtime spec
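A minimal sketch of reading that state in a hook. The OCI state is a JSON object whose fields include `id` and `pid`; the function name and the error handling are illustrative choices, not taken from oci-ftrace-syscall-analyzer.

```python
import json


def read_container_pid(stream):
    """Parse the OCI state JSON given to a hook and return (id, pid)."""
    state = json.load(stream)
    pid = state.get("pid")
    if not isinstance(pid, int) or pid <= 0:
        raise ValueError("OCI state does not carry a usable pid")
    return state.get("id", ""), pid
```

A hook would call `read_container_pid(sys.stdin)` once at startup and use the returned PID, which is the PID as seen from the host, for the ftrace settings.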

Slide 18

Slide 18 text

What ftrace settings are needed to trace the container’s processes
⚫ Enable the system call events you want to trace (e.g. the events under /sys/kernel/debug/tracing/events/syscalls)
⚫ Trace only the specified PID (e.g. # echo [PID] > /sys/kernel/debug/tracing/set_event_pid)
⚫ Also trace the processes forked by the PIDs listed in set_event_pid (e.g. # echo 1 > /sys/kernel/debug/tracing/options/event-fork)
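The three settings above can be sketched as plain file writes. This is a hedged illustration, not the tool's actual code: the function names and the `tracefs` parameter are hypothetical, and enabling every syscall event through events/syscalls/enable is one possible choice (individual events can be enabled instead). The PID filter and event-fork are written before the events are enabled so that unrelated processes are not logged in between.

```python
import os

TRACEFS = "/sys/kernel/debug/tracing"  # path used on the slide


def write_setting(tracefs, relative_path, value):
    """Write one value to a tracefs control file."""
    with open(os.path.join(tracefs, relative_path), "w") as f:
        f.write(value + "\n")


def trace_syscalls_of(pid, tracefs=TRACEFS):
    # Follow children forked by the traced PID.
    write_setting(tracefs, "options/event-fork", "1")
    # Restrict event tracing to the given PID.
    write_setting(tracefs, "set_event_pid", str(pid))
    # Enable all syscall enter/exit tracepoints last.
    write_setting(tracefs, "events/syscalls/enable", "1")
```

To keep containers separate, `tracefs` would point at a per-container directory under instances/ rather than at the top-level tracing directory.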

Slide 19

Slide 19 text

Let’s trace the ls command inside a container

Slide 20

Slide 20 text

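We could integrate runC with the ftrace-based syscall analyzer
[Figure: the Kubernetes architecture split into user space and kernel space, with an "integration" label connecting the Low Level Runtime (runC) and the kernel tools. Other components shown: kubectl; Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager); Node (kubelet, High Level Runtime (containerd), Pod (Sample Application)).]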

Slide 21

Slide 21 text

How to set up a prestart hook in Kubernetes
⚫ Kubernetes Pod lifecycle and related hooks
⚫ Kubernetes does not provide a prestart hook
  • https://github.com/kubernetes/kubernetes/issues/140
⚫ Next, we investigate the prestart hook at the high-level runtime layer
[Figure: process lifetime from process start to process stop, with the poststart and prestop hooks.]

Slide 22

Slide 22 text

How to set up a prestart hook in containerd
⚫ First of all, CRI does not currently provide a way to specify a hook in the container’s config.json
⚫ Each high-level runtime has its own implementation
⚫ containerd’s ongoing work:
  • https://github.com/containerd/cri/pull/1248
  • https://github.com/containerd/cri/issues/405

Slide 23

Slide 23 text

How to set up a prestart hook in CRI-O
⚫ CRI-O already provides its own solution, "oci-hooks"
  • podman has the same feature
⚫ oci-hooks provides a way for users to configure the intended hooks for Open Container Initiative containers so they will only be executed for containers that need their functionality, and then only for the stages where they're needed
https://github.com/containers/libpod/blob/master/pkg/hooks/docs/oci-hooks.5.md

Slide 24

Slide 24 text

https://github.com/containers/libpod/blob/master/pkg/hooks/docs/oci-hooks.5.md

Slide 25

Slide 25 text

CRI-O oci-hooks prestart example
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/local/bin/oci-ftrace-syscall-analyzer",
    "args": ["oci-ftrace-syscall-analyzer"]
  },
  "when": {
    "always": true
  },
  "stages": ["prestart"]
}

Slide 26

Slide 26 text

Demo: Get syscall logs of a redis Pod

Slide 27

Slide 27 text

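Our integration is done!!
[Figure: the Kubernetes architecture split into user space and kernel space, with an "integration" label connecting the container runtime stack and the kernel tools. Components shown: kubectl; Master (etcd, kube-apiserver, kube-scheduler, kube-controller-manager); Node (kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Sample Application)).]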

Slide 28

Slide 28 text

Related tools
⚫ oci-seccomp-bpf-hook (today’s topic)
  • https://github.com/containers/oci-seccomp-bpf-hook
  • eBPF-based seccomp generator
⚫ kubectl-trace
  • https://github.com/iovisor/kubectl-trace

Slide 29

Slide 29 text

oci-seccomp-bpf-hook
⚫ oci-seccomp-bpf-hook generates seccomp profiles by tracing the syscalls made by the container using eBPF
⚫ perf is used to log the syscalls
⚫ This tool has a few limitations
  • Needs CAP_SYS_ADMIN to run
  • Compiles C code on the fly using LLVM
  • Cannot use podman run --rm together with this feature

Slide 30

Slide 30 text

Kernel technologies used by oci-seccomp-bpf-hook
[Figure: overview of kernel tracing technologies, annotated with an "oci-seccomp-bpf-hook" label. Source: http://mmi.hatenablog.com/entry/2018/03/04/052249]

Slide 31

Slide 31 text

When to use oci-ftrace-syscall-analyzer
⚫ Your production system does not grant privileges to users
⚫ Your production kernel is not configured for eBPF
⚫ Your production system does not want to use LLVM
  • Will GCC support a BPF backend?
    – Compiling to BPF with GCC: https://lwn.net/Articles/800606/

Slide 32

Slide 32 text

Process core dumps on Kubernetes

Slide 33

Slide 33 text

What is the problem?
⚫ A process core dump is written to the path specified in /proc/sys/kernel/core_pattern
⚫ But containers have their own Linux namespaces, and the kernel’s core dump handling does not support Linux namespaces
⚫ As a result, Kubernetes users cannot get their process core dumps
⚫ Kubernetes issue
  • https://github.com/kubernetes/kubernetes/issues/48787
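As a small illustration of the setting involved: core_pattern holds either a file name template or, when it starts with "|", the command line of a helper program that the kernel pipes the dump to. The sketch below only reads and classifies the value; the function names are hypothetical, and treating every non-pipe value as a file template is a simplification.

```python
CORE_PATTERN = "/proc/sys/kernel/core_pattern"


def read_core_pattern(path=CORE_PATTERN):
    """Return the current core_pattern value without the trailing newline."""
    with open(path) as f:
        return f.read().strip()


def describe_core_pattern(pattern):
    """Classify a core_pattern value as 'pipe' or 'file'."""
    # A leading '|' means the dump is piped to a user-space helper;
    # anything else is treated here as a file name template.
    return "pipe" if pattern.strip().startswith("|") else "file"
```

Running `describe_core_pattern(read_core_pattern())` on a node shows which of the two mechanisms a crashing container process would go through.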

Slide 34

Slide 34 text

Community approach 1: Modify kernel code
⚫ Modify the core dump code inside the kernel to support Linux namespaces
⚫ Patch
  • https://lkml.org/lkml/2017/8/2/77
⚫ Not merged

Slide 35

Slide 35 text

Community approach 2: Implement an add-on for Kubernetes
⚫ Also not merged
  • https://github.com/kubernetes/kubernetes/issues/48787

Slide 36

Slide 36 text

Wrap up
⚫ oci-ftrace-syscall-analyzer is seamlessly integrated with Kubernetes using CRI-O
  • https://github.com/KentaTada/oci-ftrace-syscall-analyzer
⚫ We should consider how to get process core dumps on Kubernetes

Slide 37

Slide 37 text

Challenges for oci-ftrace-syscall-analyzer
⚫ Integrate our tool with containerd
⚫ Implement the user space logging facility that originated in our internal container tools
⚫ Use kprobes to hook system calls and inspect syscall arguments
⚫ Implement a seccomp generator
⚫ Get rid of unnecessary syscall logs recorded between prestart and runC’s actual exec
  • oci-seccomp-bpf-hook uses prctl(2) as the starting point. Is that actually standard?

Slide 38

Slide 38 text

We Are Hiring!! https://www.sony.co.jp/SonyInfo/Jobs/careers/

Slide 39

Slide 39 text

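SONY is a registered trademark or trademark of Sony Corporation. The product and service names of Sony products are registered trademarks or trademarks of Sony Corporation or its group companies. Other product and company names are trade names, registered trademarks, or trademarks of their respective companies.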