Slide 1

Slide 1 text

Tracing the Containers - audit, falco, ... and eBPF! Uchio Kondo @ GMO Pepabo, Inc. #CNDK2019 Image from pixabay: https://pixabay.com/images/id-984050/

Slide 2

Slide 2 text

Señor-Principal Engineer @ GMO Pepabo, Inc. Uchio Kondo https://blog.udzura.jp/ @udzura Technical department, Dev Productivity/R&D Team Chair on CNDJ at Fukuoka, 2019.04 Systems programmer wannabe Duolingo freak (Emerald League)

Slide 3

Slide 3 text

JapanContainerDays 2018.12 •CRIU

Slide 4

Slide 4 text

CNDF 2019 Spring

Slide 5

Slide 5 text

CNDT 2019 summer •cgroup v2
 & PSI

Slide 6

Slide 6 text

Interested in:
•Container features in the Linux kernel (namespace, cgroup, capability, ...)
•System calls
•Kernel programming interfaces
•eBPF (<= New!!)
•My favorite struct: struct task_struct

Slide 7

Slide 7 text

Today

Slide 8

Slide 8 text

ToC
•Rough overview of container tracing (5m~)
•Introduction to eBPF
•Comparison to existing tracers
•Kernel events (~ 5m)
•Use cases with some DEMO (~ 10m)

Slide 9

Slide 9 text

Tracing Your containers

Slide 10

Slide 10 text

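Why tracing?
•Tracing serves the following purposes:
•Logging: understand what is happening inside a complex application
•Audit / security: emitting the necessary trace logs lets you investigate after the fact when something unexpected happens. It can sometimes also detect unauthorized access
•Debugging / performance: dig into things that plain application logs cannot tell you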

Slide 11

Slide 11 text

What to trace? Kubernetes/ API Host Linux Per-Container Apps (Networking)

Slide 12

Slide 12 text

Methodology

Slide 13

Slide 13 text

Kubernetes audit - orchestrator

Slide 14

Slide 14 text

Falco / sysdig - host, containers

Slide 15

Slide 15 text

Falco as an audit tool
•Rule-based auditing of many kinds of activity.
•File operations, processes, syslog...
•ref: Wazuh/OSSec https://wazuh.com/
•Audit rules specialized for containers
•trusted_images, falco_sensitive_mount_images, ... https://github.com/falcosecurity/falco/blob/dev/rules/falco_rules.yaml

Slide 16

Slide 16 text

Falco internal
•The main source of the audited information is a kernel module.
•sysdig(~0.6), falco-probe(0.6~)
•> The kernel modules are actually built from the same source code
•eBPF can now also be used internally
•https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

eBPF?

Slide 19

Slide 19 text

“Berkeley Packet Filter”
•Originally a paper on a packet filtering technique (classic BPF, 1993)
•Widely used as the engine inside tcpdump
•Beyond packet filtering: it also came to be used by seccomp
•Major changes starting with Linux 3.14 (2014) brought it close to its current form (extended BPF)
"Introduction to Berkeley Packet Filter (BPF) (1)" https://www.atmarkit.co.jp/ait/articles/1811/21/news010.html
http://www.tcpdump.org/papers/bpf-usenix93.pdf

Slide 20

Slide 20 text

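eBPF overview
•Build BPF bytecode (there are various ways to produce it)
•The kernel verifies it and JIT-compiles it when needed
•The program collects events inside the kernel
•An in-kernel aggregation store called a BPF map is available (very fast)
From: https://www.atmarkit.co.jp/ait/articles/1811/21/news010_2.html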

Slide 21

Slide 21 text

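Tools
•bpftrace(8) - a general-purpose tracer that uses eBPF internally
•Traces are described in scripts that look very much like the DTrace language
•BCC - a library for writing programs that wrap eBPF features
•Python, Lua, C++
•Ruby implementation - RbBCC (my own work)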

Slide 22

Slide 22 text

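Existing Linux tracers
Tool | Ability | Key sys call | Invasiveness
gdb | Step execution of a program, stopping it on signals, etc. | ptrace(2) | Large
strace | Tracing system calls | ptrace(2) | Large
perf | Aggregating and visualizing performance counters, etc. | perf_event_open(2) | Medium
bpftrace/BCC | Aggregating and visualizing all kinds of kernel events | bpf(2) | Smaller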

Slide 23

Slide 23 text

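Comparison to gdb/strace
•For both gdb and strace the key system call is ptrace(2)
•By design, the traced program has to be stopped once
•Precisely because it is stopped, more intrusive operations on the program's behavior are possible, such as updating registers
"Introduction to the ptrace system call" https://itchyny.hatenablog.com/entry/2017/07/31/090000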

Slide 24

Slide 24 text

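Comparison to perf
•perf can obtain much of the same information that eBPF can, such as tracepoints
•On the other hand, aggregation has non-negligible overhead, for example calling perf_event_open(2) per probe and aggregating in userland: the "observer effect"
•eBPF can filter and aggregate (eBPF map) inside the kernel. This is close to DTrace.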

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

eBPF and Kernel events

Slide 27

Slide 27 text

eBPF event source http://www.brendangregg.com/blog/2019-07-15/bpf-performance-tools-book.html

Slide 28

Slide 28 text

Important source for tracing
•perf, ftrace, and eBPF all use the same event sources
"How perf and ftrace work" http://mmi.hatenablog.com/entry/2018/03/04/052249

Slide 29

Slide 29 text

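tracepoint
•The Linux kernel has built-in hook points for tracing the various events that occur inside it.
•These are called tracepoints. Example: the events fired when the kernel's random number facility is used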
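To see which tracepoints a kernel exposes, one option is to walk the `events` directory of tracefs, where each category (for example `random`) holds one subdirectory per event. The following is a minimal sketch under that assumption; the function name `list_tracepoints` is made up for illustration, and the mount point in the usage note is the debugfs path used elsewhere in this talk.

```python
import os
from typing import List


def list_tracepoints(events_dir: str, category: str) -> List[str]:
    """Return 'category:event' names found under events_dir/category.

    Plain files such as 'enable' and 'filter' are skipped, because only
    subdirectories correspond to individual events.
    """
    cat_dir = os.path.join(events_dir, category)
    if not os.path.isdir(cat_dir):
        return []
    return sorted(
        f"{category}:{entry}"
        for entry in os.listdir(cat_dir)
        if os.path.isdir(os.path.join(cat_dir, entry))
    )
```

Calling `list_tracepoints("/sys/kernel/debug/tracing/events", "random")` as root would list the random-number events on kernels that still provide them; the result depends on the kernel version.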

Slide 30

Slide 30 text

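kprobe
•With tracepoints you can basically trace only the hook points that kernel developers prepared in advance.
•To trace calls to a specific kernel function of your own choosing, use kprobe. Note that the available functions differ by kernel version and architecture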

Slide 31

Slide 31 text

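uprobe
•Lets the kernel side follow the behavior of user-space programs
•A uprobe event has to be registered per binary (more precisely, per inode of the executable file, as I understand it).
•For example, register a function that is visible in the binary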

Slide 32

Slide 32 text

USDT
•User Statically Defined Tracepoint
•Lets you place probes at arbitrary points in a user program and use them with little overhead. (Under the hood it appears to be implemented as a uprobe)

Slide 33

Slide 33 text

Others
•Hardware and software counters like the ones perf uses can also be handled from eBPF.
•According to the bpftrace manual, there are a hardware provider, a software provider, and a memory watchpoint provider

Slide 34

Slide 34 text

“Raw” usage of tracefs
•Kernel tracing is possible through tracefs even without eBPF (tracefs is the same as what is visible through debugfs, but exposes only a more limited set of features)
"Kernel tracing for myself, part 1" https://udzura.hatenablog.jp/entry/2019/09/02/174801
echo "p:myprobe1 $sym" >> \ /sys/kernel/debug/tracing/kprobe_events
"Preparing for in-container debugging with ftrace" https://speakerdeck.com/kentatada/container-debug-using-ftrace
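The `echo ... >> kprobe_events` one-liner can be sketched as a small helper that builds the definition string and appends it to the control file. This is an illustration only: the helper names are hypothetical, and it covers just the simplest `p:EVENT SYMBOL`, `r:EVENT SYMBOL`, and `-:EVENT` forms of the kprobe_events syntax.

```python
def kprobe_definition(event: str, symbol: str, is_return: bool = False) -> str:
    """Build a kprobe_events line: 'p:' for entry probes, 'r:' for return probes."""
    kind = "r" if is_return else "p"
    return f"{kind}:{event} {symbol}"


def kprobe_removal(event: str) -> str:
    """Build the line that removes a previously defined probe."""
    return f"-:{event}"


def append_line(path: str, line: str) -> None:
    """Append one line to a control file, mirroring the shell's '>>'."""
    with open(path, "a") as control_file:
        control_file.write(line + "\n")
```

With root privileges, `append_line("/sys/kernel/debug/tracing/kprobe_events", kprobe_definition("myprobe1", sym))` would register the probe, and the same call with `kprobe_removal("myprobe1")` would remove it once the event is disabled.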

Slide 35

Slide 35 text

Trace of ping calling connect

Slide 36

Slide 36 text

OK, what is good with containers?

Slide 37

Slide 37 text

eBPF use case
•Debugging HOST Linux itself
•Syscalls or kernel functions around containers
•Runtime performance
•bpftrace result to Prometheus for monitoring
•Tracing events per container
•Cgroup v2 with eBPF
•Tracee by Aqua Security

Slide 38

Slide 38 text

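Tracing kernel on containers
•Containers use many kernel features, so eBPF lets you debug and measure those kernel features themselves.
•For example: `ip netns add/del`
•Internally these call the kernel functions copy_net_ns/cleanup_net
•Depending on the kernel version, these functions take a lock internally, so we want to investigate things like the performance impact → use eBPF!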

Slide 39

Slide 39 text

Demo (1)

Slide 40

Slide 40 text

Reference
•"Linux Kernel: observing the hang caused by holding rtnl_mutex for a long time"
•https://hiboma.hatenadiary.jp/entry/2019/10/29/123455
•(As an aside, thanks to hiboma I learned how to use /proc/$pid/stack and wchan)

Slide 41

Slide 41 text

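Tracing Runtime
•Measured the following (with Haconiwa, my own container runtime)
•Time from container runtime start until execve
•Time from container runtime start until the container starts to listen
•A combination of USDT and tracepoint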

Slide 42

Slide 42 text

bpftrace script

Slide 43

Slide 43 text

bpftrace → Prometheus
•I wrote a tool called bt2prom.
•It converts the JSON format emitted by bpftrace into a Prometheus-compatible format.
•Put the output as-is into the Textfile exporter directory and it can be plotted
•Intended to be run periodically from cron or similar (think of how sar is used)
“Format bpftrace JSON into prometheus-compat textfile” https://github.com/udzura/mruby-bin-bt2prom
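The conversion idea can be sketched in a few lines. This is not the bt2prom implementation (which is written for mruby); it is a rough illustration that assumes bpftrace's JSON output is one object per line with a `"type"` and a `"data"` field, and that map records look like `{"type": "map", "data": {"@name": {...}}}`. The `bpftrace_` metric prefix and the `key` label name are choices made here for the example.

```python
import json
import re
from typing import Iterable


def metric_name(map_name: str, prefix: str = "bpftrace") -> str:
    """Turn a bpftrace map name such as '@reads' into a valid metric name."""
    base = re.sub(r"[^a-zA-Z0-9_]", "_", map_name.lstrip("@")) or "anon"
    return f"{prefix}_{base}"


def escape_label(value: object) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(json_lines: Iterable[str], prefix: str = "bpftrace") -> str:
    """Convert bpftrace JSON lines into Prometheus text-format samples."""
    samples = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "map":
            continue  # skip printf output, attach notices, and so on
        for map_name, content in record["data"].items():
            name = metric_name(map_name, prefix)
            if isinstance(content, dict):
                # Keyed map: one sample per key, exposed as a label.
                for key, value in content.items():
                    if isinstance(value, (int, float)):
                        samples.append(f'{name}{{key="{escape_label(key)}"}} {value}')
            elif isinstance(content, (int, float)):
                # Scalar map: a single unlabeled sample.
                samples.append(f"{name} {content}")
    return "".join(sample + "\n" for sample in samples)
```

Writing the returned text to a `.prom` file in the Textfile exporter's directory would make the counters scrapeable; histogram-valued maps are deliberately ignored in this sketch.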

Slide 44

Slide 44 text

A rough example of tracking vfs_read

Slide 45

Slide 45 text

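CGroup v2 x eBPF
•BPF helper functions dedicated to cgroups - they tell you which cgroup the executing thread belongs to. BPF_FUNC_get_current_cgroup_id and others
•They require a very recent kernel... but they are handy
•They make it easy to trace, per container, things like which files get opened
•e.g. snooping the files an Apache HTTPD container opens on each request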
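On the user-space side, events tagged with a cgroup ID need to be matched back to containers. A rough sketch follows, assuming (as is commonly the case on 64-bit systems) that the ID returned by BPF_FUNC_get_current_cgroup_id equals the inode number of the cgroup v2 directory. The function names and the `"host-or-unknown"` fallback label are made up for illustration.

```python
import os
from typing import Dict


def cgroup_id_for(path: str) -> int:
    """Return the inode number of a cgroup v2 directory.

    Assumption: this matches the ID the BPF helper reports for tasks
    running in that cgroup.
    """
    return os.stat(path).st_ino


def build_cgroup_index(cgroup_dirs: Dict[str, str]) -> Dict[int, str]:
    """Map cgroup ID -> container name, given container name -> cgroup path."""
    return {cgroup_id_for(path): name for name, path in cgroup_dirs.items()}


def label_event(event_cgroup_id: int, index: Dict[int, str]) -> str:
    """Resolve the container name for an event, with a fallback label."""
    return index.get(event_cgroup_id, "host-or-unknown")
```

A per-container file-open snoop could then print `label_event(event_id, index)` next to each opened path, or filter on a single container's ID.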

Slide 46

Slide 46 text

Demo (2) No time left for this one, so please come and talk to me directly!

Slide 47

Slide 47 text

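Tracee
•A container tracer implementation that uses eBPF throughout
•Internally it does things such as resolving PID → Namespace
•bpftrace/BCC are general-purpose, so I look forward to its specialized features
https://blog.aquasec.com/ebpf-tracing-containers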
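For intuition about what "PID → Namespace" means, the same lookup can be done from user space through procfs, where `/proc/<pid>/ns/<kind>` is a symlink whose target looks like `pid:[4026531836]`. This sketch is not how Tracee does it (the slide only says the resolution happens internally); the function names here are hypothetical.

```python
import os
import re
from typing import Tuple

# Matches namespace link targets such as 'pid:[4026531836]'.
_NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a namespace symlink target into (kind, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespace_of(pid: int, kind: str = "pid", proc_root: str = "/proc") -> Tuple[str, int]:
    """Read and parse /proc/<pid>/ns/<kind> for a running process."""
    link = os.path.join(proc_root, str(pid), "ns", kind)
    return parse_ns_link(os.readlink(link))
```

Two processes whose `namespace_of(pid)` results are equal share that namespace, which is one way to group traced events by container.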

Slide 48

Slide 48 text

Conclusion

Slide 49

Slide 49 text

Happy publishing!

Slide 50

Slide 50 text

We’re moving to cgroup v2
•Moby's P/R for cgroup v2 support (WIP)
•systemd making v2 the default (from 243)

Slide 51

Slide 51 text

What is new in cgroup v2 (Reprise)
•Unified Hierarchy
•CGroup-aware OOM Killer
•nsdelegate and better cgroup namespace
•PSI - Pressure Stall Information
•BPF helper for cgroup v2 (such as BPF_FUNC_get_current_cgroup_id, ...)
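PSI is exposed per cgroup through files such as `cpu.pressure`, `memory.pressure`, and `io.pressure`, whose lines look like `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. A minimal parser for that format might look like the following; the function name is made up for illustration, and reading the file from a particular cgroup path is left to the caller.

```python
from typing import Dict, Union

Number = Union[int, float]


def parse_pressure(text: str) -> Dict[str, Dict[str, Number]]:
    """Parse the contents of a PSI file into {'some': {...}, 'full': {...}}.

    avg10/avg60/avg300 are percentages (floats); total is the cumulative
    stall time in microseconds (int).
    """
    result: Dict[str, Dict[str, Number]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values: Dict[str, Number] = {}
        for field in fields:
            key, _, raw = field.partition("=")
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result
```

For example, feeding it the contents of a container's `memory.pressure` and watching `["some"]["avg10"]` gives a per-container signal that host-wide load average cannot provide.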

Slide 52

Slide 52 text

It should be “per-container”
Host-wide: •Load Average •Memory usage •psutils, top, vmstat... •netstat, iostat •syslog, auditd •perf
Per-Container: •Cgroup stat •PSI (especially) •eBPF (per container) •USDT, syscalls... •sysdig/falco •perf --cgroup

Slide 53

Slide 53 text

Understand new feature to use new tools in a better way