Tracing the Containers (mainly about eBPF)

KONDO Uchio
November 28, 2019

Presented @ CNDK 2019

  1. 1.

    Tracing the Containers: audit, falco, ... and eBPF!
    Uchio Kondo @ GMO Pepabo, Inc. #CNDK2019
    Image from pixabay: https://pixabay.com/images/id-984050/
  2. 2.

    Señor-Principal Engineer @ GMO Pepabo, Inc.
    Uchio Kondo https://blog.udzura.jp/ @udzura
    Technical department, Dev Productivity/R&D Team
    Chair of CNDJ at Fukuoka, 2019.04
    Systems programmer wannabe
    Duolingo freak (Emerald League)
  3. 6.

    Interested in:
    •Container features in the Linux kernel (namespace, cgroup, capability, ...)
    •System calls
    •Kernel programming interfaces
    •eBPF (<= New!!)
    •Favorite struct: struct task_struct
  4. 7.
  5. 8.

    ToC
    •Rough overview of container tracing (5m~)
    •Introduction to eBPF
    •Comparison to existing tracers
    •Kernel events (~ 5m)
    •Use cases with some DEMO (~ 10m)
  6. 15.

    Falco as an audit tool
    •Audits many kinds of activity based on rules.
    •File operations, processes, syslog...
    •ref: Wazuh/OSSEC https://wazuh.com/
    •Audit rules specialized for containers
    •trusted_images, falco_sensitive_mount_images, ... https://github.com/falcosecurity/falco/blob/dev/rules/falco_rules.yaml
  7. 16.

    Falco internals
    •The main source of the audited information is a kernel module.
    •sysdig (~0.6), falco-probe (0.6~)
    •> The kernel modules are actually built from the same source code
    •eBPF can now be used internally as well. https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/
  8. 17.
  9. 18.
  10. 19.

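    “Berkeley Packet Filter”
    •Originally a paper on a packet filtering technique (classic BPF, 1993)
    •Widely used as the engine inside tcpdump
    •Beyond packet filtering: it also came to be used by seccomp
    •Big changes starting with Linux 3.14 (2014) brought it close to its current form (extended BPF)
    “Introduction to the Berkeley Packet Filter (BPF), Part 1” (in Japanese) https://www.atmarkit.co.jp/ait/articles/1811/21/news010.html
    http://www.tcpdump.org/papers/bpf-usenix93.pdf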
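To make the “packet filter” origin concrete, here is a minimal Python sketch (an illustration, not kernel code) of how classic BPF evaluates a filter: a tiny register machine with an accumulator, relative jumps, and a return value saying how many bytes of the packet to keep. The four instructions are what `tcpdump -dd ip` typically prints for an Ethernet interface (the snap length in the accept instruction depends on the tcpdump version). The interpreter supports only these three opcodes, and `run_cbpf`/`IP_FILTER` are hypothetical names.

```python
import struct

# Opcode values as composed from linux/filter.h constants
LD_H_ABS = 0x28   # BPF_LD | BPF_H | BPF_ABS: A = big-endian u16 at packet[k]
JMP_JEQ_K = 0x15  # BPF_JMP | BPF_JEQ | BPF_K: if A == k skip jt, else skip jf
RET_K = 0x06      # BPF_RET | BPF_K: accept k bytes of the packet (0 = drop)

# (code, jt, jf, k) tuples mirroring struct sock_filter: "accept IPv4 on Ethernet"
IP_FILTER = [
    (LD_H_ABS, 0, 0, 0x0000000C),   # load EtherType (offset 12)
    (JMP_JEQ_K, 0, 1, 0x00000800),  # IPv4? fall through : skip one instruction
    (RET_K, 0, 0, 0x00040000),      # accept (snap length 262144)
    (RET_K, 0, 0, 0x00000000),      # drop
]


def run_cbpf(program, packet: bytes) -> int:
    """Interpret a tiny subset of classic BPF and return the accept length."""
    acc = 0  # the accumulator register "A"
    pc = 0
    while pc < len(program):
        code, jt, jf, k = program[pc]
        pc += 1
        if code == LD_H_ABS:
            if k + 2 > len(packet):
                return 0  # an out-of-bounds load aborts the filter: drop
            (acc,) = struct.unpack_from(">H", packet, k)
        elif code == JMP_JEQ_K:
            pc += jt if acc == k else jf  # offsets are relative to the next insn
        elif code == RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#04x}")
    raise ValueError("program ended without a RET instruction")
```

For example, `run_cbpf(IP_FILTER, frame)` returns 262144 for an Ethernet frame whose EtherType is 0x0800 and 0 otherwise. eBPF keeps this "small verified program run in the kernel" idea but widens it to more registers, maps and many more hook points.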
  11. 22.

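    Existing Linux tracers

    Tool         | Ability                                                | Key syscall        | Invasiveness
    gdb          | Step-by-step program execution, stopping on signals    | ptrace(2)          | Large
    strace       | Tracing system calls                                   | ptrace(2)          | Large
    perf         | Aggregating and visualizing performance counters, etc. | perf_event_open(2) | Medium
    bpftrace/BCC | Aggregating and visualizing any kind of kernel event   | bpf(2)             | Smaller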
  12. 24.

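    Comparison to perf
    •perf can obtain much of the same information eBPF can, such as tracepoints.
    •However, aggregating it means, for example, calling perf_event_open(2) per probe and
     summing up in userland, so the overhead cannot be ignored
     (the “observer effect”).
    •eBPF can filter and aggregate (eBPF maps) inside the kernel.
     Close to DTrace.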
  13. 25.
  14. 34.

    “Raw” usage of tracefs
    •Through tracefs you can trace the kernel even without eBPF
     (it is the same interface you see under debugfs, but it exposes only a narrower set of features).
    “Kernel tracing for myself, part 1” (in Japanese) https://udzura.hatenablog.jp/entry/2019/09/02/174801

      echo "p:myprobe1 $sym" >> \
        /sys/kernel/debug/tracing/kprobe_events

    “Preparing for in-container debugging with ftrace” https://speakerdeck.com/kentatada/container-debug-using-ftrace
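The tracefs workflow above can be scripted. This is a minimal Python sketch that assumes tracefs is reachable at /sys/kernel/debug/tracing (newer systems also mount it at /sys/kernel/tracing) and that you run it as root. It builds a kprobe_events line in the `p:NAME SYMBOL` form shown above (or `r:` for a return probe), appends it, and enables the event. The helper names are hypothetical. The tracing directory is a parameter, so you can try the functions against a scratch directory.

```python
import os

TRACING = "/sys/kernel/debug/tracing"  # assumption: debugfs-style tracefs path


def kprobe_definition(event: str, symbol: str, retprobe: bool = False, group: str = "") -> str:
    """Build one kprobe_events line, e.g. 'p:myprobe1 copy_net_ns'."""
    kind = "r" if retprobe else "p"           # 'r' = return probe (kretprobe)
    name = f"{group}/{event}" if group else event
    return f"{kind}:{name} {symbol}"


def add_kprobe(definition: str, tracing_dir: str = TRACING) -> None:
    # Append like the shell's '>>': opening with truncation ('w') would
    # clear every probe that is already defined.
    with open(os.path.join(tracing_dir, "kprobe_events"), "a") as f:
        f.write(definition + "\n")


def enable_kprobe(event: str, tracing_dir: str = TRACING, group: str = "kprobes") -> None:
    # Probes defined without an explicit group land in the 'kprobes' group
    path = os.path.join(tracing_dir, "events", group, event, "enable")
    with open(path, "w") as f:
        f.write("1")
```

For example, `add_kprobe(kprobe_definition("myprobe1", "copy_net_ns"))` followed by `enable_kprobe("myprobe1")` makes each hit show up in `trace_pipe` under the tracing directory.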
  15. 37.

    eBPF use cases
    •Debugging the HOST Linux itself
    •Syscalls or kernel functions around containers
    •Runtime performance
    •bpftrace results to Prometheus for monitoring
    •Tracing events per container
    •cgroup v2 with eBPF
    •Tracee by Aqua Security
  16. 38.

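    Tracing the kernel on containers
    •Containers use a variety of kernel features, and eBPF lets you debug and measure those kernel features themselves.
    •For example: `ip netns add/del`
    •Internally this calls the kernel functions copy_net_ns/cleanup_net.
    •Depending on the kernel version, these take locks internally, so we want to investigate their performance impact → with eBPF!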
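One way to look at the copy_net_ns cost, even before reaching for eBPF, is to put an entry probe and a return probe on it and measure the gap per PID from trace_pipe. This Python sketch assumes two hypothetical probes, `p:netns_enter copy_net_ns` and `r:netns_exit copy_net_ns`, and the usual text layout `comm-pid [cpu] flags timestamp: event: ...`. The exact columns differ between kernel versions, so the regex treats the flags and tgid columns as optional. An eBPF tool would do this pairing and aggregation inside the kernel instead.

```python
import re

# '<comm>-<pid> [(tgid)] [cpu] [flags] <timestamp>: <event>: ...'
LINE_RE = re.compile(
    r"^\s*(?P<comm>.+?)-(?P<pid>\d+)\s+(?:\(\s*[\d-]+\)\s+)?\[\d+\]\s+"
    r"(?:\S+\s+)?(?P<ts>\d+\.\d+):\s+(?P<event>[\w.]+):"
)


def netns_latencies(lines, enter="netns_enter", exit_="netns_exit"):
    """Pair entry/return events per PID and return durations in microseconds."""
    started = {}     # pid -> timestamp of the pending entry event
    durations = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # header/comment lines (as in the 'trace' file) or other output
        pid, ts, event = int(m["pid"]), float(m["ts"]), m["event"]
        if event == enter:
            started[pid] = ts
        elif event == exit_ and pid in started:
            durations.append((ts - started.pop(pid)) * 1e6)
    return durations
```

Feeding it lines read from trace_pipe while running `ip netns add`/`ip netns del` in a loop gives a rough latency distribution for copy_net_ns.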
  17. 39.
  18. 43.

    bpftrace → Prometheus
    •I wrote a tool called bt2prom.
    •It converts the JSON format that bpftrace emits into a Prometheus-compatible format.
    •Put the output straight into the Textfile exporter's directory and it can be plotted.
    •Meant to be run periodically from cron or similar (think sar).
    “Format bpftrace JSON into prometheus-compat textfile” https://github.com/udzura/mruby-bin-bt2prom
  19. 45.

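    CGroup v2 x eBPF
    •BPF helper functions dedicated to cgroups: they tell you which cgroup the running thread belongs to. BPF_FUNC_get_current_cgroup_id and others
    •They need a very recent kernel... but they are handy
    •Tracing, per container, things like which files get opened becomes easy
    •e.g. snooping the files an Apache HTTPD container opens on each request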
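The cgroup helpers return a numeric ID, so a userland tool still has to map it back to a container. On cgroup v2, the ID reported by bpf_get_current_cgroup_id() (exposed in bpftrace as the `cgroup` builtin) corresponds to the inode number of the cgroup's directory under the cgroup2 mount. A sketch can therefore build that index by walking /sys/fs/cgroup. It assumes a 64-bit kernel and a cgroup2 mount at /sys/fs/cgroup, and the helper names are hypothetical.

```python
import os


def cgroup_id_index(root: str = "/sys/fs/cgroup") -> dict:
    """Map directory inode number -> path for every cgroup under a cgroup2 mount."""
    index = {}
    for dirpath, _dirnames, _files in os.walk(root):
        # Assumption: on cgroup v2 the cgroup id equals the directory's inode number
        index[os.stat(dirpath).st_ino] = dirpath
    return index


def resolve_cgroup(cgroup_id: int, index: dict, root: str = "/sys/fs/cgroup") -> str:
    """Turn an id emitted by an eBPF program into a cgroup path like '/system.slice/x'."""
    path = index.get(cgroup_id)
    if path is None:
        return "<unknown>"  # e.g. the cgroup was created after the index was built
    rel = os.path.relpath(path, root)
    return "/" if rel == "." else "/" + rel
```

With this, per-event cgroup IDs (for example, from an open-file snoop) can be grouped by a container's scope path instead of by raw numbers.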
  20. 50.

    We’re moving to cgroup v2
    •Moby's P/R for cgroup v2 support (WIP)
    •systemd making v2 the default (from 243)
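Before relying on cgroup v2 features, a tool needs to know which layout the host actually uses. This Python sketch classifies /proc/self/mounts using systemd's terms: unified (cgroup2 at /sys/fs/cgroup), hybrid (v1 controllers plus a separate cgroup2 tree), or legacy (v1 only). The function names and the "none" fallback are hypothetical.

```python
def cgroup_mode(mounts_text: str) -> str:
    """Classify the cgroup setup from /proc/mounts-style text."""
    v2_mountpoints = []
    v1_seen = False
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mountpoint, fstype, options, ...
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        if fstype == "cgroup2":
            v2_mountpoints.append(mountpoint)
        elif fstype == "cgroup":
            v1_seen = True
    if "/sys/fs/cgroup" in v2_mountpoints:
        return "unified"   # cgroup v2 only
    if v1_seen and v2_mountpoints:
        return "hybrid"    # v1 controllers plus a v2 tree (e.g. /sys/fs/cgroup/unified)
    if v1_seen:
        return "legacy"    # cgroup v1 only
    return "none"


def current_cgroup_mode(path: str = "/proc/self/mounts") -> str:
    with open(path) as f:
        return cgroup_mode(f.read())
```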
  21. 51.

    What is new in cgroup v2 (Reprise)
    •Unified Hierarchy
    •CGroup-aware OOM Killer
    •nsdelegate and better cgroup namespace
    •PSI - Pressure Stall Information
    •BPF helper for cgroup v2 (such as BPF_FUNC_get_current_cgroup_id, ...)
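On cgroup v2, PSI appears as `cpu.pressure`, `memory.pressure` and `io.pressure` files in each cgroup directory (and host-wide under /proc/pressure/). Each file has a `some` line and, where applicable, a `full` line such as `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. Below is a small parser sketch that assumes this format: the avg values are percentages, and `total` is the cumulative stall time in microseconds. The helper names are hypothetical.

```python
import os


def parse_psi(text: str) -> dict:
    """Parse PSI text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], {}
        for item in parts[1:]:
            key, _, value = item.partition("=")
            # avg10/avg60/avg300 are percentages; total is stall time in microseconds
            fields[key] = int(value) if key == "total" else float(value)
        result[kind] = fields
    return result


def read_cgroup_psi(cgroup_dir: str, resource: str = "memory") -> dict:
    """Read '<resource>.pressure' from one cgroup directory (e.g. a container's scope)."""
    with open(os.path.join(cgroup_dir, f"{resource}.pressure")) as f:
        return parse_psi(f.read())
```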
  22. 52.

    It should be “per-container”
    Host-wide:
    •Load Average
    •Memory usage
    •psutils, top, vmstat...
    •netstat, iostat
    •syslog, auditd
    •perf
    Per-Container:
    •Cgroup stat
    •PSI (especially)
    •eBPF (per container)
    •USDT, syscalls...
    •sysdig/falco
    •perf --cgroup
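For the “Cgroup stat” entry, the per-container counterpart of host-wide memory usage lives in the container's own cgroup files. This sketch reads `memory.current` (total bytes charged to the cgroup) and the flat-keyed `memory.stat` file. It assumes cgroup v2 and that you already know the container's cgroup directory. The function names and the selected keys are hypothetical.

```python
import os


def read_flat_keyed(path: str) -> dict:
    """Parse a cgroup v2 'flat keyed' file such as memory.stat ('anon 1234' per line)."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            if key and value:
                stats[key] = int(value)
    return stats


def container_memory(cgroup_dir: str) -> dict:
    """Per-container memory view: total charged bytes plus a few memory.stat entries."""
    with open(os.path.join(cgroup_dir, "memory.current")) as f:
        current = int(f.read().strip())
    stat = read_flat_keyed(os.path.join(cgroup_dir, "memory.stat"))
    return {"current": current, "anon": stat.get("anon", 0), "file": stat.get("file", 0)}
```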