Tracing the Containers (mainly about eBPF)

KONDO Uchio
November 28, 2019

Presented @ CNDK 2019

Transcript

  1. Tracing the Containers: audit, falco, ... and eBPF! Uchio Kondo @ GMO Pepabo, Inc. #CNDK2019 (Image from pixabay: https://pixabay.com/images/id-984050/)
  2. Uchio Kondo (https://blog.udzura.jp/, @udzura)
    •Señor-Principal Engineer @ GMO Pepabo, Inc.
    •Technical department, Dev Productivity/R&D Team
    •Chair on CNDJ at Fukuoka, 2019.04
    •Systems programmer wannabe
    •Duolingo freak (Emerald League)
  3. JapanContainerDays 2018.12 •CRIU

  4. CNDF 2019 Spring

  5. CNDT 2019 summer •cgroup v2 & PSI

  6. Interested in:
    •Container features in the Linux kernel (namespace, cgroup, capability, ...)
    •System calls
    •Kernel programming interfaces
    •eBPF (<= New!!)
    •My favorite struct: struct task_struct
  7. Today

  8. ToC
    •Rough overview of container tracing (5m~)
    •Introduction to eBPF
    •Comparison to existing tracers
    •Kernel events (~ 5m)
    •Use cases with some DEMO (~ 10m)
  9. Tracing Your containers

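  10. Why tracing?
    •Tracing serves purposes such as the following
    •Logging: understand what is happening inside a complex application
    •Audit / security: emitting the necessary trace logs lets you investigate after the fact when something unexpected happens. It can also help detect unauthorized access and the like
    •Debugging / performance: dig into things that plain application logs cannot reveal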
  11. What to trace? Kubernetes/ API Host Linux Per-Container Apps (Networking)

  12. Methodology

  13. Kubernetes audit - orchestrator

  14. Falco / sysdig - host, containers

  15. Falco as an audit tool
    •Rule-based auditing of a wide range of things
    •File operations, processes, syslog...
    •ref: Wazuh/OSSec https://wazuh.com/
    •Audit rules specialized for containers
    •trusted_images, falco_sensitive_mount_images, ... https://github.com/falcosecurity/falco/blob/dev/rules/falco_rules.yaml
  16. Falco internals
    •The main source of the audited information is a kernel module
    •sysdig (~0.6), falco-probe (0.6~)
    •> The kernel modules are actually built from the same source code
    •eBPF can now be used internally as well
    •https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/
  17. None
  18. eBPF?

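  19. "Berkeley Packet Filter"
    •Originally a paper describing a packet filtering technique (classic BPF, 1993)
    •Became well known as the engine inside tcpdump
    •Beyond packet filtering: later also used by seccomp
    •Major changes from Linux 3.14 (2014) onward brought it close to its current form (extended BPF)
    •"Introduction to Berkeley Packet Filter (BPF) (1)" https://www.atmarkit.co.jp/ait/articles/1811/21/news010.html
    •http://www.tcpdump.org/papers/bpf-usenix93.pdf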
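  20. eBPF overview
    •Build BPF bytecode (there are various ways to produce it)
    •The kernel verifies it and JIT-compiles it as needed
    •The program collects events inside the kernel
    •An in-kernel aggregation store called a BPF map is available (very fast)
    •From: https://www.atmarkit.co.jp/ait/articles/1811/21/news010_2.html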
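  21. Tools
    •bpftrace(8): a general-purpose tracer that uses eBPF internally
    •What to trace is described in a script that closely resembles the DTrace language
    •BCC: a library for writing programs that wrap eBPF functionality
    •Python, Lua, C++
    •Ruby implementation: RbBCC (the speaker's own work)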
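  22. Existing Linux tracers (tool: ability; key system call; invasiveness)
    •gdb: step execution of programs, stopping on signals and so on; ptrace(2); large
    •strace: tracing system calls; ptrace(2); large
    •perf: aggregating and visualizing performance counters and the like; perf_event_open(2); medium
    •bpftrace/BCC: aggregating and visualizing all kinds of kernel events; bpf(2); smaller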
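  23. Comparison to gdb/strace
    •For both gdb and strace, the key system call is ptrace(2)
    •By design, the traced program has to be stopped at some point
    •Precisely because it is stopped, deeper interventions in the program's behavior, such as updating registers, are also possible
    •"Introduction to the ptrace system call" https://itchyny.hatenablog.com/entry/2017/07/31/090000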
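  24. Comparison to perf
    •perf can obtain much of the same information that eBPF can, such as tracepoints, in the same way
    •For aggregation, however, it has to, for example, call perf_event_open(2) per probe and aggregate in userland, so the overhead is not negligible (the "observer effect")
    •eBPF can filter and aggregate (eBPF map) inside the kernel. In this respect it is close to DTrace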
  25. None
  26. eBPF and Kernel events

  27. eBPF event source http://www.brendangregg.com/blog/2019-07-15/bpf-performance-tools-book.html

  28. Important sources for tracing
    •perf, ftrace, and eBPF use the same event sources
    •"How perf and ftrace work" http://mmi.hatenablog.com/entry/2018/03/04/052249
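  29. tracepoint
    •The Linux kernel has built-in hook points for tracing the various events that happen inside it
    •These are called tracepoints. Example: the events emitted when the kernel's random number facility is used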

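  30. kprobe
    •tracepoints can basically trace only the hook points that kernel developers prepared in advance
    •To trace calls to a specific kernel function of your own choosing, use kprobe. Note that the available functions differ by kernel version and architecture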

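  31. uprobe
    •The behavior of user-space programs can be followed from the kernel side
    •uprobe events have to be registered per binary (more precisely, reportedly per inode of the executable file)
    •For example, register a function that is visible in the binary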
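
The "per inode" remark on the uprobe slide can be made concrete with a small Python sketch. It only shows how two paths may refer to one inode (and would therefore, as far as the slide's claim goes, count as one uprobe target), while a copy gets a different inode. The helper name `file_identity` is made up for this illustration and is not part of any tracing tool.

```python
import os
from typing import Tuple


def file_identity(path: str) -> Tuple[int, int]:
    """Return (device, inode) for a path; hard links to one file share this pair."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_probe_target(path_a: str, path_b: str) -> bool:
    """True when both paths resolve to the same inode on the same device."""
    return file_identity(path_a) == file_identity(path_b)
```

A hard link to a binary gives `same_probe_target(...) == True`, while a copy made with `cp` gives `False`, which is one reason a probe registered against one path may not fire for a copied binary.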

  32. USDT
    •User Statically Defined Tracepoint
    •Probes can be placed at arbitrary points in a user program and used with little overhead. (Internally they appear to be implemented as uprobes.)

  33. Others
    •Hardware and software counters like the ones perf uses can also be handled from eBPF
    •According to the bpftrace manual, there are a hardware provider, a software provider, and a memory watchpoint provider

  34. "Raw" usage of tracefs
    •Kernel tracing is possible via tracefs even without eBPF (it is the same as what is visible through debugfs, but exposes only a more limited set of features)
    •"Kernel tracing for myself, part 1" https://udzura.hatenablog.jp/entry/2019/09/02/174801
    •echo "p:myprobe1 $sym" >> /sys/kernel/debug/tracing/kprobe_events
    •"Preparing to debug inside containers with ftrace" https://speakerdeck.com/kentatada/container-debug-using-ftrace
  35. A trace of ping calling connect
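
The `echo "p:myprobe1 $sym" >> .../kprobe_events` step from the tracefs slide could be wrapped roughly as follows. This is a minimal Python sketch under stated assumptions: the function names are invented here, the file path is the one shown on the slide, and actually registering a probe needs root and a mounted debugfs.

```python
from pathlib import Path

# Path as written on the slide; assumed to be available on the target host.
KPROBE_EVENTS = Path("/sys/kernel/debug/tracing/kprobe_events")


def kprobe_definition(event_name: str, symbol: str, on_return: bool = False) -> str:
    """Build one kprobe_events line such as 'p:myprobe1 <symbol>'."""
    if not event_name or not symbol:
        raise ValueError("event name and symbol must be non-empty")
    if any(ch.isspace() for ch in event_name + symbol):
        raise ValueError("event name and symbol must not contain whitespace")
    prefix = "r" if on_return else "p"  # 'p' probes function entry, 'r' probes return
    return f"{prefix}:{event_name} {symbol}"


def register_kprobe(event_name: str, symbol: str, events_file: Path = KPROBE_EVENTS) -> None:
    """Append the definition to the events file, like `echo ... >>` in the shell."""
    with open(events_file, "a") as f:
        f.write(kprobe_definition(event_name, symbol) + "\n")
```

For instance, `kprobe_definition("myprobe1", "do_sys_open")` returns the string `p:myprobe1 do_sys_open`. The symbol `do_sys_open` is only a placeholder; which symbols can be probed depends on the kernel version, as the kprobe slide points out.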

  36. OK, what is good with containers?

  37. eBPF use cases
    •Debugging the host Linux itself
    •Syscalls or kernel functions around containers
    •Runtime performance
    •bpftrace results to Prometheus for monitoring
    •Tracing events per container
    •Cgroup v2 with eBPF
    •Tracee by Aqua Security
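  38. Tracing the kernel on containers
    •Containers rely on many kernel features, and eBPF lets you debug and measure those kernel features themselves
    •For example: `ip netns add/del`
    •Internally this calls the kernel functions copy_net_ns/cleanup_net
    •Depending on the kernel version, these in turn take a lock internally, so we want to investigate the performance impact → use eBPF!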
  39. Demo (1)

  40. Reference
    •"Linux Kernel: observing a process stuck while rtnl_mutex is held for a long time"
    •https://hiboma.hatenadiary.jp/entry/2019/10/29/123455
    •(As an aside, thanks to hiboma I learned how to use /proc/$pid/stack and wchan)
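  41. Tracing Runtime
    •Measured the following (with Haconiwa, the speaker's own container runtime)
    •Time from container runtime startup until it calls execve
    •Time from container runtime startup until the container starts to listen
    •A combination of USDT and tracepoints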

  42. bpftrace script
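
The bpftrace script itself is not in the transcript. A common pattern for this kind of "time from event A to event B" measurement is to store a start timestamp keyed by process and subtract it when the second event arrives. Below is a plain-Python sketch of that pairing logic only, under the assumption that events arrive as (timestamp, pid, name) tuples; the event names `runtime_start` and `execve` are placeholders, not the probes used in the talk.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, int, str]  # (timestamp in ns, pid, event name)


def startup_latencies(
    events: Iterable[Event],
    start: str = "runtime_start",
    end: str = "execve",
) -> List[Tuple[int, int]]:
    """Pair each start event with the next end event of the same pid.

    Returns (pid, elapsed_ns) tuples in the order the end events were seen.
    End events without a recorded start are ignored.
    """
    started: Dict[int, int] = {}  # plays the role of a map keyed by pid
    results: List[Tuple[int, int]] = []
    for timestamp, pid, name in events:
        if name == start:
            started[pid] = timestamp
        elif name == end and pid in started:
            results.append((pid, timestamp - started.pop(pid)))
    return results
```

In an actual eBPF tracer the `started` dictionary would live in a BPF map, so the subtraction happens in the kernel and only the result is reported to userland.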

  43. bpftrace → Prometheus
    •Wrote a tool called bt2prom
    •It converts the JSON format that bpftrace emits into a Prometheus-compatible format
    •Put the output as-is into the Textfile exporter's directory and it can be plotted
    •Intended to be run periodically, for example from cron (think of sar)
    •"Format bpftrace JSON into prometheus-compat textfile" https://github.com/udzura/mruby-bin-bt2prom
  44. A rough example of tracking vfs_read
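
To give a feel for the conversion that bt2prom performs, here is a minimal Python sketch. It is not the bt2prom implementation (that tool is written for mruby). It assumes bpftrace was run with JSON output, producing one JSON object per line with a `type` field and map contents under `data`; the `bpftrace_` metric prefix and the `key` label name are choices made for this example.

```python
import json
import re
from typing import Iterable, List


def _metric_name(map_name: str, prefix: str = "bpftrace") -> str:
    """Turn a bpftrace map name like '@reads' into a Prometheus-safe metric name."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", map_name.lstrip("@")) or "value"
    return f"{prefix}_{cleaned}"


def to_prometheus(lines: Iterable[str]) -> List[str]:
    """Convert bpftrace JSON lines into Prometheus text-format sample lines."""
    out: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "map":
            continue  # skip non-map records such as attached-probe notices
        for map_name, value in record["data"].items():
            name = _metric_name(map_name)
            if isinstance(value, dict):
                # keyed map: one sample per key, exposed as a label
                for key, count in value.items():
                    if isinstance(count, (int, float)):
                        escaped = str(key).replace("\\", "\\\\").replace('"', '\\"')
                        out.append(f'{name}{{key="{escaped}"}} {count}')
            elif isinstance(value, (int, float)):
                out.append(f"{name} {value}")
    return out
```

Writing the returned lines to a `.prom` file in the directory watched by the node exporter's textfile collector is the step the slide describes as "put the output as-is into the Textfile exporter's directory".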

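  45. CGroup v2 x eBPF
    •cgroup-specific BPF helper functions tell you which cgroup the executing thread belongs to: BPF_FUNC_get_current_cgroup_id and others
    •They require a very recent kernel... but they are convenient
    •Per-container tracing, such as which files get opened, becomes easy
    •e.g. snooping the files that an Apache HTTPD container opens on each request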
  46. Demo (2): no time for this one, so please come and talk to me directly!
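
On the userland side, a per-container filter needs to know which cgroup ID to match. The Python sketch below rests on two assumptions that should be checked on the target system: that the cgroup v2 entry in /proc/<pid>/cgroup is the line starting with `0::`, and that the ID reported by BPF_FUNC_get_current_cgroup_id corresponds to the inode number of the cgroup's directory under the cgroup v2 mount. Function names are made up for this example.

```python
import os
from typing import Optional


def cgroup_v2_path(proc_cgroup_text: str) -> Optional[str]:
    """Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # ignore malformed lines
        hierarchy_id, controllers, path = parts
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def cgroup_id(cgroup_dir: str) -> int:
    """Return the inode number of a cgroup directory (assumed to equal its cgroup ID)."""
    return os.stat(cgroup_dir).st_ino
```

A tracer could then compare the ID seen in each event against `cgroup_id("/sys/fs/cgroup" + cgroup_v2_path(text))`, where the mount point `/sys/fs/cgroup` is itself an assumption about the host.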

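  47. Tracee
    •A container tracer implementation that uses eBPF throughout
    •Internally it does things such as resolving PID → namespace
    •bpftrace/BCC are general-purpose, so the hope here is for container-specialized features
    •https://blog.aquasec.com/ebpf-tracing-containers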

  48. Conclusion

  49. Happy publishing!

  50. We’re moving to cgroup v2
    •Moby's pull request for cgroup v2 support (WIP)
    •systemd making v2 the default (from 243)
  51. What is new in cgroup v2 (Reprise)
    •Unified Hierarchy
    •CGroup-aware OOM Killer
    •nsdelegate and better cgroup namespace
    •PSI - Pressure Stall Information
    •BPF helper for cgroup v2 (such as BPF_FUNC_get_current_cgroup_id, ...)
  52. It should be “per-container”
    Host-wide:
    •Load average
    •Memory usage
    •psutils, top, vmstat...
    •netstat, iostat
    •syslog, auditd
    •perf
    Per-Container:
    •Cgroup stat
    •PSI (especially)
    •eBPF (per container)
    •USDT, syscalls...
    •sysdig/falco
    •perf --cgroup
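
As one concrete per-container signal from the list above, PSI values can be read from a cgroup v2 directory. This is a minimal Python sketch that parses the PSI text format (`some`/`full` lines with `avg10`, `avg60`, `avg300`, and `total` fields). The function name is invented here, and the example file location (`memory.pressure` inside the container's cgroup directory) is an assumption about how the host is set up.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI output into {'some': {...}, 'full': {...}} with float values."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # each remaining field looks like 'avg10=0.12'
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result
```

Reading `memory.pressure` from a container's cgroup directory and passing its contents to `parse_psi` gives that container's own stall figures, whereas `/proc/pressure/memory` reports the host-wide ones.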
  53. Understand new features to use new tools in a better way