Tracing the Containers (mainly about eBPF)

KONDO Uchio
November 28, 2019

Presented @ CNDK 2019

Transcript

  1. audit, falco, ... and eBPF!
    Uchio Kondo @ GMO Pepabo, Inc.

    #CNDK2019
    Tracing the Containers
    Image from pixabay: https://pixabay.com/images/id-984050/

  2. Señor-Principal Engineer @ GMO Pepabo, Inc.
    Uchio Kondo
    https://blog.udzura.jp/
    @udzura
    Technical department, Dev Productivity/R&D Team
    Chair on CNDJ at Fukuoka, 2019.04
    Systems programmer wannabe
    Duolingo freak (Emerald League)

  3. JapanContainerDays 2018.12
    •CRIU

  4. CNDF 2019 Spring

  5. CNDT 2019 summer
    •cgroup v2 & PSI

  6. Interested in:
    •Container features in Linux Kernel (namespace, cgroup, capability, ...)
    •System calls
    •Kernel programming interfaces
    •eBPF (<= New!!)
    •Favorite struct: struct task_struct

  7. ToC
    •Rough overview of Container tracing (5m~)
    •Introduction to eBPF
    •Comparison to existing tracers
    •Kernel events (~ 5m)
    •Use cases with some DEMO (~ 10m)

  8. Tracing
    Your containers

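  9. Why tracing?
    •Tracing serves the following purposes
    •Logging: understand what is happening inside a complex application
    •Audit/security: emitting the necessary trace logs makes it possible to investigate after an unexpected incident; it can also help detect unauthorized access
    •Debugging/performance: dig into what plain application logs cannot tell you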

  10. What to trace?
    •Kubernetes/API
    •Host Linux
    •Per-container apps
    •(Networking)

  11. Kubernetes audit - orchestrator

  12. Falco / sysdig - host, containers

  13. Falco as an audit tool
    •Rule-based auditing of many kinds of activity
    •File operations, processes, syslog...
    •ref: Wazuh/OSSec https://wazuh.com/
    •Audit rules specialized for containers
    •trusted_images, falco_sensitive_mount_images, ...
    https://github.com/falcosecurity/falco/blob/dev/rules/falco_rules.yaml

  14. Falco internal
    •The main source of the audited information is a kernel module
    •sysdig(~0.6), falco-probe(0.6~)
    •> The kernel modules are actually built from the same source code
    •eBPF can now also be used internally
    • https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/

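  15. “Berkeley Packet Filter”
    •Originally a paper on a packet-filtering technique (classic BPF, 1993)
    •Put to work as the internals of tcpdump
    •Beyond packet filtering: also came to be used by seccomp
    •Major changes starting with Linux 3.14 (2014) brought it close to its current form (extended BPF)
    "Introduction to Berkeley Packet Filter (BPF) (1)"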
    https://www.atmarkit.co.jp/ait/articles/1811/21/news010.html
    http://www.tcpdump.org/papers/bpf-usenix93.pdf

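  16. eBPF overview
    •Build BPF bytecode (there are various ways to build it)
    •The kernel verifies it and JIT-compiles it as needed
    •The program collects events inside the kernel
    •An in-kernel aggregation store called a BPF map is available (very fast)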
    From: https://www.atmarkit.co.jp/ait/articles/1811/21/news010_2.html

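  17. Tools
    •bpftrace(8) - a general-purpose tracer that uses eBPF internally
    •Traces are described in scripts that closely resemble the DTrace language
    •BCC - a library for building programs that wrap eBPF features
    •Python, Lua, C++
    •Ruby implementation - RbBCC (written by the speaker)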

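  18. Existing Linux tracers
    Tool | Ability | Key system call | Invasiveness
    gdb | Step execution of programs; stopping on signals, etc. | ptrace(2) | Large
    strace | Tracing system calls | ptrace(2) | Large
    perf | Aggregating and visualizing performance counters, etc. | perf_event_open(2) | Medium
    bpftrace/BCC | Aggregating and visualizing all kinds of kernel events | bpf(2) | Smaller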

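  19. Comparison to gdb/strace
    •For both gdb and strace, the key system call is ptrace(2)
    •By design, the traced program has to be stopped
    •Precisely because it is stopped, operations that reach deeper into the program's behavior, such as updating registers, are also possible
    "Introduction to the ptrace system call"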
    https://itchyny.hatenablog.com/entry/2017/07/31/090000

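  20. Comparison to perf
    •perf can collect much of the same information that eBPF can, such as tracepoints
    •However, its aggregation has non-negligible overhead (the "observer effect"): for example, calling perf_event_open(2) per probe and aggregating in userland
    •eBPF can filter and aggregate (eBPF map) inside the kernel, which makes it close to DTrace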

  21. eBPF and
    Kernel events

  22. eBPF event source
    http://www.brendangregg.com/blog/2019-07-15/bpf-performance-tools-book.html

  23. Important source for tracing
    •perf, ftrace, and eBPF all use the same event sources
    "How perf and ftrace work" http://mmi.hatenablog.com/entry/2018/03/04/052249

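  24. tracepoint
    •The Linux kernel has built-in hook points for tracing the various events that occur inside it
    •These are called tracepoints. Example: the events emitted when the kernel's random number facility is used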
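
    To make "built-in hook points" concrete: tracefs lists every tracepoint as a `subsystem:event` line in its `available_events` file. The sketch below groups such lines by subsystem. The mount path and the sample event names are assumptions for illustration, not captured output; the names available depend on the kernel version.

```python
from collections import defaultdict

# Assumed tracefs location (reachable through debugfs); adjust for your system.
TRACEFS_EVENTS = "/sys/kernel/debug/tracing/available_events"


def group_tracepoints(text):
    """Group 'subsystem:event' lines by subsystem name."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        subsystem, event = line.split(":", 1)
        groups[subsystem].append(event)
    return dict(groups)


# Illustrative sample in the available_events format (hypothetical content).
sample = (
    "random:urandom_read\n"
    "random:get_random_bytes\n"
    "syscalls:sys_enter_openat\n"
)
groups = group_tracepoints(sample)
print(groups["random"])
```

    On a real host, `group_tracepoints(open(TRACEFS_EVENTS).read())` would do the same for the live list; reading that file usually requires root.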

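  25. kprobe
    •tracepoints can basically only trace the hook points that kernel developers prepared in advance
    •To trace calls to a specific kernel function of your own choosing, use kprobe. Note that the available functions differ across kernel versions and architectures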

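  26. uprobe
    •The behavior of user-space programs can be followed from the kernel side
    •uprobe events have to be registered per binary (more precisely, reportedly per inode of the executable)
    •For example, register a function that is visible in the binary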

  27. USDT
    •User Statically Defined Tracepoint
    •Probes can be placed at arbitrary points in a user program and used with little overhead (internally they appear to become uprobes)

  28. Others
    •Hardware and software counters like the ones perf uses can also be handled from eBPF
    •According to the bpftrace manual, there are a hardware provider, a software provider, and a memory watchpoint provider

  29. “Raw” usage of tracefs
    •Kernel tracing is possible through tracefs even without eBPF (it shows the same things that are visible through debugfs, but exposes a more limited feature set)
    "Kernel tracing for myself, part 1"
    https://udzura.hatenablog.jp/entry/2019/09/02/174801
    echo "p:myprobe1 $sym" >> \
    /sys/kernel/debug/tracing/kprobe_events
    "Preparing for in-container debugging with ftrace"
    https://speakerdeck.com/kentatada/container-debug-using-ftrace
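
    The `echo` line above follows the kprobe event syntax of tracefs: `p:` defines an entry probe, `r:` a return probe, and `-:` removes an event. A minimal sketch of building those definition strings follows. The helper names are hypothetical, and the default group `kprobes` is spelled out explicitly where the shell example leaves it implicit.

```python
# Path taken from the shell example on this slide.
KPROBE_EVENTS = "/sys/kernel/debug/tracing/kprobe_events"


def kprobe_definition(name, symbol, group="kprobes", is_return=False):
    """Build a definition line such as 'p:kprobes/myprobe1 copy_net_ns'."""
    kind = "r" if is_return else "p"
    return f"{kind}:{group}/{name} {symbol}"


def removal_definition(name, group="kprobes"):
    """Build the line that removes a previously registered event."""
    return f"-:{group}/{name}"


def register(definition, path=KPROBE_EVENTS):
    """Append a definition to kprobe_events (needs root); mirrors the shell '>>'."""
    with open(path, "a") as f:
        f.write(definition + "\n")


print(kprobe_definition("myprobe1", "copy_net_ns"))
print(removal_definition("myprobe1"))
```

    Only the string builders run without privileges; `register()` writes to tracefs and is meant to be called as root on a host where the symbol exists.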

  30. Trace of ping calling connect

  31. OK,
    what is good
    with containers?

  32. eBPF use case
    •Debugging HOST Linux itself
    •Syscalls or kernel functions around containers
    •Runtime performance
    •bpftrace result to Prometheus for monitoring
    •Tracing events per container
    •Cgroup v2 with eBPF
    •Tracee by Aqua Security

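  33. Tracing kernel on containers
    •Containers rely on many kernel features, and eBPF lets you debug and measure those kernel features themselves
    •For example: `ip netns add/del`
    •Internally these call the kernel functions copy_net_ns/cleanup_net
    •Depending on the kernel version, these functions in turn take locks internally, so you may want to examine the performance impact → use eBPF!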
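
    One way to measure such a function is a kprobe/kretprobe pair that records the entry time and builds a latency histogram in a BPF map. The sketch below uses BCC's Python bindings and the same entry/return timing pattern as BCC's funclatency tool. It is an illustration under assumptions, not the script used in the talk: it needs root, the `bcc` package, and a kernel where `copy_net_ns` can be probed.

```python
# BPF C program, compiled by BCC when run() is called.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);   // thread id -> entry timestamp (ns)
BPF_HISTOGRAM(dist);         // log2 histogram of latency (us)

int trace_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0) {
        return 0;   // entry was not observed
    }
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&tid);
    return 0;
}
"""


def run(function="copy_net_ns"):
    """Attach to `function`, wait for Ctrl-C, then print the histogram."""
    import time
    from bcc import BPF  # imported lazily: requires bcc and root

    b = BPF(text=BPF_PROGRAM)
    b.attach_kprobe(event=function, fn_name="trace_entry")
    b.attach_kretprobe(event=function, fn_name="trace_return")
    print(f"Tracing {function}()... hit Ctrl-C to stop")
    try:
        time.sleep(99999999)
    except KeyboardInterrupt:
        pass
    b["dist"].print_log2_hist("usecs")
```

    Calling `run()` as root and then running `ip netns add test` in another terminal should add samples to the histogram; `run("cleanup_net")` covers the deletion path.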

  34. Reference
    •"Linux Kernel: observing the hung state caused by holding rtnl_mutex for a long time"
    •https://hiboma.hatenadiary.jp/entry/2019/10/29/123455
    •(As an aside, thanks to hiboma I learned how to use /proc/$pid/stack and wchan)

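  35. Tracing Runtime
    •Measured the following (using Haconiwa, the speaker's own container runtime)
    •Time from container runtime startup until it calls execve
    •Time from container runtime startup until the container starts to listen
    •A combination of USDT and tracepoints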

  36. bpftrace script

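  37. bpftrace → Prometheus
    •Wrote a tool called bt2prom
    •It converts the JSON output of bpftrace into a Prometheus-compatible format
    •Drop the result as-is into the Textfile exporter's directory and it can be plotted
    •Intended to be run periodically from cron or similar (think of sar)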
    “Format bpftrace JSON into prometheus-compat textfile”
    https://github.com/udzura/mruby-bin-bt2prom
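
    The conversion idea can be sketched in a few lines. This is not the bt2prom implementation (which is written in mruby). It assumes bpftrace's JSON mode emits one object per line and that maps arrive as `{"type": "map", "data": {...}}`; the `bpftrace_` prefix and the `key` label name are hypothetical choices.

```python
import json
import re


def metric_name(map_name, prefix="bpftrace"):
    """Turn a bpftrace map name such as '@reads' into a valid metric name."""
    base = re.sub(r"[^a-zA-Z0-9_]", "_", map_name.lstrip("@")) or "value"
    return f"{prefix}_{base}"


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(json_lines, prefix="bpftrace"):
    """Convert line-delimited bpftrace JSON into Prometheus text format."""
    out = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "map":
            continue  # skip non-map records such as attached_probes
        for map_name, value in record["data"].items():
            name = metric_name(map_name, prefix)
            out.append(f"# TYPE {name} gauge")
            if isinstance(value, dict):
                for key, v in value.items():
                    out.append(f'{name}{{key="{escape_label(key)}"}} {v}')
            else:
                out.append(f"{name} {value}")
    return "\n".join(out) + "\n"


# Illustrative input (hypothetical values).
sample = (
    '{"type": "attached_probes", "data": {"probes": 1}}\n'
    '{"type": "map", "data": {"@reads": {"nginx": 12, "sshd": 3}}}\n'
)
text = to_prometheus(sample)
print(text, end="")
```

    Writing `text` to a `.prom` file in the Textfile exporter's directory from a cron job would match the workflow described on this slide.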

  38. A rough example of tracking vfs_read

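  39. CGroup v2 x eBPF
    •BPF helpers dedicated to cgroups tell you which cgroup the executing thread belongs to: BPF_FUNC_get_current_cgroup_id and others
    •They require a very recent kernel... but they are convenient
    •They make it easy to trace, per container, things such as which files get opened
    •e.g. snooping the files that an Apache HTTPD container opens on each request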
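
    To filter by container, a tracer needs the numeric cgroup ID to compare against the helper's return value. A minimal userland sketch follows. It assumes a cgroup v2 mount at `/sys/fs/cgroup` and that the ID matches the inode number of the cgroup directory, which is how it is commonly looked up but may not hold on every kernel. The sample `/proc/<pid>/cgroup` content is illustrative.

```python
import os

CGROUP2_MOUNT = "/sys/fs/cgroup"  # assumed cgroup v2 mount point


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup text, or None."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The v2 (unified) entry has hierarchy ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def cgroup_id(cgroup_dir):
    """Inode number of a cgroup directory (assumed to equal the cgroup ID)."""
    return os.stat(cgroup_dir).st_ino


# Illustrative /proc/<pid>/cgroup content on a hybrid setup (hypothetical).
sample = "12:pids:/docker/abc\n0::/system.slice/docker-abc.scope\n"
path = cgroup_v2_path(sample)
print(path)
```

    On a real host, `cgroup_id(CGROUP2_MOUNT + path)` would give the value to match inside the BPF program.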

  40. Demo (2)
    No time left, so please come and talk to me directly!

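  41. Tracee
    •A container tracer implementation built entirely on eBPF
    •Internally resolves PID → Namespace, among other things
    •bpftrace/BCC are general-purpose, so the hope is for features specialized for containers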
    https://blog.aquasec.com/ebpf-tracing-containers

  42. Happy publishing!

  43. We’re moving to cgroup v2
    •Moby's P/R for cgroup v2 support (WIP)
    •systemd making v2 the default (from 243)

  44. What is new in cgroup v2 (Reprise)
    •Unified Hierarchy
    •CGroup-aware OOM Killer
    •nsdelegate and better cgroup namespace
    •PSI - Pressure Stall Information
    •BPF helper for cgroup v2 (such as BPF_FUNC_get_current_cgroup_id, ...)

  45. It should be “per-container”
    Host-wide:
    •Load Average
    •Memory usage
    •psutils, top, vmstat...
    •netstat, iostat
    •syslog, auditd
    •perf
    Per-Container:
    •Cgroup stat
    •PSI(especially)
    •eBPF (per container)
    •USDT, syscalls...
    •sysdig/falco
    •perf --cgroup
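
    As an illustration of per-container PSI: with cgroup v2, each cgroup directory has `cpu.pressure`, `memory.pressure`, and `io.pressure` files in the same format as `/proc/pressure/*`. The sketch below parses that format; the sample values are made up.

```python
def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        stats = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # 'total' is a cumulative stall time in microseconds; the rest are percentages.
            stats[key] = int(value) if key == "total" else float(value)
        result[kind] = stats
    return result


# Illustrative content of a <cgroup dir>/memory.pressure file (hypothetical values).
sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])
```

    Pointing the parser at a container's cgroup directory instead of `/proc/pressure` is what turns the host-wide number into a per-container one.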

  46. Understand new feature
    to use new tools
    in a better way
