Slide 1

Slide 1 text

[email protected] Linux System Monitoring with eBPF DevOpsDays Zurich, 2018-05-03 Heinrich Hartmann

Slide 2

Slide 2 text

System Monitoring is about Kernel & Hardware

Slide 3

Slide 3 text

Best Practice: The USE Method
https://www.circonus.com/2017/08/system-monitoring-with-the-use-dashboard
Resources: CPU, Memory, Network, Disks
Metrics: Utilization, Saturation, Errors

Slide 4

Slide 4 text

Best Practice: The USE Method
https://www.circonus.com/2017/08/system-monitoring-with-the-use-dashboard
Resources: CPU, Memory, Network, Disks
Metrics: Utilization, Saturation, Errors

Slide 5

Slide 5 text

Lots of Unknowns remaining
https://www.circonus.com/2017/08/system-monitoring-with-the-use-dashboard
Resources: CPU, Memory, Network, Disks
Metrics: Utilization, Saturation, Errors
Grid cells marked: ? ? ? ~ ~ ~

Slide 6

Slide 6 text

eBPF allows unparalleled insights
https://github.com/iovisor/bcc
Credits:
- Brendan Gregg @ Netflix (Sun)
- Sasha Goldshtein @ Sela, Microsoft
- Brenden Blanco @ VMware
- Linus Torvalds, et al.

Slide 7

Slide 7 text

eBPF allows unparalleled insights
https://github.com/iovisor/bcc
Credits:
- Brendan Gregg @ Netflix (Sun)
- Sasha Goldshtein @ Sela, Microsoft
- Brenden Blanco @ VMware
- Linus Torvalds, et al.

Slide 8

Slide 8 text

CPU: Scheduling Latency

Slide 9

Slide 9 text

Disk: Block-I/O Latency

Slide 10

Slide 10 text

Disk: Block-I/O Latency

Slide 11

Slide 11 text

Disk: Block-I/O Latency over time

Slide 12

Slide 12 text

Disk: Block-I/O Latency over time

Slide 13

Slide 13 text

Don’t shout in the Datacenter
Brendan Gregg (2008)
https://www.youtube.com/watch?v=tDacjrSCeq4

Slide 14

Slide 14 text

System Calls: The Kernel API
Monitor: Rate, Errors, Duration
System Call API

Slide 15

Slide 15 text

Syscalls: Rate / Count
sched_yield (2tn), clock_time (1.5tn), recvfrom (300bn)
394 Metrics
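A minimal sketch of how per-syscall counts and rates like those above could be collected with BCC (https://github.com/iovisor/bcc). This is not the code used in the talk. It assumes the bcc Python bindings are installed and that the script runs as root on a kernel with the raw_syscalls:sys_enter tracepoint. The names BPF_PROGRAM, rates, snapshot, and main are hypothetical.

```python
import time

# eBPF program (C), attached to the raw_syscalls:sys_enter tracepoint.
# It keeps one counter per syscall number in a kernel-side hash map.
BPF_PROGRAM = r"""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 nr = args->id;      // syscall number
    counts.increment(nr);   // bump the counter for this syscall
    return 0;
}
"""


def rates(prev, curr, seconds):
    """Turn two cumulative count snapshots into per-second rates."""
    return {nr: (n - prev.get(nr, 0)) / seconds for nr, n in curr.items()}


def snapshot(bpf):
    """Copy the kernel-side counters into a plain dict {syscall_nr: count}."""
    return {k.value: v.value for k, v in bpf["counts"].items()}


def main(interval=1.0, top_n=5):
    # Imported here so the helpers above can be used without bcc installed.
    from bcc import BPF
    from bcc.syscall import syscall_name

    bpf = BPF(text=BPF_PROGRAM)
    prev = snapshot(bpf)
    while True:
        time.sleep(interval)
        curr = snapshot(bpf)
        per_sec = rates(prev, curr, interval)
        busiest = sorted(per_sec.items(), key=lambda kv: -kv[1])[:top_n]
        for nr, rate in busiest:
            print(f"{syscall_name(nr).decode():<20} {rate:>12.0f}/s")
        print()
        prev = curr
```

Calling main() as root prints the busiest syscalls once per interval. Counting happens in the kernel, and user space only reads the aggregated map, which keeps the overhead per event small.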

Slide 16

Slide 16 text

Syscalls: Duration (marks: 1 us, 10 us)

Slide 17

Slide 17 text

Syscall durations span >8 orders of magnitude
1 s, 100 ms, 10 us
1.5 tn events total
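Durations that spread over this many orders of magnitude are usually summarized in a histogram with exponentially growing buckets rather than as an average. The sketch below uses power-of-two buckets, a scheme common in BCC tools. It is an assumption that this matches the talk, which may use a different bucketing. The names log2_bucket and histogram are hypothetical.

```python
from collections import Counter


def log2_bucket(duration_ns):
    """Index i of the bucket [2**i, 2**(i+1)) ns that contains duration_ns."""
    # Values below 1 ns are clamped into bucket 0.
    return max(duration_ns, 1).bit_length() - 1


def histogram(durations_ns):
    """Count how many durations fall into each power-of-two bucket."""
    return Counter(log2_bucket(d) for d in durations_ns)


# Example durations: 1 us, 10 us, 100 ms, 1 s (all in nanoseconds).
samples = [1_000, 10_000, 100_000_000, 1_000_000_000]
hist = histogram(samples)
for bucket in sorted(hist):
    print(f"[2^{bucket}, 2^{bucket + 1}) ns: {hist[bucket]}")
```

With this scheme, the range from 10 us to 1 s fits in fewer than 20 buckets, so a large number of events can be stored as a small, fixed set of counters.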

Slide 18

Slide 18 text

File System: Latency

Slide 19

Slide 19 text

Memory: Allocation Latency

Slide 20

Slide 20 text

Further Reading
Slides: @HeinrichHartman / #DevOpsDaysZH
Code: https://github.com/circonus-labs/nad/.../bccbpf
Blog: http://www.circonus.com/2018/05/linux-system-monitoring-with-ebpf/