Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Learning Linux @FB Debugging Content by Javier Honduvilla Coto javierhonduco@

Slide 3

Slide 3 text

- Traditional tools - The Kernel as an API - Userspace debugging - Kernel debugging - BPF - Performance troubleshooting - Additional resources Agenda

Slide 4

Slide 4 text

Traditional tools

Slide 5

Slide 5 text

uptime [javierhonduco@󰞵 ~] uptime 09:45:00 up 6 days, 3:25, 4 users, load average: 4.49, 3.64, 3.40

Slide 6

Slide 6 text

[h]top [javierhonduco@󰞵 ~] top Tasks: 540 total, 2 running, 530 sleeping, 1 stopped, 7 zombie %Cpu(s): 2.6 us, 3.9 sy, 0.0 ni, 93.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st MiB Mem : 57283.7 total, 4258.3 free, 22155.9 used, 30869.5 buff/cache MiB Swap: 25600.0 total, 16267.6 free, 9332.4 used. 34453.5 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 708530 root 20 0 456212 0 0 R 100.0 0.0 1540:20 oid.real 8419 root 20 0 1153836 46536 8620 S 7.6 0.1 447:36.76 dynolog 3842 smcproxy 20 0 3097676 402876 0 S 3.6 0.7 370:25.26 smc_proxy 3301357 scribe 20 0 2193228 123288 46192 S 2.6 0.2 0:37.18 scribed 43976 confige+ 20 0 2152296 152300 5876 S 2.3 0.3 118:53.75 configerator_pr 3246204 root 20 0 2314612 129148 20424 S 2.0 0.2 167:15.07 cppfbagentd 3263093 apache 20 0 10.5g 2.7g 355796 S 1.7 4.8 1:50.58 hphpi 942140 javierh+ 20 0 1546636 21188 3788 S 0.7 0.0 24:59.85 srclient 1793689 root 20 0 418848 5456 2364 S 0.7 0.0 16:31.23 rsyslogd 3444668 javierh+ 20 0 278308 5360 4132 R 0.7 0.0 0:00.05 top 3716658 root 20 0 1073152 13584 0 S 0.7 0.0 175:14.63 below 10 root 20 0 0 0 0 I 0.3 0.0 21:19.66 rcu_sched 4051 root 20 0 1131728 9444 0 S 0.3 0.0 7:30.31 fb_netnorad_res

Slide 7

Slide 7 text

iostat [javierhonduco@󰞵 ~] iostat 1 Linux 5.2.9-229_fbk15_hardened_4185_g357f49b36602 (devvm3554.lla0.facebook.com) 14/06/21 _x86_64_ (24 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 3.96 0.33 2.93 0.32 0.00 92.46 Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn vda 214.23 8945.63 9785.19 4749732297 5195506674

Slide 8

Slide 8 text

vmstat [javierhonduco@󰞵 ~] vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 4 0 9554280 924204 3152 34698948 1 2 375 410 1 2 4 3 92 0 0

Slide 9

Slide 9 text

lsof [javierhonduco@󰞵 ~] sudo lsof | head COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME systemd 1 root cwd DIR 0,28 414 256 / systemd 1 root rtd DIR 0,28 414 256 / systemd 1 root txt REG 0,28 1877208 76395834 /usr/lib/systemd/systemd systemd 1 root mem REG 0,27 76395834 /usr/lib/systemd/systemd (path dev=0,28) systemd 1 root mem REG 0,27 257637 /usr/lib64/libnl-3.so.200.26.0 (path dev=0,28) systemd 1 root mem REG 0,27 257645 /usr/lib64/libnl-route-3.so.200.26.0 (path dev=0,28) systemd 1 root mem REG 0,27 6307144 /usr/lib64/libibverbs.so.1.11.32.0 (path dev=0,28)

Slide 10

Slide 10 text

The Kernel as an API

Slide 11

Slide 11 text

• Virtual filesystems (procfs, sysfs) • System calls APIs

Slide 12

Slide 12 text

Procfs [javierhonduco@󰞵 ~] ls /proc/$$/ attr cwd limits mountstats personality smaps_rollup uid_map auxv environ loginuid net projid_map stack wchan cgroup exe make-it-fail ns root stat clear_refs fail-nth map_files numa_maps sched statm cmdline fd maps oom_adj schedstat status comm fdinfo mem oom_score sessionid syscall coredump_filter gid_map mountinfo oom_score_adj setgroups task cpuset io mounts pagemap smaps timerslack_ns

Slide 13

Slide 13 text

Procfs [javierhonduco@󰞵 ~] ls -lah /proc/$$/exe lrwxrwxrwx 1 javierhonduco users 0 May 26 03:35 /proc/3355607/exe -> /usr/bin/bash [javierhonduco@󰞵 ~] ls -l /proc/self/exe lrwxrwxrwx 1 javierhonduco users 0 May 26 03:52 /proc/self/exe -> /usr/bin/ls

Slide 14

Slide 14 text

Procfs [javierhonduco@󰞵 ~] cat /proc/self/status | head -3 Name: cat Umask: 0022 State: R (running)

Slide 15

Slide 15 text

Procfs [javierhonduco@󰞵 ~] cat /proc/self/comm cat

Slide 16

Slide 16 text

Procfs [javierhonduco@󰞵 ~] cat /proc/self/cmdline -u cat/proc/self/cmdline-u[javierhonduco@devvm3554.lla0 ~]

Slide 17

Slide 17 text

Procfs [javierhonduco@󰞵 ~] cat /proc/self/cmdline -u | xxd 00000000: 6361 7400 2f70 726f 632f 7365 6c66 2f63 cat./proc/self/c 00000010: 6d64 6c69 6e65 002d 7500 mdline.-u.

Slide 18

Slide 18 text

Procfs [javierhonduco@󰞵 ~] ls /proc/self/task 3398539

Slide 19

Slide 19 text

Procfs [javierhonduco@󰞵 ~] ls /proc/self/fd 0 1 2 3 [javierhonduco@󰞵 ~] ls -l /proc/self/fd/0 lrwx------ 1 javierhonduco users 64 May 26 03:51 /proc/self/fd/0 -> /dev/pts/16

Slide 20

Slide 20 text

Procfs [javierhonduco@󰞵 ~] cat /proc/$$/cgroup 0::/user.slice/user-134475.slice/session-22775.scope/init.scope

Slide 21

Slide 21 text

Userspace debugging

Slide 22

Slide 22 text

close(10) Kernel Performs actions Through libc* Schedules process Next instruction Kernel space User space System calls (Simplified diagram 😅)

Slide 23

Slide 23 text

Userspace – Syscalls tracing ● strace ● perf trace

Slide 24

Slide 24 text

Userspace – System calls

Slide 25

Slide 25 text

Userspace – System calls [javierhonduco@󰞵 ~/learning_linux] strace ./app execve("./app", ["./app"], 0x7ffd7e8a8600 /* 53 vars */) = 0 brk(NULL) = 0x602000 [...] openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\2405\2\0\0\0\0\0"..., [...] openat(AT_FDCWD, "/tmp/not_present_file", O_RDONLY) = -1 ENOENT (No such file or directory) [...] openat(AT_FDCWD, "/tmp/create_file", O_RDONLY|O_CREAT, 014140) = 3 write(1, "Got fd=3 around line=11\n", 24Got fd=3 around line=11 ) = 24 exit_group(24) = ? +++ exited with 24 +++

Slide 26

Slide 26 text

Userspace – System calls [javierhonduco@󰞵 ~/learning_linux] strace -e trace=openat ./app | tail -2 openat(AT_FDCWD, "/tmp/not_present_file", O_RDONLY) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/tmp/create_file", O_RDONLY|O_CREAT|O_TRUNC, 064140) = 3

Slide 27

Slide 27 text

Userspace – System calls [javierhonduco@󰞵 ~/learning_linux] strace -c ./app % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 92.14 0.000129 32 4 1 openat 7.86 0.000011 5 2 write 0.00 0.000000 0 2 read 0.00 0.000000 0 2 close 0.00 0.000000 0 3 fstat 0.00 0.000000 0 1 lseek 0.00 0.000000 0 5 mmap 0.00 0.000000 0 4 mprotect 0.00 0.000000 0 1 munmap 0.00 0.000000 0 4 brk 0.00 0.000000 0 1 1 access 0.00 0.000000 0 1 execve 0.00 0.000000 0 2 1 arch_prctl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000140 4 32 3 total

Slide 28

Slide 28 text

Userspace – System calls [javierhonduco@󰞵 ~/learning_linux] strace -e trace=/open.* -k ./app [...] openat(AT_FDCWD, "/tmp/not_present_file", O_RDONLY) = -1 ENOENT (No such file or directory) > /usr/lib64/libc-2.28.so(open+0x52) [0xed2e2] > /home/javierhonduco/learning_linux/app(main+0x1b) [0x4005f1] > /usr/lib64/libc-2.28.so(__libc_start_main+0xf2) [0x23492] > /home/javierhonduco/learning_linux/app(_start+0x2d) [0x40051d]

Slide 29

Slide 29 text

Strace’s performance ⚠ Expect ~10x to ~50x overhead!

Slide 30

Slide 30 text

Strace + EINTR

Slide 31

Slide 31 text

Perf trace [javierhonduco@󰞵 ~] perf trace ./app |& grep open 0.096 ( 0.011 ms): app/3620882 openat(dfd: CWD, filename: 0xb1ffce40, flags: RDONLY|CLOEXEC) = 3 0.223 ( 0.007 ms): app/3620882 openat(dfd: CWD, filename: 0xb1df28e3, flags: RDONLY|CLOEXEC) = 3 0.253 ( 0.007 ms): app/3620882 openat(dfd: CWD, filename: 0xb1ff84e0, flags: RDONLY|CLOEXEC) = 3 0.349 ( 0.007 ms): app/3620882 openat(dfd: CWD, filename: 0xb1ff89d0, flags: RDONLY|CLOEXEC) = 3 0.435 ( 0.006 ms): app/3620882 openat(dfd: CWD, filename: 0xb1ff8ec0, flags: RDONLY|CLOEXEC) = 3 0.775 ( 0.006 ms): app/3620882 openat(dfd: CWD, filename: 0x4006e8) = -1 ENOENT (No such file or directory) 0.869 ( 0.063 ms): app/3620882 openat(dfd: CWD, filename: 0x400718, flags: RDONLY|CREAT|TRUNC, mode: IFLNK|ISUID|IXUSR|IRGRP|0xb1dc0000) = 3

Slide 32

Slide 32 text

Userspace – Process debugging ● Procfs ● gdb ● quickstack

Slide 33

Slide 33 text

GDB [javierhonduco@󰞵 ~] gdb --args ./app [...] (gdb) b main Breakpoint 1 at 0x4005de: file app.c, line 7. (gdb) run Starting program: /home/javierhonduco/learning_linux/app [...] (gdb) bt #0 main () at app.c:7 (gdb) info locals fail_fd = 0 fd = 0

Slide 34

Slide 34 text

GDB – dump all stacks [javierhonduco@󰞵 ~] gdb --args ./app [...] (gdb) thread apply all bt Thread 1 (process 267065): #0 main () at app.c:7

Slide 35

Slide 35 text

GDB – printing [javierhonduco@󰞵 ~] gdb --args ./app 1 2 3 4 5 6 [...] (gdb) p $rdi $1 = 6 (gdb) p *((char **)$rdx) $2 = 0x7fffffffdd95 "SHELL=/bin/bash" (gdb) p *(char **)$rdx@6 $4 = { 0x7fffffffdd95 "SHELL=/bin/bash", 0x7fffffffdda5 "LESS=-n -i", 0x7fffffffddb0 "HISTCONTROL=ignoredups:erasedups", 0x7fffffffddd1 "PHPSH_CEILING_DIRECTORIES=/home", 0x7fffffffddf1 "TMUX=/tmp/tmux-134475/default,708016,0", 0x7fffffffde18 "HISTSIZE=" }

Slide 36

Slide 36 text

GDB – layouts: assembly view

Slide 37

Slide 37 text

GDB – invoke a function [javierhonduco@󰞵 ~] gdb --args ./app [...] (gdb) call (void)puts("hi") hi

Slide 38

Slide 38 text

GDB – cores [javierhonduco@󰞵 ~] gdb -c my_corefile [...] (gdb) bt

Slide 39

Slide 39 text

Understanding the Kernel

Slide 40

Slide 40 text

dmesg [javierhonduco@󰞵 ~] dmesg --human --color=always | tail [ +0.000003] BTRFS error (device loop0): bad tree block start, want 134725632 have 0 [ +0.004627] BTRFS error (device loop0): bad tree block start, want 134725632 have 0 [ +0.002381] BTRFS error (device loop0): bad tree block start, want 134725632 have 0 [ +0.002302] BTRFS error (device loop0): bad tree block start, want 134725632 have 0 [ +0.002299] BTRFS error (device loop0): bad tree block start, want 134725632 have 0 [ +5.034803] systemd[1]: Created slice system-timers-service_cert_strobelight_server.slice.

Slide 41

Slide 41 text

drgn [javierhonduco@󰞵 ~] sudo drgn [...] >>> sum(disk.gendisk.part0.nr_sects for disk in for_each_disk(prog)) (sector_t)999705952

Slide 42

Slide 42 text

drgn [javierhonduco@󰞵 ~] sudo drgn [...] >>> prog['jiffies'] (volatile unsigned long)4892481323 >>> prog['jiffies'] (volatile unsigned long)4892483099

Slide 43

Slide 43 text

drgn [javierhonduco@󰞵 ~] sudo drgn [...] >>> find_task(prog, 1) *(struct task_struct *)0xffff888e3ffe0000 = { .thread_info = (struct thread_info){ .flags = (unsigned long)0, .status = (u32)0, }, .state = (volatile long)1, .stack = (void *)0xffffc90006244000, .usage = (refcount_t){ .refs = (atomic_t){ .counter = (int)2, }, }, .flags = (unsigned int)4194560, .ptrace = (unsigned int)0,

Slide 44

Slide 44 text

BPF

Slide 45

Slide 45 text

Event - Function* called - Function* returned - ... BPF code Triggers BPF Userspace Loads Userspace’ Sends data

Slide 46

Slide 46 text

Bpftrace

Slide 47

Slide 47 text

Bpftrace – syntax probe / pid == 3 /{ printf("hi”); }

Slide 48

Slide 48 text

Bpf/trace – probe types ● Userspace ○ Uprobe ○ Uretprobe ○ USDTs ● Kernel ○ Kprobe ○ Kretprobe ○ Tracepoint ○ Timers (interval) ○ ...

Slide 49

Slide 49 text

Bpftrace – open tracing [javierhonduco@󰞵 ~] sudo bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }' Attaching 1 probe... fbagentc /sys/kernel/mm/transparent_hugepage/enabled smcc /proc/sys/vm/overcommit_memory smcc /sys/kernel/mm/transparent_hugepage/enabled smcc /proc/sys/vm/overcommit_memory smcc /sys/kernel/mm/transparent_hugepage/enabled smcc /proc/sys/vm/overcommit_memory smcc /sys/kernel/mm/transparent_hugepage/enabled ^C

Slide 50

Slide 50 text

Bpftrace – uprobes [javierhonduco@󰞵 ~] sudo bpftrace -e ‘uprobe:/proc/113/exe:addTwo { @ = count(); }' ^C @: 102

Slide 51

Slide 51 text

Bpftrace – aggregations - count() - sum() - avg() - min() - max() - hist() - lhist()

Slide 52

Slide 52 text

Bpftrace – builtins - pid - tid - ustack - ...

Slide 53

Slide 53 text

Bpftrace – functions - printf() - time() - signal() - ...

Slide 54

Slide 54 text

Performance debugging

Slide 55

Slide 55 text

• strace -fc/-t • bpftrace • Tools that use procfs and other VFS • perf Performance debugging

Slide 56

Slide 56 text

Perf – top

Slide 57

Slide 57 text

Perf – CPU profiling (record) [javierhonduco@󰞵 ~] perf record -g /bin/timeout 1 yes > /dev/null Using perf wrapper that supports hot-text. Try perf.real if you encounter any issues. [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.103 MB perf.data (1470 samples) ]

Slide 58

Slide 58 text

Perf – CPU profiling (report)

Slide 59

Slide 59 text

Perf – CPU profiling (flamegraphs)

Slide 60

Slide 60 text

Perf – CPU profiling (annotate)

Slide 61

Slide 61 text

Perf – stat [javierhonduco@󰞵 ~] sudo perf stat /bin/timeout 1 yes > /dev/null Using perf wrapper that supports hot-text. Try perf.real if you encounter any issues. Performance counter stats for '/bin/timeout 1 yes': 1,001.64 msec task-clock # 0.999 CPUs utilized 9 context-switches # 0.009 K/sec 0 cpu-migrations # 0.000 K/sec 184 page-faults # 0.184 K/sec cycles instructions branches branch-misses 1.002260238 seconds time elapsed 0.369696000 seconds user 0.632553000 seconds sys

Slide 62

Slide 62 text

Top-down analysis - Start with the “classic” tools when there’s an issue - Go deeper with the more advanced tools

Slide 63

Slide 63 text

- Linux APIs with loads of examples: The Linux Programming Interface - Drgn: https://drgn.readthedocs.io/ - Bpftrace reference guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md - Linux debugging zines: https://wizardzines.com/zines/debugging/ - Performance: all of http://www.brendangregg.com/ - Some man pages are very good, such as proc(5)’s - BCC tools / bpftrace tools Additional resources

Slide 64

Slide 64 text

No content