Slide 1

Slide 1 text

Ruby x BPF in Action Uchio Kondo from Mirrativ, Inc. @ RubyKaigi 2022.09.09

Slide 2

Slide 2 text

Uchio Kondo Infra & Streaming team @ Mirrativ, Inc. Speaker @ RubyKaigi 2016, 2018, 2019, 2021 “Hacker” Supporter @ Engineer’s Café Fukuoka Ruby & Rust enthusiast, Linux freak Lives in Fukuoka

Slide 3

Slide 3 text

(After all,) What can BPF do? §1

Slide 4

Slide 4 text

e.g. Visualize “SYN queue” From Cloudflare’s blog “SYN packet handling in the wild”: https://blog.cloudflare.com/syn-packet-handling-in-the-wild/

Slide 5

Slide 5 text

2 servers in different configs ● Same server (WEBrick 1.7.0, ruby 3.1.2) ● Same bench parameters ● Different values of net.core.somaxconn ○ 4,096 vs 500: what effect does this have?
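
Why the setting matters can be sketched in a few lines: on Linux the backlog a server passes to listen(2) is silently capped at net.core.somaxconn. The helper names below are hypothetical, and the requested backlog of 1024 is an arbitrary illustration value, not what WEBrick actually requests.

```ruby
# Sketch: how net.core.somaxconn caps the backlog a server asks for.
# `effective_backlog` and `current_somaxconn` are hypothetical helpers.
SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn" # Linux only

# Linux truncates the listen(2) backlog to net.core.somaxconn.
def effective_backlog(requested, somaxconn)
  [requested, somaxconn].min
end

# Read the live value when running on Linux; returns nil elsewhere.
def current_somaxconn(path = SOMAXCONN_PATH)
  File.readable?(path) ? File.read(path).to_i : nil
end

[500, 4096].each do |limit|
  puts "somaxconn=#{limit}: a requested backlog of 1024 becomes #{effective_backlog(1024, limit)}"
end

live = current_somaxconn
puts "this host: net.core.somaxconn = #{live}" if live
```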

Slide 6

Slide 6 text

Compare results somaxconn = 500 somaxconn = 4096

Slide 7

Slide 7 text

Tool to Visualize “SYN queue”

Slide 8

Slide 8 text

C to Visualize “SYN queue”

Slide 9

Slide 9 text

Let’s watch queue saturation somaxconn = 500 somaxconn = 4096

Slide 10

Slide 10 text

demo

Slide 11

Slide 11 text

Now, you’ve completely understood BPF! …OK. Let me keep going.

Slide 12

Slide 12 text

Quick Introduction to BPF §2

Slide 13

Slide 13 text

Big picture first https://whimsical.com/bpf-ABAvCvJFLcSie2ML9ee2fn

Slide 14

Slide 14 text

#1 History

Slide 15

Slide 15 text

#1 History https://www.tcpdump.org/papers/bpf-usenix93.pdf

Slide 16

Slide 16 text

#1 History https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8

Slide 17

Slide 17 text

#2 How it works

Slide 18

Slide 18 text

#2 How it works (*) Very simplified [Diagram. The Userland: Scripting → Bytecode, plus the User Interface. The Kingdom of Kernel: the BPF VM runs the bytecode, collecting kernel data, and hands it back through a BPF Map (… or perf buffer, etc.)] ref. https://speakerdeck.com/chikuwait/learn-ebpf?slide=17 by Yuki Nakata, 2020; emoji from https://github.com/twitter/twemoji/tree/master/assets

Slide 19

Slide 19 text

#2 How it works BTF requires: Kernel version >= 5.6 && CONFIG_DEBUG_INFO_BTF should be enabled
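
The two conditions on this slide can be checked from Ruby. This is a minimal sketch: the 5.6 threshold is taken from the slide as-is, the helper names are hypothetical, and it relies on the kernel exposing /sys/kernel/btf/vmlinux when CONFIG_DEBUG_INFO_BTF is enabled.

```ruby
# Sketch: check the two BTF prerequisites named on the slide.
# Helper names are hypothetical; the 5.6 threshold comes from the slide.
require "etc"
require "rubygems" # for Gem::Version

BTF_MIN_KERNEL = Gem::Version.new("5.6")
VMLINUX_BTF = "/sys/kernel/btf/vmlinux" # present when CONFIG_DEBUG_INFO_BTF=y

# "5.8.0-63-generic" -> Gem::Version "5.8.0"
def kernel_version(release)
  Gem::Version.new(release[/\A\d+(\.\d+)*/] || "0")
end

def kernel_new_enough?(release)
  kernel_version(release) >= BTF_MIN_KERNEL
end

def btf_exposed?(path = VMLINUX_BTF)
  File.exist?(path)
end

release = Etc.uname[:release]
puts "kernel #{release}: new enough? #{kernel_new_enough?(release)}"
puts "BTF exposed at #{VMLINUX_BTF}? #{btf_exposed?}"
```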

Slide 20

Slide 20 text

#3 Comparison

Slide 21

Slide 21 text

#3 Comparison

Slide 22

Slide 22 text

#3 Comparison

Slide 23

Slide 23 text

#4 (Expanding) Use cases…

Slide 24

Slide 24 text

#4 (Expanding) Use cases… ● BPF-based network & security for containers ● the de facto Kubernetes threat detection engine using BPF

Slide 25

Slide 25 text

#4 (Expanding) Use cases… rbperf (https://github.com/javierhonduco/rbperf) by Javier H. Coto with

Slide 26

Slide 26 text

RbBCC and BPF: Observe Everything §3

Slide 27

Slide 27 text

What is RbBCC? ● A: BCC for Ruby (libbcc FFI binding for Ruby) ● WHAT is BCC? ○ BPF Compiler Collection ○ An SDK for building BPF tools with scripting languages (Python and Lua are officially supported) ○ Ruby is not on the supported list, so I’m developing RbBCC. I’m going to show how to use it: how to write BPF code in Ruby.

Slide 28

Slide 28 text

RbBCC’s “4 Keys”

Slide 29

Slide 29 text

#1 kprobe ● kprobe - (mainly) function tracing in the Linux kernel ● e.g. __ARCH_sys_execve() ○ The function that does the actual work when execve(2) is invoked

Slide 30

Slide 30 text

Example RbBCC Program:

Slide 31

Slide 31 text

Example RbBCC Program: C Part: the tracing function that runs inside the kernel

Slide 32

Slide 32 text

Example RbBCC Program: Ruby Part: load the C part above, then get feedback and data from the BPF program inside the kernel

Slide 33

Slide 33 text

demo

Slide 34

Slide 34 text

#1 kprobe ● ex.2 tcp_v4_conn_request (the first demo)

Slide 35

Slide 35 text

Trace the connect ● Also has 2 parts: ○ BPF DSL in C ○ Load program & handle data in Ruby

Slide 36

Slide 36 text

Return to first demo

Slide 37

Slide 37 text

#2 tracepoint (for kernel) ● Not the same thing as Ruby’s TracePoint class ● A static entry point for tracing kernel events ● It won’t change in future versions of Linux ○ kprobe traces an exported kernel symbol, which may change between versions, so it can be unstable.

Slide 38

Slide 38 text

#2 tracepoint (for kernel) ● Tracing syscall invocations: ● raw_syscalls/sys_enter ● raw_syscalls/sys_exit
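
The bookkeeping behind a sys_enter/sys_exit tracer can be sketched in plain Ruby. This is only an illustration replayed on made-up events: in a real tracer the two hashes would be BPF maps updated inside the kernel, and `Event` and `summarize` are hypothetical names.

```ruby
# Sketch: pair sys_enter with sys_exit per thread and total the time spent
# in each syscall. The sample events are invented for illustration.
Event = Struct.new(:kind, :tid, :syscall, :ts_ns)

def summarize(events)
  started = {} # tid => [syscall, entry timestamp]
  totals = Hash.new { |h, k| h[k] = { calls: 0, total_ns: 0 } }
  events.each do |ev|
    case ev.kind
    when :sys_enter
      started[ev.tid] = [ev.syscall, ev.ts_ns]
    when :sys_exit
      syscall, entered = started.delete(ev.tid)
      next unless syscall # exit whose entry was never seen
      totals[syscall][:calls] += 1
      totals[syscall][:total_ns] += ev.ts_ns - entered
    end
  end
  totals
end

SAMPLE = [
  Event.new(:sys_enter, 100, :read, 1_000),
  Event.new(:sys_enter, 101, :write, 1_200),
  Event.new(:sys_exit, 100, :read, 1_900),
  Event.new(:sys_exit, 101, :write, 1_500),
  Event.new(:sys_enter, 100, :read, 2_000),
  Event.new(:sys_exit, 100, :read, 2_400),
]

summarize(SAMPLE).sort_by { |_, v| -v[:total_ns] }.each do |name, v|
  puts format("%-8s calls=%d total=%dns", name, v[:calls], v[:total_ns])
end
```

Keying the in-flight table by thread id is what lets interleaved syscalls from different threads be matched correctly.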

Slide 39

Slide 39 text

tracepoint demo ● Example output (compared with strace -w):

Slide 40

Slide 40 text

FYI: Performance side effect ● Example: tracing WEBrick (again): ○ ruby: ○ ab: ● Tracing command: ○ ruby: ○ strace:

Slide 41

Slide 41 text

FYI: Performance side effect w/ RbBCC w/ strace

Slide 42

Slide 42 text

#3 uprobe ● uprobe: A mechanism to attach to user space function calls

Slide 43

Slide 43 text

#3 uprobe ● To use uprobe (and USDT, later) with ease, build a special Ruby binary with a specific option:

Slide 44

Slide 44 text

#3 uprobe ● Tracing rb_str_new(const char *ptr, long len)

Slide 45

Slide 45 text

#3 uprobe ● Tracing rb_str_new(const char *ptr, long len)

Slide 46

Slide 46 text

#3 uprobe ● Collecting rb_str_new()’s: (function return timestamp - function entry timestamp) ● This represents the latency of a function call ● function entry = uprobe, function return = uretprobe
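
The latency calculation above, and the power-of-two histogram it feeds, can be sketched as follows. The timestamps are invented sample data in nanoseconds and the helper names are hypothetical. In a real tool the entry timestamp is stored by the uprobe and subtracted in the uretprobe.

```ruby
# Sketch: latency = return timestamp - entry timestamp, bucketed by power
# of two. Sample pairs are [entry_ns, return_ns]; the values are made up.

# Index k of the bucket covering 2**k ... 2**(k+1) - 1 nanoseconds.
def log2_bucket(ns)
  ns < 1 ? 0 : ns.bit_length - 1
end

def latency_histogram(pairs)
  pairs.each_with_object(Hash.new(0)) do |(entry_ns, return_ns), hist|
    hist[log2_bucket(return_ns - entry_ns)] += 1
  end
end

# Latencies: 300, 350, 900 and 2100 ns.
PAIRS = [[1_000, 1_300], [2_000, 2_350], [3_000, 3_900], [4_000, 6_100]]

latency_histogram(PAIRS).sort.each do |bucket, count|
  low = 2**bucket
  high = 2**(bucket + 1) - 1
  puts format("%6d -> %-6d : %s", low, high, "*" * count)
end
```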

Slide 47

Slide 47 text

#3 uprobe ● Example of rb_str_new()’s latency histogram: ruby -e ‘p “Hello”’ ruby --disable gems -e ‘p “Hello”’

Slide 48

Slide 48 text

#4 USDT ● USDT: Userspace Statically Defined Tracepoint ○ Probe points that the author of a program has embedded in advance ○ cf. uprobe traces real function calls dynamically ● USDT is to uprobe what tracepoint is to kprobe
              Dynamic   Static
Kernel space  kprobe    tracepoint
User space    uprobe    USDT

Slide 49

Slide 49 text

#4 USDT ● Ruby’s USDT (first for DTrace, but available via BPF in Linux) Japanese article: https://magazine.rubyist.net/articles/0041/0041-200Special-dtrace.html https://rubyreferences.github.io/rubyref/advanced/dtrace.html

Slide 50

Slide 50 text

#4 USDT ● Example: USDTs about GC: ○ usdt:./bin/ruby:ruby:gc__mark__begin ○ usdt:./bin/ruby:ruby:gc__mark__end ○ usdt:./bin/ruby:ruby:gc__sweep__begin ○ usdt:./bin/ruby:ruby:gc__sweep__end ● They can be used to trace GC latency: ○ (mark_end_time - mark_begin_time)

Slide 51

Slide 51 text

#4 USDT ● Example: Real-time tracing of RSS, GC mark and sweep statistics ○ Plumping up the Sinatra app process and visualizing it

Slide 52

Slide 52 text

USDT demo ● RSS pumped up and mark proc. took more time

Slide 53

Slide 53 text

Summary: ● BPF observability has 4 key tracing sources: ● RbBCC can access all four. Just use Ruby (and a little C). ● Use Ruby to trace Ruby.
              Dynamic   Static
Kernel space  kprobe    tracepoint
User space    uprobe    USDT

Slide 54

Slide 54 text

Observability in Action: Improve Gem’s Performance §4

Slide 55

Slide 55 text

Real World Tuning ● Well-Done Speedup Contest in RubyKaigi ● Theme: JSON parser Ruston ○ mainly implemented in … Rust (it’s a native gem) ○ Somewhat slow compared to the de facto json.rb

Slide 56

Slide 56 text

JSON vs Ruston (naive ver.)

Slide 57

Slide 57 text

JSON vs Ruston (naive ver.)

Slide 58

Slide 58 text

So, First command is…?

Slide 59

Slide 59 text

run perf ● perf is useful to grasp the overall bottleneck ● json’s flamegraph

Slide 60

Slide 60 text

run perf ● ruston’s flamegraph ● from_iter of Vec ● realloc ○ in vec’s grow

Slide 61

Slide 61 text

Let’s start tracing with BPF ● Tracing focused on specific functions: malloc/free this time (*) N is limited to 10,000 in solo measurement

Slide 62

Slide 62 text

Use “uprobe” on Rust ● Tracing focused on specific functions: malloc/free this time

Slide 63

Slide 63 text

Measure first, then let’s refine the code!

Slide 64

Slide 64 text

Point 1: Reduce iter()/String ● Reduce iterator methods in Lex#peek() ○ peek() is called many times during lexing…

Slide 65

Slide 65 text

Point 1: Reduce iter()/String ● This also reduces the use of String ○ Use &[u8] instead

Slide 66

Slide 66 text

Point 1: Reduce iter()/String ● Then measure! (*) N = 10,000
                malloc   calloc   free
Ruston Before   750197   22       753491
Ruston After    110197   22       113596
cf. C JSON      20206    10022    34142
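
Dividing the raw counts by N makes the table easier to read. A minimal sketch using only the malloc numbers from the slide; the helper names are hypothetical.

```ruby
# Sketch: turn the raw probe counts into per-parse figures.
# Counts are copied from the slide; N is the number of parse calls.
N = 10_000
MALLOC = {
  "Ruston Before" => 750_197,
  "Ruston After"  => 110_197,
  "C JSON"        => 20_206,
}

def per_parse(count, n = N)
  (count.to_f / n).round(1)
end

def reduction_percent(before, after)
  ((before - after) * 100.0 / before).round(1)
end

MALLOC.each do |name, count|
  puts format("%-14s %5.1f malloc calls per parse", name, per_parse(count))
end
removed = reduction_percent(MALLOC["Ruston Before"], MALLOC["Ruston After"])
puts "malloc calls removed: #{removed}%"
```

Read this way, the change takes Ruston from roughly 75 malloc calls per parse to roughly 11, against roughly 2 for the C implementation.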

Slide 67

Slide 67 text

Point 2: Reduce realloc ● On longer case:

Slide 68

Slide 68 text

Point 2: Reduce realloc ● Implement realloc tracer

Slide 69

Slide 69 text

Point 2: Reduce realloc ● Try to reduce realloc by allocating in advance ○ Specify capacity via Vec::with_capacity()
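
Why reserving capacity removes reallocations can be shown with a small simulation. This sketch assumes a growth policy of doubling when full with a minimum of 4 slots, which approximates a typical amortized-doubling vector. It is not Ruston's code, and `grow_events` is a hypothetical name.

```ruby
# Sketch: count buffer-growth events while pushing into a growable array.
# ASSUMPTION: capacity doubles when full, starting from at least 4 slots.
MIN_CAPACITY = 4

def grow_events(pushes, initial_capacity: 0)
  capacity = initial_capacity
  length = 0
  grows = 0
  pushes.times do
    if length == capacity
      # Buffer is full: grow it before storing the next element.
      capacity = [capacity * 2, MIN_CAPACITY].max
      grows += 1
    end
    length += 1
  end
  grows
end

[8, 100, 1_000].each do |n|
  puts format("%5d pushes: %2d grows on demand, %d with capacity reserved",
              n, grow_events(n), grow_events(n, initial_capacity: n))
end
```

The count grows only logarithmically with the number of pushes, so the saving per vector is small. It adds up when many short-lived vectors are built, as in a parser.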

Slide 70

Slide 70 text

Point 2: Reduce realloc ● Measure! … The effect seems limited. - To be continued -
(longer case)             realloc   elapsed (s)
Ruston w/o vec capacity   90002     0.080172
Ruston w/ vec capacity    40002     0.071188
cf. C JSON                2         0.052459
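
A rough reading of why the effect looks limited, using only the numbers in the table. Attributing the whole time difference to the avoided realloc calls is an assumption made for this back-of-the-envelope sketch, and the helper names are hypothetical.

```ruby
# Sketch: what the realloc table implies, using the slide's numbers.
# ASSUMPTION: the entire time saving comes from the avoided realloc calls.
WITHOUT_CAPACITY = { realloc: 90_002, elapsed_s: 0.080172 }
WITH_CAPACITY    = { realloc: 40_002, elapsed_s: 0.071188 }
C_JSON           = { realloc: 2,      elapsed_s: 0.052459 }

def avoided_reallocs(before, after)
  before[:realloc] - after[:realloc]
end

# Nanoseconds saved per avoided realloc call.
def ns_saved_per_realloc(before, after)
  saved_s = before[:elapsed_s] - after[:elapsed_s]
  (saved_s / avoided_reallocs(before, after) * 1e9).round
end

# Share of the gap to the target that the change closed, in percent.
def gap_closed_percent(before, after, target)
  closed = before[:elapsed_s] - after[:elapsed_s]
  gap = before[:elapsed_s] - target[:elapsed_s]
  (closed * 100.0 / gap).round(1)
end

puts "avoided realloc calls: #{avoided_reallocs(WITHOUT_CAPACITY, WITH_CAPACITY)}"
puts "saved per avoided call: ~#{ns_saved_per_realloc(WITHOUT_CAPACITY, WITH_CAPACITY)} ns"
puts "gap to C JSON closed: #{gap_closed_percent(WITHOUT_CAPACITY, WITH_CAPACITY, C_JSON)}%"
```

More than half of the realloc calls disappear, yet only about a third of the gap to C JSON closes, which is consistent with the slide's "limited" verdict.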

Slide 71

Slide 71 text

The result #2 ● Comparison before / after all changes; for case N = 50,000
                   user       system     total
Ruston Before      0.277292   0.000000   0.277292
Ruston After All   0.051765   0.000000   0.051765
cf. C JSON         0.054263   0.000000   0.054263

Slide 72

Slide 72 text

Lessons learned ● Existing tools are useful (e.g. perf, strace, gdb…) ● To understand a bottleneck in detail, writing a simple BPF tool is effective. ● uprobe is an entry point for x-raying the performance of native programs, e.g. C, C++ and Rust (also … Zig?) ● Just keep this in mind: measure, reproduce, measure.

Slide 73

Slide 73 text

Conclusion §5

Slide 74

Slide 74 text

BPF for Observability Gives Us Strong Power

Slide 75

Slide 75 text

Observability is Hard, but… Using Ruby, It’s Just Fun Programming

Slide 76

Slide 76 text

Observability is Your Best Friend!!

Slide 77

Slide 77 text

Enjoy and Measure it!

Slide 78

Slide 78 text

Acknowledgements: ● The Book “Linux Observability with BPF” ○ by David Calavera, Lorenzo Fontana ○ https://www.oreilly.com/library/view/linux-observability-with/9781492050193/ ● Brendan Gregg for his superb articles: ○ https://www.brendangregg.com/bpf-performance-tools-book.html ● Masashi Misono for his Japanese introduction to BPF ○ https://atmarkit.itmedia.co.jp/ait/articles/2004/09/news006.html

Slide 79

Slide 79 text

Acknowledgements: ● RbBCC received a Ruby Association Grant in 2019 ○ report: https://www.ruby.or.jp/ja/news/20200508 ○ Mentored by Koichi “ko1” Sasada (Cookpad, Inc.) ○ With advice from Ryosuke Matsumoto (Sakura Internet), Takao Shimayoshi and Yoshiaki Kasahara (Kyushu Univ.)

Slide 80

Slide 80 text

Environment for these slides: ● Ruby: ○ 3.1.2 with dtrace enabled ● Linux: ○ CPU: aarch64 ○ Ubuntu 20.04.1 with kernel 5.8.0-63-generic ● Other libraries and software: ○ BCC (libbcc): 0.18.0 built with LLVM 9 ○ strace: 5.5 (from package manager) ○ perf: 5.8.18 (from package manager) ● Code Examples