Slide 1

Slide 1 text

JBM: JVM Blocking Monitor

Slide 2

Slide 2 text

About me » Yuto Kawamura » Principal Software Engineer @ LINE » Manager of the IMF Team = Kafka Infrastructure Team » ❤ Performance Engineering

Slide 3

Slide 3 text

Apache Kafka( at LINE) » Distributed messaging (queueing) middleware » Used as company-wide messaging/distributed messaging infrastructure » Persistency by disks » Written in Scala, runs on JVM

Slide 4

Slide 4 text

Cluster scale Messages 1.09 trillion / day In/Out data flow 2.2PB / day Users 220+ services

Slide 5

Slide 5 text

A lot of performance problems » TCP࣮૷ʹىҼ͢Δେن໛KafkaΫϥελʔͷϦΫΤετ஗Ԇղܾ· ͰͷಓͷΓ » Reliability Engineering Behind The Most Trusted Kafka Platform - LINE Developer Day 2019

Slide 6

Slide 6 text

Apache Kafka heavily relies on disk for persistency

Slide 7

Slide 7 text

Thread "blocking" by I/O

Slide 8

Slide 8 text

It could be even more complex

Slide 9

Slide 9 text

"Thread blocking" often cascades

Slide 10

Slide 10 text

Investigating blocking application requires a lot of effort » Application-level debugging » JVM-level inspections » Kernel-level inspections (eBPF, SystemTap) » Look into every single detail

Slide 11

Slide 11 text

Measuring only the "blocked" thread is » Insufficient » Not enough to reveal the whole causal relationship

Slide 12

Slide 12 text

Can't we have a snapshot of what was happening, at the time of blocking?

Slide 13

Slide 13 text

Detecting those "blocking" behavior from JVM layer? » JVMTI agent? » while true; jstack $PID; sleep 1; done?

Slide 14

Slide 14 text

JVM thread states? Which thread is a thread at when it's waiting disk I/O to complete? » NEW » RUNNABLE » BLOCKING » WAITING » TIMED_WAITING » TERMINATED

Slide 15

Slide 15 text

JVM thread states? Which thread is a thread at when it's waiting disk I/O to complete? » NEW » RUNNABLE : "A thread executing in the Java virtual machine is in this state." » BLOCKING » WAITING » TIMED_WAITING » TERMINATED

Slide 16

Slide 16 text

It is not possible to detect all kinds of blocking from JVM layer » JVM has no knowledge of what's happening inside the kernel » From JVM perspective, a thread calling system call (maybe I/O) is RUNNABLE

Slide 17

Slide 17 text

Backtrace from kernel to JVM

Slide 18

Slide 18 text

Overview of the desired tool

Slide 19

Slide 19 text

Building blocks 1. General blocking detection 2. Backtrace of kernel space functions 3. Backtrace of JVM methods

Slide 20

Slide 20 text

General blocking detection

Slide 21

Slide 21 text

General blocking detection » When an application thread blocks, it is ALWAYS at some point of calling kernel functions » read/write I/O => calling system call » waiting on a lock (mutex) => calling system call » page fault => executing kernel's page fault handler » Definition of the "thread is blocking" » = The thread can't consume CPU time anymore until the upcoming event occurs

Slide 22

Slide 22 text

eBPF » extended Berkeley Packet Filter » https://ebpf.io/ » In short, programmable VM which runs inside the Linux kernel

Slide 23

Slide 23 text

finish_task_switch function » A function of linux kernel, which is called when tasks (threads) switched on a cpu » https://github.com/torvalds/linux/ blob/ 0d85b27b0cc6b5cf54567c5ad913a 247a71583ce/kernel/sched/ core.c#L5154

Slide 24

Slide 24 text

Blocking time detection

Slide 25

Slide 25 text

Blocking time detection

Slide 26

Slide 26 text

Backtrace of kernel space functions

Slide 27

Slide 27 text

Backtrace of kernel space functions » eBPF supports backtrace function 1 » Both kernel space, and user space by BPF_F_USER_STACK flag 1 https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

Slide 28

Slide 28 text

bpf_get_stackid 0: [0x ae31f741] finish_task_switch 1: [0x aebeb169] __sched_text_start 2: [0x aebeb725] schedule 3: [0x aebebb22] io_schedule 4: [0x ae4959c3] wait_on_page_bit 5: [0x ae493798] __filemap_fdatawait_range 6: [0x ae49967a] file_write_and_wait_range 7: [0x c02da50d] xfs_file_fsync [xfs] 8: [0x ae5976e8] do_fsync 9: [0x ae597773] __x64_sys_fdatasync 10: [0x ae2043ab] do_syscall_64 11: [0x aec000a9] entry_SYSCALL_64_after_hwframe 12: [0x7f32b1af1a4f] fdatasync@:0 13: [0x7f329b86fa17] [unknown symbol name] 14: [0x7f329b8c1cd4] [unknown symbol name]

Slide 29

Slide 29 text

Backtrace of JVM methods

Slide 30

Slide 30 text

How can we take backtrace of a specific thread, at arbitrary time? » jstack? » => It takes stacktrace of the whole JVM, requires safepoint. » Thread.currentThread().getStackTrace() » => The target thread is in "blocking" state when we want to take backtrace

Slide 31

Slide 31 text

Complexity of collecting intuitive user-space stack traces Multiple sources of frames: » C library (glibc, native library part of the JVM) » Native binary from JIT compiled Java classes » Java classes on an interpreter

Slide 32

Slide 32 text

Implement JVM (user-space) backtrace by myself? I don't think so.

Slide 33

Slide 33 text

async-profiler for the rescue » JVM profiler featuring AsyncGetCallTrace + perf_events » https://github.com/jvm-profiling-tools/async-profiler

Slide 34

Slide 34 text

async-profiler takes stacktraces

Slide 35

Slide 35 text

async-profiler takes stacktraces => We can rely on the async-profiler library to take reliable backtrace of user-space! » But how async-profiler does stacktrace sampling? » => When we run profiler.sh -e cpu ...

Slide 36

Slide 36 text

async-profiler cpu profiling

Slide 37

Slide 37 text

Collecting backtraces of threads

Slide 38

Slide 38 text

async-profiler limitations » Currently it supports "start recording and dump on stop" only » Supported events (sampling timing): cpu, alloc, lock, wall However we need: » To take backtrace at arbitrary timing (when a thread is "blocked") » Streaming of backtraces to log an event

Slide 39

Slide 39 text

Patching async-profiler Added -e none and -o stream options. https://github.com/kawamuray/async-profiler/commits/master Dump each event in format like: { "timestamp": 12345, "tid": 1212, "frames": [ { bci: 1234, methodId: 1234, symbol: "java/lang/net...." } ] }

Slide 40

Slide 40 text

Implementation

Slide 41

Slide 41 text

Ways to implement eBPF programs » bcc - https://github.com/iovisor/ bcc » libbpf - https://github.com/libbpf/ libbpf

Slide 42

Slide 42 text

Implement eBPF app w/ Rust - Aya » https://github.com/aya-rs/aya » Rust development framework for eBPF application » We can implement eBPF program itself in Rust

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Experiment

Slide 49

Slide 49 text

Sample "blocky" application public class SampleBlockingApplication { ReentrantLock lock = new ReentrantLock(); new Thread(() -> writer(lock)).start(); new Thread(() -> locker(lock)).start(); }

Slide 50

Slide 50 text

WRITER private static void writer(Path path, ReentrantLock lock) throws IOException, InterruptedException { byte[] data = new byte[CHUNK_SIZE]; try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) { while (true) { lock.lock(); try { channel.position(0); channel.write(ByteBu er.wrap(data)); channel.force(false); // fsync() } finally { lock.unlock(); } Thread.sleep(1000); } } }

Slide 51

Slide 51 text

LOCKER private static void locker(ReentrantLock lock) throws InterruptedException { while (true) { lock.lock(); Thread.sleep(1); lock.unlock(); Thread.sleep(1000); } }

Slide 52

Slide 52 text

Simulate "slow" disk access » DDI - Disk Delay Injection » https://github.com/kawamuray/ddi sudo ./ddi-setup.sh create /path/to/mount echo 2000 | sudo tee /sys/fs/ddi/7:0/write_delay # Delay write iops by 2 second

Slide 53

Slide 53 text

JBM: Running jbm -p TARGET_JVM_PID -m 1000000 # 1 sec It will: 1. Install async-profiler JVMTI on target JVM 2. Install eBPF program 3. Listen for events from eBPF, and combine received event with event stream from JVMTI agent 4. Output event reports

Slide 54

Slide 54 text

=== 2023-06-02 17:36:41 1685695001988 PID: 11946, TID: 11963 (WRITER), DURATION: 2501820 us Native Stack: (snip) 4: [0x aebebae9] io_schedule_timeout 5: [0x aebec21d] wait_for_completion_io_timeout 6: [0x ae68a742] submit_bio_wait 7: [0x ae694da9] blkdev_issue_flush 8: [0x c02da5a5] xfs_file_fsync [xfs] 9: [0x ae5976e8] do_fsync 10: [0x ae597773] __x64_sys_fdatasync 11: [0x ae2043ab] do_syscall_64 12: [0x aec000a9] entry_SYSCALL_64_after_hwframe 13: [0x7f99b5615a4f] fdatasync@:0 (snip) -------------------------------------------------------------------------------- JVM Stack (took: 2023-06-02 17:36:41): 0: [0x7f9930b998a0] fdatasync 1: [0x7f9930c21090] sun.nio.ch.FileDispatcherImpl.force0 2: [0x7f9930c21080] sun.nio.ch.FileDispatcherImpl.force 3: [0x7f9930c23170] sun.nio.ch.FileChannelImpl.force 4: [0x7f99ac2623d8] SampleBlockingApplication.writer (snip)

Slide 55

Slide 55 text

=== 2023-06-02 17:36:41 1685695001989 PID: 11946, TID: 11965 (LOCKER), DURATION: 10270708 us Native Stack: (snip) 3: [0x ae3a1943] futex_wait_queue_me 4: [0x ae3a36cf] futex_wait 5: [0x ae3a4b89] do_futex 6: [0x ae3a5245] __x64_sys_futex 7: [0x ae2043ab] do_syscall_64 8: [0x aec000a9] entry_SYSCALL_64_after_hwframe 9: [0x7f99b5ce145c] pthread_cond_wait@@GLIBC_2.3.2@:0 -------------------------------------------------------------------------------- JVM Stack (took: 2023-06-02 17:36:41): 0: [0x7f9930be9d30] __pthread_cond_wait 1: [0x7f9930c46b58] jdk.internal.misc.Unsafe.park 2: [0x7f9930c40b68] java.util.concurrent.locks.LockSupport.park 3: [0x7f9930c2b940] java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt 4: [0x7f9930c2b948] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued 5: [0x7f9930c2ba78] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire 6: [0x7f9930c40f00] java.util.concurrent.locks.ReentrantLock.lock 7: [0x7f99ac2623e0] SampleBlockingApplication.locker (snip)

Slide 56

Slide 56 text

=== 2022-04-03 09:54:00.832 PID: 16179, TID: 16922 (kafka-log-clean), DURATION: 250391556 us Native Stack: (snip) 3: [0x 843800fd] io_schedule_timeout+0xad 4: [0x 84380198] io_schedule+0x18 5: [0x 8437eb61] bit_wait_io+0x11 (snip) 13: [0x 83dec70c] do_read_fault.isra.63+0x4c 14: [0x 83df11ba] handle_pte_fault+0x22a 15: [0x 83df3ecd] handle_mm_fault+0x39d 16: [0x 84388653] __do_page_fault+0x213 17: [0x 84388975] do_page_fault+0x35 18: [0x 84384778] page_fault+0x28 19: [0x7fb2903cf70c] [unknown] -------------------------------------------------------------------------------- JVM Stack (took: 2022-04-03 09:54:00.847): 0: [0x7face08c0e10] SafepointBlob 1: [0x7face0885d98] java.nio.DirectByteBu er.getLong 2: [0x7face07b89e0] kafka.log.TimeIndex.timestamp 3: [0x7face07b8900] kafka.log.TimeIndex.parseEntry 4: [0x7face07b88f8] kafka.log.TimeIndex.parseEntry 5: [0x7face07b9200] kafka.log.AbstractIndex.indexSlotRangeFor (snip) 27: [0x7face07b1540] kafka.log.LogCleaner$CleanerThread.doWork 28: [0x7face07c2860] kafka.utils.ShutdownableThread.run

Slide 57

Slide 57 text

Achievement » Running on production Kafka clusters for 2+ years » No performance problem caused by overhead » Overhead is extremely low thanks to eBPF + on-demand sampling » Helped a lot of investigations

Slide 58

Slide 58 text

Takeaways » Unusually long blocking situation rarely happens, but leaves critical consequences for production applications when it occurs » Such rarely happen problems are difficult to investigate due to lack of clues, reproducability » Having consistent snapshot of stacktrace from kernel to JVM helps us to learn what has happened, exactly at the time

Slide 59

Slide 59 text

Takeaways » Building eBPF program is easy, and is super powerful » Combining existing tools enables you a new observations that couldn't be achieved until today

Slide 60

Slide 60 text

Wanna try JBM? It's an OSS :) https://github.com/kawamuray/jbm

Slide 61

Slide 61 text

End of presentation

Slide 62

Slide 62 text

Getting Java classes backtraces from eBPF » ucalls.py 4 » JVM supports USDT (User Statically Defined Traces) 5 » method-entry and method-return METHOD # CALLS TIME (us) java/io/Bu eredInputStream.getBufIfOpen 1 7.00 slowy/App.isSimplePrime 8970 8858.35 slowy/App.isDivisible 3228196 3076985.12 slowy/App.isPrime 8969 4841017.64 5 https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html 4 https://github.com/iovisor/bcc/blob/master/tools/lib/ucalls.py#L68

Slide 63

Slide 63 text

USDT probes » Requires -XX: +ExtendedDTraceProbes » Performance overhead? 6 6 https://twitter.com/brendangregg/status/932330965279502336

Slide 64

Slide 64 text

USDT probe based backtrace » Complex implementation » Store/Remove every method invocation in thread local storage » Handling of ping-pong style invocation: JavaMethod A -> Native Function A -> Java Method B » Must hook and record all method-entry, method-exit » We don't know when we'll need to get backtrace

Slide 65

Slide 65 text

Scaffolding » cargo generate https://github.com/aya-rs/aya-template » hello-project/ » hello-common - Shared definitions between userspace program and eBPF program » hello - Userspace program » hello-ebpf - eBPF program