Upgrade to Pro — share decks privately, control downloads, hide ads and more …

JBM: General JVM blocking monitor powered by eBPF and async-profiler

JBM: General JVM blocking monitor powered by eBPF and async-profiler

LINEのKafkaチームでは、1兆件/day以上のメッセージを取り扱う世界最大規模のKafkaクラスタを運用しています。
我々がこれまでに直面したパフォーマンスに関連した問題の内、最も典型的なのはJVMのあるスレッドが何かしらの理由によりブロックしてしまう、ということに起因しています。
時には、あるアプリケーションスレッドがI/Oの完了を待つ間ブロックしていた、というような単純なものですが、場合によっては、safepoint中に行われた意図しないI/Oや、それに関連したlock競合によりJVM全体がブロックされ、所謂STW(Stop-the-World)になってしまうケースもありました。
そのような問題を数多く経験した後、JBM - JVM Blocking Monitor (https://github.com/kawamuray/jbm) というツールを開発するに至りました。

JBMは、Linux eBPFや、Javaのパフォーマンスプロファイラーとして広く人気のあるasync-profilerを組み合わせることによって、JVMアプリケーションに対する"汎用的なブロッキングの検知とバックトレースの取得"を実現し、I/O、lock待ち、ページフォルト等に起因するJVMスレッドの長時間に渡るブロッキングに対して、JVMレイヤだけでなく、kernelレイヤのバックトレースも含めて取得することができるようにデザインされています。

このセッションでは、JBMがどのように動作するか、またその裏でeBPF programやasync-profilerがどのように実装され動作しているかを詳細に解説し、JBMをJVMアプリケーションモニタリングにどのように活用できるかを紹介します。

Yuto Kawamura
Principal Software Engineer, LINE
Yuto Kawamura is a Principal Software Engineer working at LINE.
Currently he is the manager of the Kafka team, which he founded and led until 2022.
He gave presentations at JJUG 2019 Fall (https://github.com/jjug-ccc/slides-articles-2019Fall#1330-1415), Kafka Summit SF 2018/2019, and LINE Developer Day, trying to help people crossing boundary of application layers for understanding how is your software system works from top to the bottom.

※この資料は以下イベントで発表した内容です。
https://sessionize.com/api/v2/y7inyq6y/view/GridSmart

LINE Developers

June 04, 2023
Tweet

More Decks by LINE Developers

Other Decks in Technology

Transcript

  1. About me » Yuto Kawamura » Principal Software Engineer @

    LINE » Manager of the IMF Team = Kafka Infrastructure Team » ❤ Performance Engineering
  2. Apache Kafka( at LINE) » Distributed messaging (queueing) middleware »

    Used as company-wide messaging/distributed messaging infrastructure » Persistency by disks » Written in Scala, runs on JVM
  3. Investigating blocking application requires a lot of effort » Application-level

    debugging » JVM-level inspections » Kernel-level inspections (eBPF, SystemTap) » Look into every single detail
  4. Measuring only the "blocked" thread is » Insufficient » Not

    enough to reveal the whole causal relationship
  5. Detecting those "blocking" behavior from JVM layer? » JVMTI agent?

    » while true; jstack $PID; sleep 1; done?
  6. JVM thread states? Which thread is a thread at when

    it's waiting disk I/O to complete? » NEW » RUNNABLE » BLOCKING » WAITING » TIMED_WAITING » TERMINATED
  7. JVM thread states? Which thread is a thread at when

    it's waiting disk I/O to complete? » NEW » RUNNABLE : "A thread executing in the Java virtual machine is in this state." » BLOCKING » WAITING » TIMED_WAITING » TERMINATED
  8. It is not possible to detect all kinds of blocking

    from JVM layer » JVM has no knowledge of what's happening inside the kernel » From JVM perspective, a thread calling system call (maybe I/O) is RUNNABLE
  9. Building blocks 1. General blocking detection 2. Backtrace of kernel

    space functions 3. Backtrace of JVM methods
  10. General blocking detection » When an application thread blocks, it

    is ALWAYS at some point of calling kernel functions » read/write I/O => calling system call » waiting on a lock (mutex) => calling system call » page fault => executing kernel's page fault handler » Definition of the "thread is blocking" » = The thread can't consume CPU time anymore until the upcoming event occurs
  11. eBPF » extended Berkeley Packet Filter » https://ebpf.io/ » In

    short, programmable VM which runs inside the Linux kernel
  12. finish_task_switch function » A function of linux kernel, which is

    called when tasks (threads) switched on a cpu » https://github.com/torvalds/linux/ blob/ 0d85b27b0cc6b5cf54567c5ad913a 247a71583ce/kernel/sched/ core.c#L5154
  13. Backtrace of kernel space functions » eBPF supports backtrace function

    1 » Both kernel space, and user space by BPF_F_USER_STACK flag 1 https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
  14. bpf_get_stackid 0: [0x ae31f741] finish_task_switch 1: [0x aebeb169] __sched_text_start 2:

    [0x aebeb725] schedule 3: [0x aebebb22] io_schedule 4: [0x ae4959c3] wait_on_page_bit 5: [0x ae493798] __filemap_fdatawait_range 6: [0x ae49967a] file_write_and_wait_range 7: [0x c02da50d] xfs_file_fsync [xfs] 8: [0x ae5976e8] do_fsync 9: [0x ae597773] __x64_sys_fdatasync 10: [0x ae2043ab] do_syscall_64 11: [0x aec000a9] entry_SYSCALL_64_after_hwframe 12: [0x7f32b1af1a4f] fdatasync@:0 13: [0x7f329b86fa17] [unknown symbol name] 14: [0x7f329b8c1cd4] [unknown symbol name]
  15. How can we take backtrace of a specific thread, at

    arbitrary time? » jstack? » => It takes stacktrace of the whole JVM, requires safepoint. » Thread.currentThread().getStackTrace() » => The target thread is in "blocking" state when we want to take backtrace
  16. Complexity of collecting intuitive user-space stack traces Multiple sources of

    frames: » C library (glibc, native library part of the JVM) » Native binary from JIT compiled Java classes » Java classes on an interpreter
  17. async-profiler for the rescue » JVM profiler featuring AsyncGetCallTrace +

    perf_events » https://github.com/jvm-profiling-tools/async-profiler
  18. async-profiler takes stacktraces => We can rely on the async-profiler

    library to take reliable backtrace of user-space! » But how async-profiler does stacktrace sampling? » => When we run profiler.sh -e cpu ...
  19. async-profiler limitations » Currently it supports "start recording and dump

    on stop" only » Supported events (sampling timing): cpu, alloc, lock, wall However we need: » To take backtrace at arbitrary timing (when a thread is "blocked") » Streaming of backtraces to log an event
  20. Patching async-profiler Added -e none and -o stream options. https://github.com/kawamuray/async-profiler/commits/master

    Dump each event in format like: { "timestamp": 12345, "tid": 1212, "frames": [ { bci: 1234, methodId: 1234, symbol: "java/lang/net...." } ] }
  21. Ways to implement eBPF programs » bcc - https://github.com/iovisor/ bcc

    » libbpf - https://github.com/libbpf/ libbpf
  22. Implement eBPF app w/ Rust - Aya » https://github.com/aya-rs/aya »

    Rust development framework for eBPF application » We can implement eBPF program itself in Rust
  23. Sample "blocky" application public class SampleBlockingApplication { ReentrantLock lock =

    new ReentrantLock(); new Thread(() -> writer(lock)).start(); new Thread(() -> locker(lock)).start(); }
  24. WRITER private static void writer(Path path, ReentrantLock lock) throws IOException,

    InterruptedException { byte[] data = new byte[CHUNK_SIZE]; try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) { while (true) { lock.lock(); try { channel.position(0); channel.write(ByteBu er.wrap(data)); channel.force(false); // fsync() } finally { lock.unlock(); } Thread.sleep(1000); } } }
  25. LOCKER private static void locker(ReentrantLock lock) throws InterruptedException { while

    (true) { lock.lock(); Thread.sleep(1); lock.unlock(); Thread.sleep(1000); } }
  26. Simulate "slow" disk access » DDI - Disk Delay Injection

    » https://github.com/kawamuray/ddi sudo ./ddi-setup.sh create /path/to/mount echo 2000 | sudo tee /sys/fs/ddi/7:0/write_delay # Delay write iops by 2 second
  27. JBM: Running jbm -p TARGET_JVM_PID -m 1000000 # 1 sec

    It will: 1. Install async-profiler JVMTI on target JVM 2. Install eBPF program 3. Listen for events from eBPF, and combine received event with event stream from JVMTI agent 4. Output event reports
  28. === 2023-06-02 17:36:41 1685695001988 PID: 11946, TID: 11963 (WRITER), DURATION:

    2501820 us Native Stack: (snip) 4: [0x aebebae9] io_schedule_timeout 5: [0x aebec21d] wait_for_completion_io_timeout 6: [0x ae68a742] submit_bio_wait 7: [0x ae694da9] blkdev_issue_flush 8: [0x c02da5a5] xfs_file_fsync [xfs] 9: [0x ae5976e8] do_fsync 10: [0x ae597773] __x64_sys_fdatasync 11: [0x ae2043ab] do_syscall_64 12: [0x aec000a9] entry_SYSCALL_64_after_hwframe 13: [0x7f99b5615a4f] fdatasync@:0 (snip) -------------------------------------------------------------------------------- JVM Stack (took: 2023-06-02 17:36:41): 0: [0x7f9930b998a0] fdatasync 1: [0x7f9930c21090] sun.nio.ch.FileDispatcherImpl.force0 2: [0x7f9930c21080] sun.nio.ch.FileDispatcherImpl.force 3: [0x7f9930c23170] sun.nio.ch.FileChannelImpl.force 4: [0x7f99ac2623d8] SampleBlockingApplication.writer (snip)
  29. === 2023-06-02 17:36:41 1685695001989 PID: 11946, TID: 11965 (LOCKER), DURATION:

    10270708 us Native Stack: (snip) 3: [0x ae3a1943] futex_wait_queue_me 4: [0x ae3a36cf] futex_wait 5: [0x ae3a4b89] do_futex 6: [0x ae3a5245] __x64_sys_futex 7: [0x ae2043ab] do_syscall_64 8: [0x aec000a9] entry_SYSCALL_64_after_hwframe 9: [0x7f99b5ce145c] pthread_cond_wait@@GLIBC_2.3.2@:0 -------------------------------------------------------------------------------- JVM Stack (took: 2023-06-02 17:36:41): 0: [0x7f9930be9d30] __pthread_cond_wait 1: [0x7f9930c46b58] jdk.internal.misc.Unsafe.park 2: [0x7f9930c40b68] java.util.concurrent.locks.LockSupport.park 3: [0x7f9930c2b940] java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt 4: [0x7f9930c2b948] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued 5: [0x7f9930c2ba78] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire 6: [0x7f9930c40f00] java.util.concurrent.locks.ReentrantLock.lock 7: [0x7f99ac2623e0] SampleBlockingApplication.locker (snip)
  30. === 2022-04-03 09:54:00.832 PID: 16179, TID: 16922 (kafka-log-clean), DURATION: 250391556

    us Native Stack: (snip) 3: [0x 843800fd] io_schedule_timeout+0xad 4: [0x 84380198] io_schedule+0x18 5: [0x 8437eb61] bit_wait_io+0x11 (snip) 13: [0x 83dec70c] do_read_fault.isra.63+0x4c 14: [0x 83df11ba] handle_pte_fault+0x22a 15: [0x 83df3ecd] handle_mm_fault+0x39d 16: [0x 84388653] __do_page_fault+0x213 17: [0x 84388975] do_page_fault+0x35 18: [0x 84384778] page_fault+0x28 19: [0x7fb2903cf70c] [unknown] -------------------------------------------------------------------------------- JVM Stack (took: 2022-04-03 09:54:00.847): 0: [0x7face08c0e10] SafepointBlob 1: [0x7face0885d98] java.nio.DirectByteBu er.getLong 2: [0x7face07b89e0] kafka.log.TimeIndex.timestamp 3: [0x7face07b8900] kafka.log.TimeIndex.parseEntry 4: [0x7face07b88f8] kafka.log.TimeIndex.parseEntry 5: [0x7face07b9200] kafka.log.AbstractIndex.indexSlotRangeFor (snip) 27: [0x7face07b1540] kafka.log.LogCleaner$CleanerThread.doWork 28: [0x7face07c2860] kafka.utils.ShutdownableThread.run
  31. Achievement » Running on production Kafka clusters for 2+ years

    » No performance problem caused by overhead » Overhead is extremely low thanks to eBPF + on-demand sampling » Helped a lot of investigations
  32. Takeaways » Unusually long blocking situation rarely happens, but leaves

    critical consequences for production applications when it occurs » Such rarely happen problems are difficult to investigate due to lack of clues, reproducability » Having consistent snapshot of stacktrace from kernel to JVM helps us to learn what has happened, exactly at the time
  33. Takeaways » Building eBPF program is easy, and is super

    powerful » Combining existing tools enables you a new observations that couldn't be achieved until today
  34. Getting Java classes backtraces from eBPF » ucalls.py 4 »

    JVM supports USDT (User Statically Defined Traces) 5 » method-entry and method-return METHOD # CALLS TIME (us) java/io/Bu eredInputStream.getBufIfOpen 1 7.00 slowy/App.isSimplePrime 8970 8858.35 slowy/App.isDivisible 3228196 3076985.12 slowy/App.isPrime 8969 4841017.64 5 https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html 4 https://github.com/iovisor/bcc/blob/master/tools/lib/ucalls.py#L68
  35. USDT probes » Requires -XX: +ExtendedDTraceProbes » Performance overhead? 6

    6 https://twitter.com/brendangregg/status/932330965279502336
  36. USDT probe based backtrace » Complex implementation » Store/Remove every

    method invocation in thread local storage » Handling of ping-pong style invocation: JavaMethod A -> Native Function A -> Java Method B » Must hook and record all method-entry, method-exit » We don't know when we'll need to get backtrace
  37. Scaffolding » cargo generate https://github.com/aya-rs/aya-template » hello-project/ » hello-common -

    Shared definitions between userspace program and eBPF program » hello - Userspace program » hello-ebpf - eBPF program