
JBM: General JVM blocking monitor powered by eBPF and async-profiler

LINE's Kafka team operates one of the world's largest Kafka clusters, handling more than one trillion messages per day.
Among the performance problems we have faced so far, the most typical ones stem from a JVM thread getting blocked for some reason.
Sometimes the cause is as simple as an application thread blocking while waiting for an I/O to complete, but in other cases unintended I/O performed during a safepoint, or related lock contention, blocked the entire JVM and caused a so-called STW (Stop-the-World) pause.
After experiencing many such problems, we developed JBM - JVM Blocking Monitor (https://github.com/kawamuray/jbm).

JBM combines Linux eBPF with async-profiler, a widely popular Java performance profiler, to realize "general blocking detection and backtrace collection" for JVM applications. It is designed to detect long blocking of JVM threads caused by I/O, lock waits, page faults, and so on, and to capture backtraces not only at the JVM layer but also at the kernel layer.

This session explains in detail how JBM works, how the underlying eBPF program and async-profiler are implemented and operate, and introduces how JBM can be used for JVM application monitoring.

Yuto Kawamura
Principal Software Engineer, LINE
Yuto Kawamura is a Principal Software Engineer working at LINE.
Currently he is the manager of the Kafka team, which he founded and led until 2022.
He has given presentations at JJUG 2019 Fall (https://github.com/jjug-ccc/slides-articles-2019Fall#1330-1415), Kafka Summit SF 2018/2019, and LINE Developer Day, helping people cross the boundaries of application layers to understand how their software systems work from top to bottom.

※ This deck was presented at the following event:
https://sessionize.com/api/v2/y7inyq6y/view/GridSmart

LINE Developers

June 04, 2023


Transcript

  1. JBM: JVM Blocking
    Monitor


  2. About me
    » Yuto Kawamura
    » Principal Software Engineer @ LINE
    » Manager of the IMF Team = Kafka Infrastructure Team
    » ❤ Performance Engineering


  3. Apache Kafka (at LINE)
    » Distributed messaging (queueing)
    middleware
    » Used as company-wide
    messaging/distributed messaging
    infrastructure
    » Persistency by disks
    » Written in Scala, runs on JVM


  4. Cluster scale
    Messages 1.09 trillion / day
    In/Out data flow 2.2PB / day
    Users 220+ services


  5. A lot of performance problems
    » The road to solving request delays in a large-scale Kafka cluster
    caused by the TCP implementation
    » Reliability Engineering Behind The Most Trusted Kafka Platform -
    LINE Developer Day 2019


  6. Apache Kafka heavily relies on disk
    for persistency


  7. Thread "blocking" by I/O


  8. It could be even more complex


  9. "Thread blocking" often cascades


  10. Investigating blocking application
    requires a lot of effort
    » Application-level debugging
    » JVM-level inspections
    » Kernel-level inspections (eBPF, SystemTap)
    » Look into every single detail


  11. Measuring only the "blocked"
    thread is
    » Insufficient
    » Not enough to reveal the whole causal relationship


  12. Can't we have a
    snapshot of what was
    happening, at the time
    of blocking?


  13. Detecting those "blocking"
    behavior from JVM layer?
    » JVMTI agent?
    » while true; do jstack $PID; sleep 1; done?


  14. JVM thread states?
    Which state is a thread in when it's waiting for disk I/O to complete?
    » NEW
    » RUNNABLE
    » BLOCKED
    » WAITING
    » TIMED_WAITING
    » TERMINATED


  15. JVM thread states?
    Which state is a thread in when it's waiting for disk I/O to complete?
    » NEW
    » RUNNABLE : "A thread executing in the Java virtual machine is in this state."
    » BLOCKED
    » WAITING
    » TIMED_WAITING
    » TERMINATED


  16. It is not possible to detect all kinds
    of blocking from the JVM layer
    » The JVM has no knowledge of what's happening inside the kernel
    » From the JVM's perspective, a thread calling a system call (maybe for I/O) is
    RUNNABLE
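This behavior is easy to verify. Below is a small sketch (the class name and the 500 ms delay are this example's own, not from the talk): a thread parked inside the accept(2) system call is reported as RUNNABLE by the JVM, just like a thread waiting for disk I/O would be.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class RunnableInSyscall {
    // Returns the JVM-reported state of a thread that is blocked in accept(2).
    static Thread.State stateWhileInAccept() throws Exception {
        ServerSocket server = new ServerSocket(0); // ephemeral port, nobody connects
        Thread t = new Thread(() -> {
            try {
                server.accept(); // blocks inside the kernel, waiting for a connection
            } catch (IOException ignored) {
                // closing the socket from outside unblocks accept() with an exception
            }
        });
        t.start();
        Thread.sleep(500);                  // give the thread time to enter the syscall
        Thread.State state = t.getState();  // sampled while blocked in the kernel
        server.close();                     // unblock accept() and let the thread exit
        t.join();
        return state;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stateWhileInAccept()); // prints RUNNABLE
    }
}
```

The same holds for a thread stuck in read(2) or fdatasync(2): from the JVM's point of view, it is simply executing.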


  17. Backtrace from kernel to JVM


  18. Overview of the desired
    tool


  19. Building blocks
    1. General blocking detection
    2. Backtrace of kernel space functions
    3. Backtrace of JVM methods


  20. General blocking
    detection


  21. General blocking detection
    » When an application thread blocks, it is ALWAYS in the middle of calling
    kernel functions
    » read/write I/O => calling system call
    » waiting on a lock (mutex) => calling system call
    » page fault => executing kernel's page fault handler
    » Definition of the "thread is blocking"
    » = The thread can't consume CPU time anymore until the upcoming
    event occurs


  22. eBPF
    » extended Berkeley Packet Filter
    » https://ebpf.io/
    » In short, a programmable VM which
    runs inside the Linux kernel


  23. finish_task_switch
    function
    » A Linux kernel function that is
    called when tasks (threads) are
    switched on a CPU
    » https://github.com/torvalds/linux/blob/0d85b27b0cc6b5cf54567c5ad913a247a71583ce/kernel/sched/core.c#L5154


  24. Blocking time detection


  25. Blocking time detection
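The detection shown on these slides can be modeled in a few lines. The following is a hypothetical user-space simulation (class name, method names, and threshold are invented here; JBM's real logic runs as an eBPF program hooked on finish_task_switch): record when a thread is switched off a CPU, and when it is switched back on, its elapsed off-CPU time is the blocking duration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified user-space model of the eBPF logic attached to finish_task_switch:
// when a task leaves the CPU, remember the timestamp; when it comes back on,
// the elapsed off-CPU time is how long it was blocked. Illustrative only.
public class BlockingDetector {
    private final Map<Integer, Long> switchedOutAt = new HashMap<>(); // tid -> timestamp (ns)
    private final long thresholdNs;

    BlockingDetector(long thresholdNs) { this.thresholdNs = thresholdNs; }

    // Called for each context switch: prevTid leaves the CPU, nextTid takes it.
    // Returns nextTid's blocking duration if it exceeded the threshold, else -1.
    long onTaskSwitch(int prevTid, int nextTid, long nowNs) {
        switchedOutAt.put(prevTid, nowNs);
        Long out = switchedOutAt.remove(nextTid);
        if (out != null && nowNs - out >= thresholdNs) {
            return nowNs - out; // report: this thread was blocked for this long
        }
        return -1;
    }

    public static void main(String[] args) {
        BlockingDetector detector = new BlockingDetector(1_000_000L); // report blocks >= 1 ms
        detector.onTaskSwitch(101, 102, 0L);                          // tid 101 leaves the CPU at t=0
        long blockedNs = detector.onTaskSwitch(102, 101, 5_000_000L); // tid 101 returns 5 ms later
        System.out.println("tid 101 blocked for " + blockedNs + " ns"); // 5000000 ns
    }
}
```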


  26. Backtrace of kernel
    space functions


  27. Backtrace of kernel
    space functions
    » eBPF provides stack trace helper functions 1
    » Both kernel space and user space
    (with the BPF_F_USER_STACK flag)
    1 https://man7.org/linux/man-pages/man7/bpf-helpers.7.html


  28. bpf_get_stackid
    0: [0xffffffffae31f741] finish_task_switch
    1: [0xffffffffaebeb169] __sched_text_start
    2: [0xffffffffaebeb725] schedule
    3: [0xffffffffaebebb22] io_schedule
    4: [0xffffffffae4959c3] wait_on_page_bit
    5: [0xffffffffae493798] __filemap_fdatawait_range
    6: [0xffffffffae49967a] file_write_and_wait_range
    7: [0xffffffffc02da50d] xfs_file_fsync [xfs]
    8: [0xffffffffae5976e8] do_fsync
    9: [0xffffffffae597773] __x64_sys_fdatasync
    10: [0xffffffffae2043ab] do_syscall_64
    11: [0xffffffffaec000a9] entry_SYSCALL_64_after_hwframe
    12: [0x7f32b1af1a4f] fdatasync@:0
    13: [0x7f329b86fa17] [unknown symbol name]
    14: [0x7f329b8c1cd4] [unknown symbol name]


  29. Backtrace of JVM
    methods


  30. How can we take a backtrace of a
    specific thread, at an arbitrary time?
    » jstack?
    » => It takes stack traces of the whole JVM, and requires a safepoint.
    » Thread.currentThread().getStackTrace()?
    » => The target thread is in a "blocking" state when we want to
    take the backtrace
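Note that Thread#getStackTrace can be called on another, blocked thread, but it is serviced by the JVM itself and returns Java frames only; native and kernel frames stay invisible, which is why a kernel-side backtrace is still needed. A small demo (class and method names are this example's own):

```java
public class SampleOtherThread {
    // Samples the stack of a thread that is blocked in Thread.sleep().
    // Only Java frames come back; nothing below the JVM is visible.
    static StackTraceElement[] sampleSleeper() throws InterruptedException {
        Thread t = new Thread(() -> {
            try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
        });
        t.start();
        Thread.sleep(200);                              // let the thread reach sleep()
        StackTraceElement[] frames = t.getStackTrace(); // Java frames only
        t.interrupt();                                  // wake the sleeper so it can exit
        t.join();
        return frames;
    }

    public static void main(String[] args) throws InterruptedException {
        for (StackTraceElement frame : sampleSleeper()) {
            System.out.println(frame); // e.g. java.lang.Thread.sleep(...), the lambda, ...
        }
    }
}
```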


  31. Complexity of collecting intuitive
    user-space stack traces
    Multiple sources of frames:
    » C library (glibc, the native library part of the JVM)
    » Native code from JIT-compiled Java classes
    » Java classes running on the interpreter


  32. Implement JVM (user-space)
    backtrace by myself?
    I don't think so.


  33. async-profiler to the rescue
    » JVM profiler featuring AsyncGetCallTrace + perf_events
    » https://github.com/jvm-profiling-tools/async-profiler


  34. async-profiler takes stacktraces


  35. async-profiler takes stacktraces
    => We can rely on the async-profiler library to take reliable backtrace
    of user-space!
    » But how does async-profiler do stacktrace sampling?
    » => When we run profiler.sh -e cpu ...


  36. async-profiler cpu profiling


  37. Collecting backtraces of threads


  38. async-profiler limitations
    » Currently it supports "start recording and dump on stop" only
    » Supported events (sampling timings): cpu, alloc, lock, wall
    However, we need:
    » To take a backtrace at an arbitrary time (when a thread is "blocked")
    » Streaming of backtraces to log each event


  39. Patching async-profiler
    Added -e none and -o stream options.
    https://github.com/kawamuray/async-profiler/commits/master
    Dump each event in a format like:
    { "timestamp": 12345,
      "tid": 1212,
      "frames": [
        { "bci": 1234, "methodId": 1234, "symbol": "java/lang/net...." }
      ]
    }


  40. Implementation


  41. Ways to implement
    eBPF programs
    » bcc - https://github.com/iovisor/bcc
    » libbpf - https://github.com/libbpf/libbpf


  42. Implement eBPF app
    w/ Rust - Aya
    » https://github.com/aya-rs/aya
    » Rust development framework for
    eBPF applications
    » We can implement the eBPF program
    itself in Rust


  43.–47. (image-only slides)

  48. Experiment


  49. Sample "blocky" application
    public class SampleBlockingApplication {
        public static void main(String[] args) {
            // exception handling omitted for brevity
            ReentrantLock lock = new ReentrantLock();
            new Thread(() -> writer(lock)).start();
            new Thread(() -> locker(lock)).start();
        }
    }


  50. WRITER
    private static void writer(Path path, ReentrantLock lock) throws IOException, InterruptedException {
        byte[] data = new byte[CHUNK_SIZE];
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (true) {
                lock.lock();
                try {
                    channel.position(0);
                    channel.write(ByteBuffer.wrap(data));
                    channel.force(false); // fsync()
                } finally {
                    lock.unlock();
                }
                Thread.sleep(1000);
            }
        }
    }


  51. LOCKER
    private static void locker(ReentrantLock lock) throws InterruptedException {
        while (true) {
            lock.lock();
            Thread.sleep(1);
            lock.unlock();
            Thread.sleep(1000);
        }
    }


  52. Simulate "slow" disk access
    » DDI - Disk Delay Injection
    » https://github.com/kawamuray/ddi
    sudo ./ddi-setup.sh create /path/to/mount
    echo 2000 | sudo tee /sys/fs/ddi/7:0/write_delay # Delay write IOPS by 2 seconds


  53. JBM: Running
    jbm -p TARGET_JVM_PID -m 1000000 # 1 sec
    It will:
    1. Install the async-profiler JVMTI agent on the target JVM
    2. Install the eBPF program
    3. Listen for events from eBPF, and combine received events with the
    event stream from the JVMTI agent
    4. Output event reports
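Step 3 can be sketched as follows. This is a hypothetical simplification (the record and method names are invented here, not JBM's actual code): for each kernel-side blocking event, pick the JVM backtrace sample with the same TID whose timestamp is closest to the event's.

```java
import java.util.List;
import java.util.Optional;

public class EventCorrelator {
    // One backtrace sample from the JVMTI agent's event stream.
    record JvmSample(long timestampUs, int tid, List<String> frames) { }

    // For a kernel-side blocking event (tid + timestamp), pick the JVM sample
    // with the same tid whose timestamp is nearest to the event's.
    static Optional<JvmSample> match(int tid, long eventTsUs, List<JvmSample> stream) {
        return stream.stream()
                .filter(s -> s.tid() == tid)
                .min((a, b) -> Long.compare(
                        Math.abs(a.timestampUs() - eventTsUs),
                        Math.abs(b.timestampUs() - eventTsUs)));
    }

    public static void main(String[] args) {
        List<JvmSample> stream = List.of(
                new JvmSample(100L, 7, List.of("frameA")),
                new JvmSample(900L, 7, List.of("frameB")));
        // The sample at t=900 is nearest to a kernel event at t=800 on tid 7.
        System.out.println(match(7, 800L, stream).get().frames()); // prints [frameB]
    }
}
```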


  54. === 2023-06-02 17:36:41 1685695001988 PID: 11946, TID: 11963 (WRITER), DURATION: 2501820 us
    Native Stack:
    (snip)
    4: [0xffffffffaebebae9] io_schedule_timeout
    5: [0xffffffffaebec21d] wait_for_completion_io_timeout
    6: [0xffffffffae68a742] submit_bio_wait
    7: [0xffffffffae694da9] blkdev_issue_flush
    8: [0xffffffffc02da5a5] xfs_file_fsync [xfs]
    9: [0xffffffffae5976e8] do_fsync
    10: [0xffffffffae597773] __x64_sys_fdatasync
    11: [0xffffffffae2043ab] do_syscall_64
    12: [0xffffffffaec000a9] entry_SYSCALL_64_after_hwframe
    13: [0x7f99b5615a4f] fdatasync@:0
    13: [0x7f99b5615a4f] fdatasync@:0
    (snip)
    --------------------------------------------------------------------------------
    JVM Stack (took: 2023-06-02 17:36:41):
    0: [0x7f9930b998a0] fdatasync
    1: [0x7f9930c21090] sun.nio.ch.FileDispatcherImpl.force0
    2: [0x7f9930c21080] sun.nio.ch.FileDispatcherImpl.force
    3: [0x7f9930c23170] sun.nio.ch.FileChannelImpl.force
    4: [0x7f99ac2623d8] SampleBlockingApplication.writer
    (snip)


  55. === 2023-06-02 17:36:41 1685695001989 PID: 11946, TID: 11965 (LOCKER), DURATION: 10270708 us
    Native Stack:
    (snip)
    3: [0xffffffffae3a1943] futex_wait_queue_me
    4: [0xffffffffae3a36cf] futex_wait
    5: [0xffffffffae3a4b89] do_futex
    6: [0xffffffffae3a5245] __x64_sys_futex
    7: [0xffffffffae2043ab] do_syscall_64
    8: [0xffffffffaec000a9] entry_SYSCALL_64_after_hwframe
    9: [0x7f99b5ce145c] pthread_cond_wait@@GLIBC_2.3.2@:0
    --------------------------------------------------------------------------------
    JVM Stack (took: 2023-06-02 17:36:41):
    0: [0x7f9930be9d30] __pthread_cond_wait
    1: [0x7f9930c46b58] jdk.internal.misc.Unsafe.park
    2: [0x7f9930c40b68] java.util.concurrent.locks.LockSupport.park
    3: [0x7f9930c2b940] java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt
    4: [0x7f9930c2b948] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued
    5: [0x7f9930c2ba78] java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire
    6: [0x7f9930c40f00] java.util.concurrent.locks.ReentrantLock.lock
    7: [0x7f99ac2623e0] SampleBlockingApplication.locker
    (snip)


  56. === 2022-04-03 09:54:00.832 PID: 16179, TID: 16922 (kafka-log-clean), DURATION: 250391556 us
    Native Stack:
    (snip)
    3: [0xffffffff843800fd] io_schedule_timeout+0xad
    4: [0xffffffff84380198] io_schedule+0x18
    5: [0xffffffff8437eb61] bit_wait_io+0x11
    (snip)
    13: [0xffffffff83dec70c] do_read_fault.isra.63+0x4c
    14: [0xffffffff83df11ba] handle_pte_fault+0x22a
    15: [0xffffffff83df3ecd] handle_mm_fault+0x39d
    16: [0xffffffff84388653] __do_page_fault+0x213
    17: [0xffffffff84388975] do_page_fault+0x35
    18: [0xffffffff84384778] page_fault+0x28
    19: [0x7fb2903cf70c] [unknown]
    --------------------------------------------------------------------------------
    JVM Stack (took: 2022-04-03 09:54:00.847):
    0: [0x7face08c0e10] SafepointBlob
    1: [0x7face0885d98] java.nio.DirectByteBuffer.getLong
    2: [0x7face07b89e0] kafka.log.TimeIndex.timestamp
    3: [0x7face07b8900] kafka.log.TimeIndex.parseEntry
    4: [0x7face07b88f8] kafka.log.TimeIndex.parseEntry
    5: [0x7face07b9200] kafka.log.AbstractIndex.indexSlotRangeFor
    (snip)
    27: [0x7face07b1540] kafka.log.LogCleaner$CleanerThread.doWork
    28: [0x7face07c2860] kafka.utils.ShutdownableThread.run


  57. Achievement
    » Running on production Kafka clusters for 2+ years
    » No performance problem caused by overhead
    » Overhead is extremely low thanks to eBPF + on-demand
    sampling
    » Helped a lot of investigations


  58. Takeaways
    » Unusually long blocking rarely happens, but it leaves
    critical consequences for production applications when it occurs
    » Such rarely-occurring problems are difficult to investigate due to the lack
    of clues and reproducibility
    » Having a consistent snapshot of stack traces from the kernel to the JVM helps
    us learn exactly what happened at the time


  59. Takeaways
    » Building an eBPF program is easy, and it is super powerful
    » Combining existing tools enables new observations that
    couldn't be achieved before


  60. Wanna try JBM?
    It's an OSS :)
    https://github.com/kawamuray/jbm


  61. End of presentation


  62. Getting Java classes backtraces
    from eBPF
    » ucalls.py 4
    » JVM supports USDT (User Statically Defined Traces) 5
    » method-entry and method-return
    METHOD # CALLS TIME (us)
    java/io/BufferedInputStream.getBufIfOpen 1 7.00
    slowy/App.isSimplePrime 8970 8858.35
    slowy/App.isDivisible 3228196 3076985.12
    slowy/App.isPrime 8969 4841017.64
    5 https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html
    4 https://github.com/iovisor/bcc/blob/master/tools/lib/ucalls.py#L68


  63. USDT probes
    » Requires -XX:
    +ExtendedDTraceProbes
    » Performance overhead? 6
    6 https://twitter.com/brendangregg/status/932330965279502336


  64. USDT probe based backtrace
    » Complex implementation
    » Store/Remove every method invocation in thread local storage
    » Handling of ping-pong style invocations: Java Method A ->
    Native Function A -> Java Method B
    » Must hook and record all method-entry, method-exit
    » We don't know when we'll need to get backtrace
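The bullets above amount to maintaining a per-thread shadow stack. A minimal Java sketch of the idea (hypothetical; the actual USDT-based tools hook the probes natively) shows why every method-entry and method-return must be instrumented:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal shadow-stack sketch of the USDT-based approach: every method-entry
// pushes and every method-return pops, so a backtrace can be read at any
// moment, at the cost of instrumenting ALL invocations all the time.
public class ShadowStack {
    private static final ThreadLocal<Deque<String>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void methodEntry(String name) { STACK.get().push(name); } // hook on method-entry
    static void methodReturn()           { STACK.get().pop(); }      // hook on method-return

    // Snapshot of the current thread's shadow stack, innermost frame first.
    static List<String> backtrace() { return List.copyOf(STACK.get()); }

    public static void main(String[] args) {
        methodEntry("App.main");
        methodEntry("App.compute");
        System.out.println(backtrace()); // [App.compute, App.main]
        methodReturn();
        System.out.println(backtrace()); // [App.main]
    }
}
```

Every single invocation pays the push/pop cost even though backtraces are needed only rarely, which is why the on-demand async-profiler approach is attractive instead.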


  65. Scaffolding
    » cargo generate https://github.com/aya-rs/aya-template
    » hello-project/
    » hello-common - Shared definitions between userspace program
    and eBPF program
    » hello - Userspace program
    » hello-ebpf - eBPF program
