
Time Travel Stack Trace Analysis with async-profiler - An Application in a Kafka Performance Issue Investigation

async-profiler is a low-overhead sampling profiler commonly used for investigating the performance of Java applications. By emitting a wall-clock profile in JFR format, it also makes it possible to analyze the stack trace of any thread at any point in time.

This is extremely useful for identifying time-consuming operations that do not show up in the flame graph and for analyzing lock contention between threads.

This session presents a case where this time-travel stack trace analysis was used to identify the cause of a performance issue that occurred on a production Kafka cluster.

During the investigation we also developed a tool for visualizing async-profiler's JFR output ( https://ocadaruma.github.io/jfrv ), which is introduced as well.

Haruki Okada / LINE
LINE Senior software engineer

Note: This material was presented at the following event:
https://sessionize.com/api/v2/y7inyq6y/view/GridSmart

LINE Developers

June 04, 2023



Transcript

  1. Time travel stack trace analysis

    with async-profiler
    Haruki Okada
    2023.06.04
    Its application in a Kafka performance issue investigation


  2. Speaker
    2
    •Haruki Okada
    •Twitter/GitHub: ocadaruma
    •Lead of LINE IMF team
    •The team responsible for developing/maintaining the company-wide Kafka platform


  3. Apache Kafka
    3
    •Streaming middleware
    •Written in Scala/Java. Runs on JVM


  4. Apache Kafka at LINE
    4
    •One of the most popular pieces of middleware at LINE
    •Use cases:
    •Log pipeline
    •Job queueing
    •Pub/Sub events
    •Scale:
    •Daily data in/out: > 2 petabytes
    •Daily message inflow: > 1 trillion messages
    •Peak messages/sec: > 28 million messages


  5. Performance engineering
    5
    •We put a lot of engineering effort into making our clusters highly performant and reliable
    •Refs.
    •“Reliability Engineering Behind The Most Trusted Kafka Platform”

    (LINE DEVELOPER DAY 2019)
    •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP"

    (LINE DEVELOPER DAY 2021)


  6. •Overview of the performance issue on our Kafka cluster
    •Investigation process
    •How we use async-profiler
    •Solution
    •Conclusion
    Agenda
    6


  7. •Produce response time degraded to over 600 ms on our Kafka broker
    •The usual produce response time is < 50 ms at the 99th percentile
    Phenomenon
    7


  8. •High “RequestQueue” time
    Produce response time breakdown
    8


  9. How requests are processed in Kafka
    9
    •A “RequestQueue” time increase implies requests sit in the request queue for a long time


  10. 10
    Request handlers were busy


  11. A producer occupied request handler threads
    11
    •At the time of the incident, we observed high thread-time utilization for a specific producer
    •This implies that handling requests from this producer took a long time


  12. Identifying the hot code path: async-profiler
    12
    •Popular low overhead sampling profiler for Java
    •https://github.com/jvm-profiling-tools/async-profiler
    •We use wall-clock profiling, which samples all threads regardless of whether they are consuming CPU time
    •profiler.sh -d 30 -e wall -f profile.html $KAFKA_PID


  13. 13
    async-profiler result:
    •The hottest part was waiting for the lock in Log.append
    [Diagram: request-handler-0, request-handler-1, ... all appending to the same log file and contending for its lock]
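
    To make the diagram concrete, here is a minimal Java sketch (not Kafka's actual code) of the pattern it depicts: every request-handler thread appends to a partition's log under the same lock, so a single slow append stalls all the others.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class PartitionLogSketch {
        private final Object lock = new Object();   // the lock the flame graph showed threads waiting on
        private final FileChannel activeSegment;

        public PartitionLogSketch(FileChannel activeSegment) {
            this.activeSegment = activeSegment;
        }

        // Called concurrently by request-handler threads
        public void append(ByteBuffer records) throws IOException {
            synchronized (lock) {
                activeSegment.write(records);       // write(2) to the active segment, inside the lock
            }
        }
    }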


  14. 14
    Identifying the cause of the lock contention
    •We want to identify:
    •The thread that acquired the lock
    •The method that thread was executing at that time
    [Diagram: request-handler threads blocked in append on the log file's lock, which was acquired by an unknown thread (THREAD-???)]


  15. 15
    The flame graph didn’t tell much
    •Curiously, there were no other unusually hot code paths on the flame graph


  16. What’s next?
    16
    •What we want:
    •To travel back to the time of the incident and analyze the stack traces
    => How?
    [Diagram: per-thread stack traces at the incident time — request-handler-0 and request-handler-1 blocked in pthread_cond_wait at acquire_lock, while THREAD-??? executes an unknown method]


  17. Why not JFR & JMC?
    17
    •Java Flight Recorder (JFR) is also a popular profiling tool
    •JFR provides a thread activity timeline


  18. Why not JFR & JMC?
    18
    •Unlike async-profiler, JFR instruments only a limited set of I/O types, e.g.
    •FileRead/Write events by FileChannel/File{Input,Output}Stream/RandomAccessFile
    •FileForce events by FileChannel
    •Kafka heavily depends on I/O types that JFR doesn't instrument, e.g. (see the sketch below)
    •FileChannel#transferTo: sendfile(2)
    •MappedByteBuffer: File access through mmap(2)
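
    As an illustration, a minimal sketch of the mmap(2) path (the file name is just an example; Kafka's offset index files are accessed this way): the reads are served by page faults, so JFR's FileRead events never see them. FileChannel#transferTo is sketched later on the sendfile(2) slide.

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel index = FileChannel.open(Path.of("00000000000000000000.index"),
                    StandardOpenOption.READ)) {
                // File access through mmap(2): no read(2) calls, hence no JFR FileRead events
                MappedByteBuffer buf = index.map(FileChannel.MapMode.READ_ONLY, 0, index.size());
                long sum = 0;
                while (buf.hasRemaining()) {
                    sum += buf.get();   // each access may fault a page in from disk
                }
                System.out.println("byte sum: " + sum);
            }
        }
    }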


  19. Why not just take thread dumps periodically?
    19
    •e.g.
    •while true; do
        jcmd $KAFKA_PID Thread.print > thread-dump.$(date +%s%3N)
        sleep 0.1
    done
    •Taking a thread dump requires bringing the JVM to a “safepoint”,
    during which all threads are paused
    •So it has a certain overhead


  20. Solution: async-profiler && JFR output!
    20
    •async-profiler supports outputting the profile in JFR (Java Flight Recorder) format
    •profiler.sh -d 30 -e wall -f profile.jfr $KAFKA_PID


  21. Idea: Generate threads timeline from JFR
    21


  22. Problem: No good tool for visualization
    22
    •JMC can’t generate a thread timeline view from async-profiler’s JFR output


  23. Let’s create a tool then!
    23
    •jfrv: A web-based tool to visualize async-profiler JFR output
    •https://ocadaruma.github.io/jfrv
    [Screenshot: jfrv UI — thread list, per-thread sample timeline, and the thread dump of the highlighted sample]


  24. Finally, we managed to visualize the threads

    at the time of the incident
    24


  25. Observation:

    One request-handler was writing to the log
    25
    pthread_cond_wait@@GLIBC_2.3.2
      at ObjectMonitor::EnterI
      at ObjectMonitor::enter
      at ObjectSynchronizer::slow_enter
      ...
      at kafka/log/Log.append:1062

    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158


  26. 26
    Observation: A blank period in profiled samples
    •One request-handler thread was writing to the log
    •Which is done while holding the lock
    •There’s a gap in the profiled samples for that thread
    => What does this mean?
    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158

    pthread_cond_wait@@GLIBC_2.3.2
      at ObjectMonitor::EnterI
      at ObjectMonitor::enter
      at ObjectSynchronizer::slow_enter
      ...
      at kafka/log/Log.append:1062


  27. How async-profiler (wall-clock mode) works
    27


  28. Threads cannot execute signal handlers

    in certain circumstances
    28
    TASK_UNINTERRUPTIBLE: The process is waiting on certain special classes of event,
    such as the completion of a disk I/O.
    If a signal is generated for a process in this state, then the signal is not delivered until the
    process emerges from this state.
    -- The Linux Programming Interface by Michael Kerrisk


  29. Long-running write(2) =>

    Signal handling delay => Blank period in profiled samples
    29


  30. This explains why we couldn’t see any clue in the flame graph
    30
    •write(2) = 1 sample, foo_method = 2 samples
    •Even though write(2) actually took longer than foo_method


  31. Flame graphs don’t account for samples correctly

    when threads are in TASK_UNINTERRUPTIBLE
    31
    •A flame graph is a visualization that represents sample counts by bar length
    •It may not reflect the “actual duration” correctly due to the signal handling delay
    [Diagram: example flame graph of a thread with 6 samples, split across Thread.sleep, foo_method, and write(2)]


  32. 32
    i.e. write(2) actually took a long time
    •The gap was likely due to the signal handling delay during the long-running write(2)
    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158


  33. Why did write(2) take a long time?
    33
    •write(2) is generally not supposed to block
    •On Linux, write(2) just copies data into pages in memory (the page cache)
    •The kernel’s writeback task then writes the data to the underlying device asynchronously (see the sketch below)
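
    A minimal sketch of this behavior (illustrative only, not from the deck): a plain write() returns as soon as the data lands in the page cache, while an explicit force() is what actually waits for the device.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class WriteLatencySketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Path.of(args[0]),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.allocate(1 << 20);   // 1 MiB of zeros
                long t0 = System.nanoTime();
                ch.write(buf);                                   // usually just a copy into the page cache
                long afterWrite = System.nanoTime();
                ch.force(false);                                 // waits until the kernel flushes the pages
                long afterForce = System.nanoTime();
                System.out.printf("write: %d us, force: %d us%n",
                        (afterWrite - t0) / 1_000, (afterForce - afterWrite) / 1_000);
            }
        }
    }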


  34. Observation: Simultaneous

    blank period in another thread
    34
    __GI_sendfile64
      at sun/nio/ch/FileChannelImpl.transferTo0
      at sun/nio/ch/FileChannelImpl.transferToDirectlyInternal:501
      at sun/nio/ch/FileChannelImpl.transferToDirectly:566
      ...
      at org/apache/kafka/common/network/KafkaChannel.send:429

    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158
    •i.e. sendfile(2) took a long time at the same time


  35. How does Kafka use sendfile(2)
    35
    •sendfile(2) is a system call that copies data from one file descriptor to another
    •Kafka uses sendfile(2) to send records directly to consumers, avoiding a copy of the data into user space (see the sketch below)
    [Diagram: a network thread serves a consumer fetch request by sending data from the filesystem to the socket via sendfile(2)]
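
    A minimal Java sketch of this zero-copy path (not Kafka's actual code; the file path, host, and port are made up): FileChannel#transferTo hands the file region to the kernel, which on Linux becomes sendfile(2).

    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SendfileSketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel log = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
                long position = 0;
                long remaining = log.size();
                while (remaining > 0) {
                    // Data flows from the page cache straight to the socket, never copied to user space
                    long sent = log.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }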


  36. Does sendfile(2) block?
    36
    •Kafka is streaming middleware. Usually, consumers read the latest data from the tail of the log
    •That data is "hot" enough that it likely resides in the page cache (memory)
    [Diagram: the producer writes and the consumer reads at the tail of the log segment file, whose pages are in the page cache; only older pages are not in the page cache]


  37. sendfile(2) could block
    37
    •However, consumers sometimes read old data to catch up
    •If the data is not in the page cache, sendfile(2) needs to read it from the device (HDD), which is significantly slower
    •ref: https://logmi.jp/tech/articles/320330
    [Diagram: the producer writes to the tail of the log segment file (in the page cache), while the consumer reads an old portion that is not in the page cache]


  38. In fact, there was a lagging consumer at the time of the incident
    38
    •This is very likely what caused sendfile(2) to block


  39. How do slow sendfile(2) and slow write(2) relate?
    39
    •At the Kafka code level, network threads and request-handler threads don’t share any lock
    •write(2) and sendfile(2) simply slowed down at the same time
    •Hypothesis: lock contention inside the kernel
    [Diagram: the request-handler thread and the network thread both access the same log segment file — some lock inside the kernel???]


  40. Experiment
    40
    •We wrote a small program to check whether the hypothesis is plausible
    •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416
    [Diagram: experiment setup — 1. evict the file from the page cache, 2. the reader calls sendfile(2) on it, 3. the writer calls write(2) on the same file]
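
    The actual experiment code is in the gist above; the following is only a rough sketch of the same idea (file path, chunk sizes, and timings are made up): evict the file from the page cache beforehand, then have one thread stream it out with transferTo (sendfile(2)) while another thread appends to it and times each write(2).

    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ContentionSketch {
        public static void main(String[] args) throws Exception {
            Path file = Path.of(args[0]);   // a large file, evicted from the page cache beforehand

            // Reader: stream the cold file out via transferTo, i.e. sendfile(2)
            Thread reader = new Thread(() -> {
                try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
                     FileChannel sink = new FileOutputStream("/dev/null").getChannel()) {
                    long pos = 0;
                    long size = src.size();
                    while (pos < size) {
                        pos += src.transferTo(pos, 8 * 1024 * 1024, sink);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            reader.start();

            // Writer: append small chunks to the same file and time each write(2)
            try (FileChannel out = FileChannel.open(file, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                for (int i = 0; i < 100; i++) {
                    buf.clear();
                    long t0 = System.nanoTime();
                    out.write(buf);   // normally a quick page-cache copy, but can block behind sendfile(2)
                    System.out.printf("write took %d ms%n", (System.nanoTime() - t0) / 1_000_000);
                    Thread.sleep(100);
                }
            }
            reader.join();
        }
    }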


  41. Result: write(2) indeed blocked during sendfile(2)
    41
    •We injected a read delay into the device (using dm-delay)
    •Writes to the file blocked until sendfile(2) returned

    $ java App3 /path/to/test.bin
    [2022-08-08T15:38:07.702003] Took 500 ms to transfer
    [2022-08-08T15:38:07.702033] Took 450 ms to write
    ...


  42. Reading Linux kernel source code
    42
    •Next, we needed to identify the root cause of the contention
    •After reading the Linux kernel source code, we found that both paths require a lock on the inode:
    • xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336
    • xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723


  43. inode lock contention in xfs was the root cause!
    43
    •There's a discussion about this on the linux-fsdevel mailing list:
    •https://lore.kernel.org/linux-fsdevel/[email protected]
    "xfs: reduce ilock contention on buffered randrw workload"


  44. Overall flow of the phenomenon
    44
    •1. A consumer started to read old data, which caused slow sendfile(2) calls
    because the data had to be read from a slow device (HDD)
    [Diagram: the consumer’s fetch is served by a network thread via sendfile(2) from a log segment file whose pages are not in the page cache]


  45. Overall flow
    45
    •2. The slow sendfile(2) blocked produce-request handling on write(2)
    due to the inode lock contention in xfs
    [Diagram: the request-handler thread, holding the Kafka log lock, is blocked in write(2) to the log segment file while the network thread’s sendfile(2) is in progress]


  46. Overall flow
    46
    •3. Meanwhile, other produce requests were also blocked, due to the lock contention inside Kafka
    [Diagram: requests from other producers pile up on request-handler threads that are waiting for the Kafka log lock held by the thread stuck in write(2)]


  47. Overall flow
    47
    •4. Request handlers became busy, so dequeuing from the request queue was delayed
    => This resulted in the 99th-percentile produce response time degradation


  48. Solution
    48
    •Root cause in short:
    a lagging consumer (which reads old data that is no longer in the page cache) and
    the producer were reading and writing the same file (inode)
    [Diagram: the producer writes to, and the lagging consumer reads from, the same log segment file]


  49. Solution
    49
    •The producer always writes to the last (active) log segment file
    •=> Set a smaller maximum segment file size so that segment files are rotated
    before their data is evicted from the page cache (a config sketch follows below)
    [Diagram: after rotation, the producer writes to the new log segment file (in the page cache) while the lagging consumer reads from an old segment file, so they no longer touch the same inode]
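
    For reference, a hedged sketch of how such a per-topic segment size override could be applied with Kafka's AdminClient (the topic name, bootstrap server, and 256 MiB value are made up; the deck doesn't state the actual setting used):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ShrinkSegmentSize {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // Rotate segments earlier so the segment read by lagging consumers
                // is no longer the one the producer is appending to
                AlterConfigOp op = new AlterConfigOp(
                        new ConfigEntry("segment.bytes", String.valueOf(256 * 1024 * 1024)),
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
            }
        }
    }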


  50. Conclusion
    50
    •We can perform time-travel stack trace analysis using async-profiler’s JFR output
    in wall-clock mode
    •It greatly helps performance issue investigations
    •Some system calls can produce confusing flame graphs
    •While executing certain system calls (e.g. write(2), sendfile(2)),
    signal handlers cannot be executed
    •=> async-profiler cannot take samples in the meantime
    •=> Such calls may not show up in flame graphs even when their actual duration is long
