
Stack Trace Time Travel with async-profiler - An Application to a Kafka Performance Issue Investigation

async-profiler is a low-overhead sampling profiler commonly used to investigate the performance of Java applications. By emitting a wall-clock profile in JFR format, it also lets you analyze the stack trace of any thread at any point in time.

This is very useful for identifying time-consuming operations that do not show up in the flame graph, and for analyzing lock contention between threads.

In this session, we present a case in which stack trace time travel identified the cause of a performance issue that occurred on our production Kafka cluster.

During the investigation, we also developed a tool ( https://ocadaruma.github.io/jfrv ) that visualizes async-profiler's JFR output; we introduce this tool as well.

Haruki Okada / LINE
LINE Senior software engineer

Note: This deck was presented at the following event.
https://sessionize.com/api/v2/y7inyq6y/view/GridSmart

LINE Developers

June 04, 2023



Transcript

1. Time travel stack trace analysis with async-profiler - The application in Kafka performance issue investigation / Haruki Okada / 2023.06.04
2. Speaker •Haruki Okada •Twitter/GitHub: ocadaruma •Lead of LINE IMF team •The team responsible for developing and maintaining the company-wide Kafka platform
3. Apache Kafka at LINE •One of the most popular middleware at LINE •Use cases: •Log pipeline •Job queueing •Pub/Sub events •Scale: •Daily data in/out: > 2 petabytes •Daily message inflow: > 1 trillion messages •Peak messages/sec: > 28 million messages
4. Performance engineering •We put a lot of engineering effort into making our cluster highly performant and reliable •Refs: •"Reliability Engineering Behind The Most Trusted Kafka Platform" (LINE DEVELOPER DAY 2019) •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP" (LINE DEVELOPER DAY 2021)
5. Agenda •Overview of the performance issue on our Kafka cluster •Investigation process •How we use async-profiler •Solution •Conclusion
6. Phenomenon •Produce response time degradation of over 600 ms on our Kafka broker •The usual produce response time is < 50 ms at the 99th percentile
7. How requests are processed in Kafka •A "RequestQueue" time increase implies requests sit in the request queue for a long time
8. A producer occupied request handler threads •At the time of the incident, we observed high thread-time utilization for a specific producer •This implies that handling requests for this producer took a long time
9. Identifying the hot code path: async-profiler •Popular low-overhead sampling profiler for Java •https://github.com/jvm-profiling-tools/async-profiler •We use wall-clock profiling, which samples all threads regardless of whether they are consuming CPU time or not •profiler.sh -d 30 -e wall -f profile.html $KAFKA_PID
10. async-profiler result •The hottest part was waiting for the lock in Log.append (diagram: request-handler-0, request-handler-1, ... all appending to the log file under a single lock)
11. Identifying the cause of the lock contention •We want to identify: •The thread that acquired the lock •The method that thread was executing at that time (diagram: request-handler threads blocked on the log lock, which THREAD-??? has acquired)
12. The flame graph didn't tell much •Curious thing: no other unusually hot code paths on the flame graph
13. What's next? •What we want: time travel to the time of the incident && analyze stack traces => HOW? (diagram: per-thread stack traces over time for THREAD-???, request-handler-0 and request-handler-1, e.g. pthread_cond_wait at acquire_lock at …, method-??? at … at …)
14. Why not JFR & JMC? •Java Flight Recorder (JFR) is also a popular profiling tool •JFR provides a thread activity timeline
15. Why not JFR & JMC? •Unlike async-profiler, JFR instruments only limited types of I/O, e.g.: •FileRead/Write events by FileChannel/File{Input,Output}Stream/RandomAccessFile •FileForce events by FileChannel •Kafka heavily depends on I/O types that JFR doesn't support, e.g.: •FileChannel#transferTo: sendfile(2) •MappedByteBuffer: file access through mmap(2) (see the sketch below)
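
For illustration only (a minimal sketch, not Kafka's code; the file path is a placeholder), access through a MappedByteBuffer goes via mmap(2) and therefore produces none of the FileRead/FileWrite events listed above:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/tmp/mmap-demo.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map 4 KiB of the file; subsequent reads/writes go through mmap(2)'d memory,
            // so JFR's FileRead/FileWrite instrumentation never sees them.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, 42L);                // write through the page cache
            System.out.println(buf.getLong(0)); // read through the page cache
        }
    }
}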
16. Why not just take thread dumps periodically? •e.g.:
while true; do
  jcmd $KAFKA_PID Thread.print > thread-dump.$(date +%s%3N)
  sleep 0.1
done
•Taking a thread dump requires bringing the JVM to a "safepoint", during which all threads are paused •This has a certain overhead
17. Solution: async-profiler && JFR output! •async-profiler supports outputting the profile in JFR (Java Flight Recorder) format •profiler.sh -d 30 -e wall -f profile.jfr $KAFKA_PID
18. Problem: no good tool for visualization •JMC can't generate a thread timeline view from async-profiler's JFR output
19. Let's create a tool then! •jfrv: a web-based tool to visualize async-profiler JFR output •https://ocadaruma.github.io/jfrv (screenshot: thread list, sample timeline per thread, thread dump of the highlighted sample) (see the sketch below)
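
As a rough illustration of the idea behind jfrv (not its actual implementation), the JDK's jdk.jfr consumer API can iterate the recording and list samples per thread; the event name "jdk.ExecutionSample" and the field name "sampledThread" are assumptions about what async-profiler emits:

import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class JfrDump {
    public static void main(String[] args) throws Exception {
        // Reads a recording produced by e.g. profiler.sh -d 30 -e wall -f profile.jfr $KAFKA_PID
        for (RecordedEvent e : RecordingFile.readAllEvents(Path.of(args[0]))) {
            if (!"jdk.ExecutionSample".equals(e.getEventType().getName())) continue;
            if (e.getStackTrace() == null) continue;
            // One line per sample: timestamp, sampled thread, top frame.
            // jfrv renders the same data as a per-thread timeline instead.
            System.out.printf("%s %s %s%n",
                    e.getStartTime(),
                    e.getThread("sampledThread").getJavaName(),
                    e.getStackTrace().getFrames().get(0).getMethod().getName());
        }
    }
}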
20. Observation: one request-handler was writing to the log
Threads waiting for the lock:
pthread_cond_wait@@GLIBC_2.3.2
  at ObjectMonitor::EnterI
  at ObjectMonitor::enter
  at ObjectSynchronizer::slow_enter
  ...
  at kafka/log/Log.append:1062
The writing thread:
__write
  at sun/nio/ch/FileDispatcherImpl.write0
  at sun/nio/ch/FileDispatcherImpl.write:62
  at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
  ...
  at kafka/log/LogSegment.append:158
21. Observation: a blank period in the profiled samples •One request-handler thread was writing to the log (the __write / kafka/log/LogSegment.append:158 stack shown above), which is done while holding the lock •There's a gap in the profiled samples for that thread => What does it mean?
22. Threads cannot execute signal handlers in certain circumstances •async-profiler's wall-clock samples are taken by a signal handler running on the target thread •TASK_UNINTERRUPTIBLE: "The process is waiting on certain special classes of event, such as the completion of a disk I/O. If a signal is generated for a process in this state, then the signal is not delivered until the process emerges from this state." -- The Linux Programming Interface by Michael Kerrisk
23. This explains why we can't see any clue in the flame graph •write(2) = 1 sample, foo_method = 2 samples •Even though write(2) actually takes longer than foo_method
24. The flame graph doesn't account for samples correctly when threads are in TASK_UNINTERRUPTIBLE •A flame graph is a visualization that represents sample count by bar length •It may not reflect the "actual duration" correctly due to the signal handling delay (diagram: a thread with 6 samples split across Thread.sleep, foo_method and write(2))
25. i.e. write(2) actually took a long time •The gap was likely due to the signal handling delay during a long-running write(2) (stack trace: __write at sun/nio/ch/FileDispatcherImpl.write0 ... at kafka/log/LogSegment.append:158)
26. Why did write(2) take a long time? •write(2) is not supposed to block in general •On Linux, write(2) just copies data into a "page" in memory (the page cache) •The kernel's writeback task writes the data to the underlying device asynchronously (see the sketch below)
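
A minimal sketch (file path and sizes are illustrative) that makes this visible: the write() usually returns as soon as the data sits in the page cache, while an explicit force() has to wait for the device:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteVsForce {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/tmp/writeback-demo.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(new byte[8 * 1024 * 1024]);

            long t = System.nanoTime();
            ch.write(buf); // usually fast: the data is only copied into the page cache
            System.out.println("write: " + (System.nanoTime() - t) / 1_000_000 + " ms");

            t = System.nanoTime();
            ch.force(false); // flushes the data to the device; this is where the real I/O cost shows up
            System.out.println("force: " + (System.nanoTime() - t) / 1_000_000 + " ms");
        }
    }
}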
27. Observation: a simultaneous blank period in another thread •i.e. sendfile(2) took a long time at the same time (alongside the __write / kafka/log/LogSegment.append:158 trace shown earlier)
__GI_sendfile64
  at sun/nio/ch/FileChannelImpl.transferTo0
  at sun/nio/ch/FileChannelImpl.transferToDirectlyInternal:501
  at sun/nio/ch/FileChannelImpl.transferToDirectly:566
  ...
  at org/apache/kafka/common/network/KafkaChannel.send:429
28. How does Kafka use sendfile(2)? •sendfile(2) is a system call that copies data from one file descriptor to another •Kafka uses sendfile(2) to send records directly to the consumer, avoiding a copy of the data into user space (diagram: network threads answer a consumer fetch request by sendfile(2)-ing from the filesystem to the socket) (see the sketch below)
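
A simplified sketch of this zero-copy path (not Kafka's actual code; the host, port and file path are placeholders): on Linux, FileChannel#transferTo to a socket is backed by sendfile(2):

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendfileSketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel log = FileChannel.open(Path.of("/path/to/segment.log"), StandardOpenOption.READ);
             SocketChannel consumer = SocketChannel.open(new InetSocketAddress("consumer-host", 9999))) {
            long pos = 0, size = log.size();
            while (pos < size) {
                // On Linux this is backed by sendfile(2): data flows from the page cache
                // straight to the socket without being copied into user space.
                pos += log.transferTo(pos, size - pos, consumer);
            }
        }
    }
}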
29. Does sendfile(2) block? •Kafka is a streaming middleware; usually, consumers read the latest data from the tail of the log •That data is "hot" enough that it likely resides in the page cache (memory) (diagram: producer writes to and consumer reads from recently written pages of the log segment file)
30. sendfile(2) could block •However, consumers sometimes read old data to catch up •If the data is not in the page cache, sendfile(2) needs to read it from the device (HDD), which is significantly slower •refs: https://logmi.jp/tech/articles/320330 (diagram: consumer reads pages of the log segment file that are not in the page cache)
31. Actually, there was a lagging consumer at the time of the incident •This is highly likely the cause of the blocking sendfile(2)
32. How do slow sendfile(2) and slow write(2) relate? •At the Kafka code level, the network thread and the request handler don't share any lock •write(2) and sendfile(2) just slowed down at the same time •Hypothesis: lock contention inside the kernel (diagram: the request-handler thread and the network thread accessing the same log segment file under an unknown lock)
33. Experiment •Wrote a small program to check whether the hypothesis is plausible •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416 (diagram: 1. evict the file from the page cache, 2. the reader calls sendfile(2), 3. the writer calls write(2) on the same file) (see the sketch below)
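
A rough sketch of that kind of experiment (the actual code is in the gist above; the /dev/null sink, timings and file handling here are illustrative assumptions): one thread transfers a file that was evicted from the page cache while another thread times a write() to the same file:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ContentionSketch {
    public static void main(String[] args) throws Exception {
        // args[0]: a pre-populated file whose pages were evicted beforehand
        // (e.g. via posix_fadvise(DONTNEED) or by dropping caches), so the transfer hits the device.
        Path file = Path.of(args[0]);
        Thread reader = new Thread(() -> {
            try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
                 FileChannel sink = FileChannel.open(Path.of("/dev/null"), StandardOpenOption.WRITE)) {
                long t = System.nanoTime();
                // Intended to exercise the sendfile(2)/splice read path; whether the JDK
                // actually uses sendfile(2) here depends on the JDK version and target channel.
                src.transferTo(0, src.size(), sink);
                System.out.println("transfer took " + (System.nanoTime() - t) / 1_000_000 + " ms");
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        });
        reader.start();
        Thread.sleep(10); // let the slow transfer start first
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.APPEND)) {
            long t = System.nanoTime();
            ch.write(ByteBuffer.wrap(new byte[64])); // a small write(2) to the same inode
            System.out.println("write took " + (System.nanoTime() - t) / 1_000_000 + " ms");
        }
        reader.join();
    }
}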
34. Result: write(2) indeed blocked during sendfile(2) •Injected a read delay into the device (using dm-delay) •Writing to the file blocked until the sendfile(2) returned
$ java App3 /path/to/test.bin
[2022-08-08T15:38:07.702003] Took 500 ms to transfer
[2022-08-08T15:38:07.702033] Took 450 ms to write
...
35. Reading the Linux kernel source code •Next, we needed to identify the root cause of the contention •After reading the Linux kernel source code, we found that both paths require a lock on the inode: •xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336 •xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723
36. inode lock contention in xfs was the root cause! •There's a discussion on the linux-fsdevel ML: "xfs: reduce ilock contention on buffered randrw workload" •https://lore.kernel.org/linux-fsdevel/[email protected]
37. Overall flow of the phenomenon •1. A consumer started to read old data, which caused a slow sendfile(2) due to reading the data from the slow device (HDD) (diagram: network threads sendfile(2)-ing from the log segment file to the consumer's socket)
38. Overall flow •2. The sendfile(2) blocked produce-request handling on write(2) due to the lock contention in xfs (diagram: a request handler thread holding the Log lock blocked on write(2) to the same log segment file)
39. Overall flow •3. Meanwhile, other produce requests were also blocked due to the lock contention in Kafka (diagram: other request handler threads waiting for the Log lock)
40. Overall flow •4. The request handlers got busy, so dequeuing from the request queue was delayed => resulted in the 99th percentile produce response time degradation
41. Solution •Root cause in short: a lagging consumer (which requires old data that is not in the page cache) and the producer read and write the same file (inode) (diagram: producer writes to and consumer reads from the same log segment file)
42. Solution •The producer always writes to the last (active) log segment file •=> Set a smaller max segment file size so that segment files are rotated before their data is evicted from the page cache (diagram: producer writes to the active segment while the consumer reads an old segment file) (see the note below)
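
In Kafka terms (the value below is illustrative, not our actual setting), this corresponds to lowering the topic-level segment.bytes config (broker-wide: log.segment.bytes), e.g. segment.bytes=268435456 to roll segments at 256 MiB instead of the 1 GiB default.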
43. Conclusion •We can perform time travel stack trace analysis using async-profiler's JFR output in wall-clock mode •It greatly helps performance issue investigations •Some system calls may cause confusing flame graphs •While executing some system calls (e.g. write(2), sendfile(2)), signal handlers cannot be executed •=> async-profiler cannot take samples in the meantime •=> Such calls may not appear in flame graphs even when their actual duration is long