
Time Travel Stack Trace Analysis with async-profiler - An Application in a Kafka Performance Issue Investigation

async-profiler is a low-overhead sampling profiler commonly used for investigating the performance of Java applications. By emitting a wall-clock profile in JFR format, it also makes it possible to analyze the stack trace of any thread at any point in time.

This is extremely useful for identifying time-consuming operations that do not show up in the flame graph and for analyzing lock contention between threads.

This session presents a case where this time-travel stack trace analysis was used to identify the cause of a performance issue that occurred on a production Kafka cluster.

During the investigation we also developed a tool for visualizing async-profiler's JFR output ( https://ocadaruma.github.io/jfrv ), which is introduced as well.

Haruki Okada / LINE
LINE Senior software engineer

Note: This material was presented at the following event:
https://sessionize.com/api/v2/y7inyq6y/view/GridSmart

LINE Developers

June 04, 2023



Transcript

  1. Time travel stack trace analysis

    with async-profiler
    Haruki Okada
    2023.06.04
    Its application in a Kafka performance issue investigation


  2. Speaker
    2
    •Haruki Okada
    •Twitter/GitHub: ocadaruma
    •Lead of LINE IMF team
    •The team responsible for developing/maintaining the company-wide Kafka platform


  3. Apache Kafka
    3
    •Streaming middleware
    •Written in Scala/Java. Runs on JVM


  4. Apache Kafka at LINE
    4
    •One of the most popular pieces of middleware at LINE
    •Use cases:
    •Log pipeline
    •Job queueing
    •Pub/Sub events
    •Scale:
    •Daily data in/out: > 2 petabytes
    •Daily message inflow: > 1 trillion messages
    •Peak messages/sec: > 28 million messages


  5. Performance engineering
    5
    •We put a lot of engineering effort into making our clusters highly performant and reliable
    •Refs.
    •“Reliability Engineering Behind The Most Trusted Kafka Platform”

    (LINE DEVELOPER DAY 2019)
    •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP"

    (LINE DEVELOPER DAY 2021)


  6. •Overview of the performance issue on our Kafka cluster
    •Investigation process
    •How we use async-profiler
    •Solution
    •Conclusion
    Agenda
    6


  7. •Produce response time degraded to over 600 ms on our Kafka broker
    •The usual produce response time is < 50 ms at the 99th percentile
    Phenomenon
    7


  8. •High “RequestQueue” time
    Produce response time breakdown
    8


  9. How requests are processed in Kafka
    9
    •A “RequestQueue” time increase implies requests sit in the request queue for a long time


  10. 10
    Request handlers were busy


  11. A producer occupied request handler threads
    11
    •At the time of the incident, we observed high thread-time utilization for a specific producer
    •This implies that handling requests from this producer took a long time


  12. Identifying the hot code path: async-profiler
    12
    •Popular low overhead sampling profiler for Java
    •https://github.com/jvm-profiling-tools/async-profiler
    •We use wall-clock profiling, which samples all threads regardless of whether they are consuming CPU time
    •profiler.sh -d 30 -e wall -f profile.html $KAFKA_PID


  13. 13
    async-profiler result:
    •The hottest part was waiting for the lock in Log.append
    [Diagram: request-handler-0, request-handler-1, ... all appending to the same log file and contending for its lock]
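
    To make the diagram concrete, here is a minimal Java sketch (not Kafka's actual code) of the pattern it depicts: every request-handler thread appends to a partition's log under the same lock, so a single slow append stalls all the others.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class PartitionLogSketch {
        private final Object lock = new Object();   // the lock the flame graph showed threads waiting on
        private final FileChannel activeSegment;

        public PartitionLogSketch(FileChannel activeSegment) {
            this.activeSegment = activeSegment;
        }

        // Called concurrently by request-handler threads
        public void append(ByteBuffer records) throws IOException {
            synchronized (lock) {
                activeSegment.write(records);       // write(2) to the active segment, inside the lock
            }
        }
    }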


  14. 14
    Identifying the cause of the lock contention
    •We want to identify:
    •The thread that acquired the lock
    •The method that thread was executing at that time
    [Diagram: request-handler threads blocked in append on the log file's lock, which was acquired by an unknown thread (THREAD-???)]


  15. 15
    The flame graph didn’t tell much
    •Curiously, there were no other unusually hot code paths on the flame graph


  16. What’s next?
    16
    •What we want:
    •To travel back to the time of the incident and analyze the stack traces
    => How?
    [Diagram: per-thread stack traces at the incident time — request-handler-0 and request-handler-1 blocked in pthread_cond_wait at acquire_lock, while THREAD-??? executes an unknown method]


  17. Why not JFR & JMC?
    17
    •Java Flight Recorder (JFR) is also a popular profiling tool
    •JFR provides a thread activity timeline


  18. Why not JFR & JMC?
    18
    •Unlike async-profiler, JFR instruments only a limited set of I/O types, e.g.
    •FileRead/Write events by FileChannel/File{Input,Output}Stream/RandomAccessFile
    •FileForce events by FileChannel
    •Kafka heavily depends on I/O types that JFR doesn't instrument, e.g. (see the sketch below)
    •FileChannel#transferTo: sendfile(2)
    •MappedByteBuffer: File access through mmap(2)
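
    As an illustration, a minimal sketch of the mmap(2) path (the file name is just an example; Kafka's offset index files are accessed this way): the reads are served by page faults, so JFR's FileRead events never see them. FileChannel#transferTo is sketched later on the sendfile(2) slide.

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel index = FileChannel.open(Path.of("00000000000000000000.index"),
                    StandardOpenOption.READ)) {
                // File access through mmap(2): no read(2) calls, hence no JFR FileRead events
                MappedByteBuffer buf = index.map(FileChannel.MapMode.READ_ONLY, 0, index.size());
                long sum = 0;
                while (buf.hasRemaining()) {
                    sum += buf.get();   // each access may fault a page in from disk
                }
                System.out.println("byte sum: " + sum);
            }
        }
    }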


  19. Why not just take thread dumps periodically?
    19
    •e.g.
    •while true; do
        jcmd $KAFKA_PID Thread.print > thread-dump.$(date +%s%3N)
        sleep 0.1
    done
    •Taking a thread dump requires bringing the JVM to a “safepoint”,
    during which all threads are paused
    •So it has a certain overhead


  20. Solution: async-profiler && JFR output!
    20
    •async-profiler supports outputting the profile in JFR (Java Flight Recorder) format
    •profiler.sh -d 30 -e wall -f profile.jfr $KAFKA_PID


  21. Idea: Generate threads timeline from JFR
    21


  22. Problem: No good tool for visualization
    22
    •JMC can’t generate a thread timeline view from async-profiler’s JFR output


  23. Let’s create a tool then!
    23
    •jfrv: A web-based tool to visualize async-profiler JFR output
    •https://ocadaruma.github.io/jfrv
    [Screenshot: jfrv UI — thread list, per-thread sample timeline, and the thread dump of the highlighted sample]


  24. Finally, we managed to visualize the threads

    at the time of the incident
    24


  25. Observation:

    One request-handler was writing to the log
    25
    pthread_cond_wait@@GLIBC_2.3.2
      at ObjectMonitor::EnterI
      at ObjectMonitor::enter
      at ObjectSynchronizer::slow_enter
      ...
      at kafka/log/Log.append:1062

    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158


  26. 26
    Observation: A blank period in profiled samples
    •One request-handler thread was writing to the log
    •Which is done while holding the lock
    •There’s a gap in the profiled samples for that thread
    => What does this mean?
    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158

    pthread_cond_wait@@GLIBC_2.3.2
      at ObjectMonitor::EnterI
      at ObjectMonitor::enter
      at ObjectSynchronizer::slow_enter
      ...
      at kafka/log/Log.append:1062


  27. How async-profiler (wall-clock mode) works
    27


  28. Threads cannot execute signal handlers

    in certain circumstances
    28
    TASK_UNINTERRUPTIBLE: The process is waiting on certain special classes of event,
    such as the completion of a disk I/O.
    If a signal is generated for a process in this state, then the signal is not delivered until the
    process emerges from this state.
    -- The Linux Programming Interface by Michael Kerrisk


  29. Long-running write(2) =>

    Signal handling delay => Blank period in profiled samples
    29


  30. This explains why we couldn’t see any clue in the flame graph
    30
    •write(2) = 1 sample, foo_method = 2 samples
    •Even though write(2) actually took longer than foo_method


  31. Flame graphs don’t account for samples correctly

    when threads are in TASK_UNINTERRUPTIBLE
    31
    •A flame graph is a visualization that represents sample counts by bar length
    •It may not reflect the “actual duration” correctly due to the signal handling delay
    [Diagram: example flame graph of a thread with 6 samples, split across Thread.sleep, foo_method, and write(2)]


  32. 32
    i.e. write(2) actually took a long time
    •The gap was likely due to the signal handling delay during the long-running write(2)
    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158


  33. Why did write(2) take a long time?
    33
    •write(2) is generally not supposed to block
    •On Linux, write(2) just copies data into pages in memory (the page cache)
    •The kernel’s writeback task then writes the data to the underlying device asynchronously (see the sketch below)
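
    A minimal sketch of this behavior (illustrative only, not from the deck): a plain write() returns as soon as the data lands in the page cache, while an explicit force() is what actually waits for the device.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class WriteLatencySketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Path.of(args[0]),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.allocate(1 << 20);   // 1 MiB of zeros
                long t0 = System.nanoTime();
                ch.write(buf);                                   // usually just a copy into the page cache
                long afterWrite = System.nanoTime();
                ch.force(false);                                 // waits until the kernel flushes the pages
                long afterForce = System.nanoTime();
                System.out.printf("write: %d us, force: %d us%n",
                        (afterWrite - t0) / 1_000, (afterForce - afterWrite) / 1_000);
            }
        }
    }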


  34. Observation: Simultaneous

    blank period in another thread
    34
    __GI_sendfile64
      at sun/nio/ch/FileChannelImpl.transferTo0
      at sun/nio/ch/FileChannelImpl.transferToDirectlyInternal:501
      at sun/nio/ch/FileChannelImpl.transferToDirectly:566
      ...
      at org/apache/kafka/common/network/KafkaChannel.send:429

    __write
      at sun/nio/ch/FileDispatcherImpl.write0
      at sun/nio/ch/FileDispatcherImpl.write:62
      at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
      ...
      at kafka/log/LogSegment.append:158
    •i.e. sendfile(2) took a long time at the same time


  35. How does Kafka use sendfile(2)
    35
    •sendfile(2) is a system call that copies data from one file descriptor to another
    •Kafka uses sendfile(2) to send records directly to consumers, avoiding a copy of the data into user space (see the sketch below)
    [Diagram: a network thread serves a consumer fetch request by sending data from the filesystem to the socket via sendfile(2)]
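
    A minimal Java sketch of this zero-copy path (not Kafka's actual code; the file path, host, and port are made up): FileChannel#transferTo hands the file region to the kernel, which on Linux becomes sendfile(2).

    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SendfileSketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel log = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
                long position = 0;
                long remaining = log.size();
                while (remaining > 0) {
                    // Data flows from the page cache straight to the socket, never copied to user space
                    long sent = log.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }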


  36. Does sendfile(2) block?
    36
    •Kafka is streaming middleware. Usually, consumers read the latest data from the tail of the log
    •That data is "hot" enough that it likely resides in the page cache (memory)
    [Diagram: the producer writes and the consumer reads at the tail of the log segment file, whose pages are in the page cache; only older pages are not in the page cache]


  37. sendfile(2) could block
    37
    •However, consumers sometimes read old data to catch up
    •If the data is not in the page cache, sendfile(2) needs to read it from the device (HDD), which is significantly slower
    •ref: https://logmi.jp/tech/articles/320330
    [Diagram: the producer writes to the tail of the log segment file (in the page cache), while the consumer reads an old portion that is not in the page cache]


  38. In fact, there was a lagging consumer at the time of the incident
    38
    •This is very likely what caused sendfile(2) to block


  39. How do slow sendfile(2) and slow write(2) relate?
    39
    •At the Kafka code level, network threads and request-handler threads don’t share any lock
    •write(2) and sendfile(2) simply slowed down at the same time
    •Hypothesis: lock contention inside the kernel
    [Diagram: the request-handler thread and the network thread both access the same log segment file — some lock inside the kernel???]


  40. Experiment
    40
    •We wrote a small program to check whether the hypothesis is plausible
    •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416
    [Diagram: experiment setup — 1. evict the file from the page cache, 2. the reader calls sendfile(2) on it, 3. the writer calls write(2) on the same file]
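
    The actual experiment code is in the gist above; the following is only a rough sketch of the same idea (file path, chunk sizes, and timings are made up): evict the file from the page cache beforehand, then have one thread stream it out with transferTo (sendfile(2)) while another thread appends to it and times each write(2).

    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ContentionSketch {
        public static void main(String[] args) throws Exception {
            Path file = Path.of(args[0]);   // a large file, evicted from the page cache beforehand

            // Reader: stream the cold file out via transferTo, i.e. sendfile(2)
            Thread reader = new Thread(() -> {
                try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
                     FileChannel sink = new FileOutputStream("/dev/null").getChannel()) {
                    long pos = 0;
                    long size = src.size();
                    while (pos < size) {
                        pos += src.transferTo(pos, 8 * 1024 * 1024, sink);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            reader.start();

            // Writer: append small chunks to the same file and time each write(2)
            try (FileChannel out = FileChannel.open(file, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                for (int i = 0; i < 100; i++) {
                    buf.clear();
                    long t0 = System.nanoTime();
                    out.write(buf);   // normally a quick page-cache copy, but can block behind sendfile(2)
                    System.out.printf("write took %d ms%n", (System.nanoTime() - t0) / 1_000_000);
                    Thread.sleep(100);
                }
            }
            reader.join();
        }
    }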


  41. Result: write(2) indeed blocked during sendfile(2)
    41
    •We injected a read delay into the device (using dm-delay)
    •Writes to the file blocked until sendfile(2) returned

    $ java App3 /path/to/test.bin
    [2022-08-08T15:38:07.702003] Took 500 ms to transfer
    [2022-08-08T15:38:07.702033] Took 450 ms to write
    ...


  42. Reading Linux kernel source code
    42
    •Next, we needed to identify the root cause of the contention
    •After reading the Linux kernel source code, we found that both paths require a lock on the inode:
    • xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336
    • xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723


  43. inode lock contention in xfs was the root cause!
    43
    •There's a discussion about this on the linux-fsdevel mailing list:
    •https://lore.kernel.org/linux-fsdevel/[email protected]
    "xfs: reduce ilock contention on buffered randrw workload"


  44. Overall flow of the phenomenon
    44
    •1. A consumer started to read old data, which caused slow sendfile(2) calls
    because the data had to be read from a slow device (HDD)
    [Diagram: the consumer’s fetch is served by a network thread via sendfile(2) from a log segment file whose pages are not in the page cache]


  45. Overall flow
    45
    •2. The slow sendfile(2) blocked produce-request handling on write(2)
    due to the inode lock contention in xfs
    [Diagram: the request-handler thread, holding the Kafka log lock, is blocked in write(2) to the log segment file while the network thread’s sendfile(2) is in progress]


  46. Overall flow
    46
    •3. Meanwhile, other produce requests were also blocked, due to the lock contention inside Kafka
    [Diagram: requests from other producers pile up on request-handler threads that are waiting for the Kafka log lock held by the thread stuck in write(2)]


  47. Overall flow
    47
    •4. Request handlers became busy, so dequeuing from the request queue was delayed
    => This resulted in the 99th-percentile produce response time degradation


  48. Solution
    48
    •Root cause in short:
    a lagging consumer (which reads old data that is no longer in the page cache) and
    the producer were reading and writing the same file (inode)
    [Diagram: the producer writes to, and the lagging consumer reads from, the same log segment file]


  49. Solution
    49
    •The producer always writes to the last (active) log segment file
    •=> Set a smaller maximum segment file size so that segment files are rotated
    before their data is evicted from the page cache (a config sketch follows below)
    [Diagram: after rotation, the producer writes to the new log segment file (in the page cache) while the lagging consumer reads from an old segment file, so they no longer touch the same inode]
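
    For reference, a hedged sketch of how such a per-topic segment size override could be applied with Kafka's AdminClient (the topic name, bootstrap server, and 256 MiB value are made up; the deck doesn't state the actual setting used):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class ShrinkSegmentSize {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                // Rotate segments earlier so the segment read by lagging consumers
                // is no longer the one the producer is appending to
                AlterConfigOp op = new AlterConfigOp(
                        new ConfigEntry("segment.bytes", String.valueOf(256 * 1024 * 1024)),
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
            }
        }
    }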


  50. Conclusion
    50
    •We can perform time-travel stack trace analysis using async-profiler’s JFR output
    in wall-clock mode
    •It greatly helps performance issue investigations
    •Some system calls can produce confusing flame graphs
    •While executing certain system calls (e.g. write(2), sendfile(2)),
    signal handlers cannot be executed
    •=> async-profiler cannot take samples in the meantime
    •=> Such calls may not show up in flame graphs even when their actual duration is long
