Slide 1

Slide 1 text

Time travel stack trace analysis with async-profiler: an application to a Kafka performance issue investigation
Haruki Okada, 2023.06.04

Slide 2

Slide 2 text

Speaker •Haruki Okada •Twitter/GitHub: ocadaruma •Lead of the LINE IMF team •The team responsible for developing/maintaining the company-wide Kafka platform

Slide 3

Slide 3 text

Apache Kafka •Streaming middleware •Written in Scala/Java; runs on the JVM

Slide 4

Slide 4 text

Apache Kafka at LINE •One of the most popular pieces of middleware at LINE •Use cases: •Log pipeline •Job queueing •Pub/Sub events •Scale: •Daily data in/out: > 2 petabytes •Daily message inflow: > 1 trillion messages •Peak messages/sec: > 28 million messages

Slide 5

Slide 5 text

Performance engineering •We have put a lot of engineering effort into making our clusters highly performant and reliable •Refs: •“Reliability Engineering Behind The Most Trusted Kafka Platform” (LINE DEVELOPER DAY 2019) •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP" (LINE DEVELOPER DAY 2021)

Slide 6

Slide 6 text

Agenda •Overview of the performance issue on our Kafka cluster •Investigation process •How we use async-profiler •Solution •Conclusion

Slide 7

Slide 7 text

Phenomenon •Produce response time degraded to over 600 ms on our Kafka broker •The usual 99th percentile produce response time is < 50 ms

Slide 8

Slide 8 text

Produce response time breakdown •High “RequestQueue” time

Slide 9

Slide 9 text

How requests are processed in Kafka •An increase in “RequestQueue” time implies requests sit in the request queue for a long time (see the conceptual sketch below)
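A minimal, conceptual sketch of that request path (these are not Kafka's actual classes; the queue capacity and thread count are hypothetical stand-ins for the queued.max.requests and num.io.threads configs):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RequestPath {
    // A parsed request plus the time it entered the queue, so a handler can
    // measure how long it waited there ("RequestQueue" time).
    record Request(byte[] payload, long enqueuedAtMs) {}

    // Bounded queue between the network threads and the request handler threads.
    static final BlockingQueue<Request> requestQueue = new ArrayBlockingQueue<>(500);

    public static void main(String[] args) {
        // A fixed pool of request handler threads drains the queue. If all of
        // them are busy with slow requests, new requests pile up in the queue
        // and their "RequestQueue" time grows.
        for (int i = 0; i < 8; i++) {
            new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        Request req = requestQueue.take();
                        long queueTimeMs = System.currentTimeMillis() - req.enqueuedAtMs();
                        handle(req, queueTimeMs);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "request-handler-" + i).start();
        }
    }

    // A network thread would call something like this after reading a request
    // from a socket.
    static void enqueue(byte[] payload) throws InterruptedException {
        requestQueue.put(new Request(payload, System.currentTimeMillis()));
    }

    static void handle(Request req, long queueTimeMs) {
        // Append to the log, build the response, etc.
    }
}
```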

Slide 10

Slide 10 text

Request handlers were busy

Slide 11

Slide 11 text

A producer occupied the request handler threads •At the time of the incident, we observed high thread-time utilization for a specific producer •This implies that handling requests from this producer took a long time

Slide 12

Slide 12 text

Identifying the hot code path: async-profiler •Popular low-overhead sampling profiler for Java •https://github.com/jvm-profiling-tools/async-profiler •We use wall-clock profiling, which samples all threads regardless of whether they are consuming CPU time or not •profiler.sh -d 30 -e wall -f profile.html $KAFKA_PID

Slide 13

Slide 13 text

async-profiler result •The hottest part was waiting for the lock in Log.append
[diagram: request-handler-0, request-handler-1, ... contending on the Log lock to append to the log file]

Slide 14

Slide 14 text

Identifying the cause of the lock contention •We want to identify: •The thread that acquired the lock •The method that thread was executing at that time
[diagram: request-handler threads waiting to append to the log file while an unknown thread (THREAD-???) holds the lock]

Slide 15

Slide 15 text

The flame graph didn’t tell us much •Curiously, there were no other unusually hot code paths on the flame graph

Slide 16

Slide 16 text

What’s next? •What we want: to travel back to the time of the incident and analyze stack traces => HOW?
[diagram: per-thread stack traces at that moment, e.g. request-handler-0 blocked in pthread_cond_wait at acquire_lock while THREAD-??? executes some unknown method]

Slide 17

Slide 17 text

Why not JFR & JMC? •Java Flight Recorder (JFR) is also a popular profiling tool •JFR provides a thread activity timeline (viewable in JDK Mission Control)

Slide 18

Slide 18 text

Why not JFR & JMC? •Unlike async-profiler, JFR instruments only limited types of I/O, e.g.: •FileRead/FileWrite events for FileChannel/File{Input,Output}Stream/RandomAccessFile •FileForce events for FileChannel •Kafka heavily depends on I/O types that JFR doesn’t support, e.g.: •FileChannel#transferTo: sendfile(2) •MappedByteBuffer: file access through mmap(2)

Slide 19

Slide 19 text

Why not just take thread dumps periodically? •e.g.
while true; do
  jcmd $KAFKA_PID Thread.print > thread-dump.$(date +%s%3N)
  sleep 0.1
done
•Taking a thread dump requires bringing the JVM to a “safepoint”, during which all threads are paused •This has a certain overhead

Slide 20

Slide 20 text

Solution: async-profiler && JFR output! •async-profiler supports outputting the profile in JFR (Java Flight Recorder) format •profiler.sh -d 30 -e wall -f profile.jfr $KAFKA_PID

Slide 21

Slide 21 text

Idea: Generate a per-thread timeline from the JFR output (see the parsing sketch below)
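To illustrate the idea, here is a minimal sketch (not the eventual tool) that groups samples by thread using the JDK’s built-in JFR consumer API. It assumes async-profiler’s wall-clock samples appear as jdk.ExecutionSample events, which may differ between profiler versions:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedThread;
import jdk.jfr.consumer.RecordingFile;

public class JfrThreadTimeline {
    public static void main(String[] args) throws IOException {
        // Group samples by thread name for a per-thread timeline listing.
        Map<String, List<RecordedEvent>> samplesByThread = new TreeMap<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(Path.of(args[0]))) {
            // Assumption: wall-clock samples are recorded as jdk.ExecutionSample.
            if (!"jdk.ExecutionSample".equals(event.getEventType().getName())) {
                continue;
            }
            RecordedThread thread = event.getThread("sampledThread");
            String name = thread.getJavaName() != null ? thread.getJavaName() : thread.getOSName();
            samplesByThread.computeIfAbsent(name, k -> new ArrayList<>()).add(event);
        }
        // Print each thread's samples in order: timestamp plus the top stack frame.
        samplesByThread.forEach((threadName, events) -> {
            System.out.println(threadName);
            for (RecordedEvent e : events) {
                RecordedFrame top = e.getStackTrace().getFrames().get(0);
                System.out.printf("  %s  %s.%s%n",
                        e.getStartTime(),
                        top.getMethod().getType().getName(),
                        top.getMethod().getName());
            }
        });
    }
}
```

Run against the profile.jfr produced above, this prints a crude textual timeline per thread; the remaining problem is visualizing it.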

Slide 22

Slide 22 text

Problem: No good tool for visualization •JMC can’t generate a thread timeline view from async-profiler’s JFR output

Slide 23

Slide 23 text

Let’s create a tool then! •jfrv: A web-based tool to visualize async-profiler JFR output •https://ocadaruma.github.io/jfrv
[screenshot: thread list, per-thread sample timeline, thread dump of the highlighted sample]

Slide 24

Slide 24 text

Finally, we succeeded in visualizing the threads at the time of the incident

Slide 25

Slide 25 text

Observation: One request-handler was writing to the log

[blocked on the Log lock]
pthread_cond_wait@@GLIBC_2.3.2
  at ObjectMonitor::EnterI
  at ObjectMonitor::enter
  at ObjectSynchronizer::slow_enter
  ...
  at kafka/log/Log.append:1062

[writing to the log]
__write
  at sun/nio/ch/FileDispatcherImpl.write0
  at sun/nio/ch/FileDispatcherImpl.write:62
  at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
  ...
  at kafka/log/LogSegment.append:158

Slide 26

Slide 26 text

Observation: A blank period in the profiled samples •One request-handler thread was writing to the log •This is done while holding the lock •There’s a gap in the profiled samples for that thread => What does it mean?

__write
  at sun/nio/ch/FileDispatcherImpl.write0
  at sun/nio/ch/FileDispatcherImpl.write:62
  at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
  ...
  at kafka/log/LogSegment.append:158

pthread_cond_wait@@GLIBC_2.3.2
  at ObjectMonitor::EnterI
  at ObjectMonitor::enter
  at ObjectSynchronizer::slow_enter
  ...
  at kafka/log/Log.append:1062

Slide 27

Slide 27 text

How async-profiler (wall-clock mode) works •Wall-clock mode periodically sends a signal to each thread; the signal handler records the thread’s current stack trace as a sample

Slide 28

Slide 28 text

Threads cannot execute signal handlers in certain circumstances •TASK_UNINTERRUPTIBLE: “The process is waiting on certain special classes of event, such as the completion of a disk I/O. If a signal is generated for a process in this state, then the signal is not delivered until the process emerges from this state.” -- The Linux Programming Interface by Michael Kerrisk

Slide 29

Slide 29 text

Long-running write(2) => signal handling delay => blank period in the profiled samples

Slide 30

Slide 30 text

This explains why we couldn’t see any clue in the flame graph •write(2) = 1 sample, foo_method = 2 samples •Even though write(2) actually takes longer than foo_method •The sampling signal sent while the thread is blocked in write(2) is delivered only after the syscall returns, so the whole blocked period is represented by at most one sample

Slide 31

Slide 31 text

The flame graph doesn’t account for samples correctly when threads are in TASK_UNINTERRUPTIBLE •A flame graph is a visualization that represents sample counts by bar length •It may not reflect the actual duration correctly because of the signal handling delay
[diagram: a thread with 6 samples split across Thread.sleep, foo_method, and write(2)]

Slide 32

Slide 32 text

i.e., write(2) actually took a long time •The gap was likely due to the signal handling delay during the long-running write(2)

__write
  at sun/nio/ch/FileDispatcherImpl.write0
  at sun/nio/ch/FileDispatcherImpl.write:62
  at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
  ...
  at kafka/log/LogSegment.append:158

Slide 33

Slide 33 text

Why did write(2) take a long time? •write(2) is not supposed to block in general •On Linux, write(2) just copies data into a “page” in memory (the page cache) •The kernel’s writeback task writes the data to the underlying device asynchronously

Slide 34

Slide 34 text

Observation: A simultaneous blank period in another thread •i.e., sendfile(2) took a long time at the same time

__GI_sendfile64
  at sun/nio/ch/FileChannelImpl.transferTo0
  at sun/nio/ch/FileChannelImpl.transferToDirectlyInternal:501
  at sun/nio/ch/FileChannelImpl.transferToDirectly:566
  ...
  at org/apache/kafka/common/network/KafkaChannel.send:429

__write
  at sun/nio/ch/FileDispatcherImpl.write0
  at sun/nio/ch/FileDispatcherImpl.write:62
  at sun/nio/ch/IOUtil.writeFromNativeBuffer:113
  ...
  at kafka/log/LogSegment.append:158

Slide 35

Slide 35 text

How does Kafka use sendfile(2)? •sendfile(2) is a system call that copies data from one file descriptor to another •Kafka uses sendfile(2) to send records directly to the consumer, avoiding a copy into user space (see the sketch below)
[diagram: network threads answering a consumer fetch request with sendfile(2) from the filesystem to the socket]
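A minimal sketch of that zero-copy path, not Kafka’s actual code (the segment path, offsets, and socket here are hypothetical): FileChannel#transferTo asks the kernel to move file data straight to the socket, which on Linux is typically implemented with sendfile(2).

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    // Send `length` bytes of a log segment, starting at `offset`, to a consumer's
    // socket without copying the data into user space.
    static long sendSegment(Path segment, SocketChannel consumerSocket,
                            long offset, long length) throws IOException {
        try (FileChannel log = FileChannel.open(segment, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < length) {
                // Backed by sendfile(2) on Linux for socket targets.
                long n = log.transferTo(offset + sent, length - sent, consumerSocket);
                if (n <= 0) {
                    break; // nothing more transferable right now (or end of file)
                }
                sent += n;
            }
            return sent;
        }
    }
}
```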

Slide 36

Slide 36 text

Does sendfile(2) block? •Kafka is streaming middleware; usually, consumers read the latest data from the tail of the log •That data is “hot” enough that it most likely resides in the page cache (memory)
[diagram: producer writes and consumer reads both hitting pages of the log segment file that are in the page cache]

Slide 37

Slide 37 text

sendfile(2) could block •However, consumers sometimes read old data to catch up •If the data is not in the page cache, sendfile(2) needs to read it from the device (HDD), which is significantly slower •refs: https://logmi.jp/tech/articles/320330
[diagram: consumer reading old pages of the log segment file that are not in the page cache]

Slide 38

Slide 38 text

Actually, there was a lagging consumer at the time of the incident •This is very likely the cause of the blocking sendfile(2)

Slide 39

Slide 39 text

How do slow sendfile(2) and slow write(2) relate? •At the Kafka code level, the network threads and request handlers don’t share any lock •write(2) and sendfile(2) simply slowed down at the same time •Hypothesis: lock contention inside the kernel
[diagram: a request-handler thread and a network thread both accessing the same log segment file through an unknown lock]

Slide 40

Slide 40 text

Experiment •Wrote a small program to check whether the hypothesis is plausible •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416 •Setup: 1. evict the file from the page cache; 2. a reader issues sendfile(2) on it; 3. a writer issues write(2) to it (see the sketch below)
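A hypothetical sketch of that reproduction (the linked gist may differ in its details): one thread reads the file with FileChannel#transferTo, typically sendfile(2) on Linux, while another thread appends to the same file with write(2). It assumes the file’s pages have already been evicted from the page cache and that reads from the device are slow (e.g. via dm-delay, as on the next slide).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendfileVsWrite {
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);

        // Reader: transfer the whole file to /dev/null; on Linux this is
        // typically done inside the kernel with sendfile(2)/splice.
        Thread reader = new Thread(() -> {
            try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
                 FileChannel devNull = FileChannel.open(Path.of("/dev/null"),
                                                        StandardOpenOption.WRITE)) {
                long start = System.currentTimeMillis();
                src.transferTo(0, src.size(), devNull); // may transfer fewer bytes; fine for a sketch
                System.out.printf("Took %d ms to transfer%n", System.currentTimeMillis() - start);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });

        // Writer: a single small write(2) to the same file while the transfer runs.
        Thread writer = new Thread(() -> {
            try (FileChannel dst = FileChannel.open(file, StandardOpenOption.WRITE,
                                                    StandardOpenOption.APPEND)) {
                long start = System.currentTimeMillis();
                dst.write(ByteBuffer.allocate(4096));
                System.out.printf("Took %d ms to write%n", System.currentTimeMillis() - start);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });

        reader.start();
        Thread.sleep(50); // let the slow sendfile(2) start first
        writer.start();
        reader.join();
        writer.join();
    }
}
```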

Slide 41

Slide 41 text

Result: write(2) indeed blocked during sendfile(2) •Injected a read delay into the device (using dm-delay) •Writing to the file blocked until the sendfile(2) returned

$ java App3 /path/to/test.bin
[2022-08-08T15:38:07.702003] Took 500 ms to transfer
[2022-08-08T15:38:07.702033] Took 450 ms to write
...

Slide 42

Slide 42 text

Reading the Linux kernel source code •Next, we needed to identify the root cause of the contention •After reading the Linux kernel source code, we found that both code paths require a lock on the inode •xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336 •xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723

Slide 43

Slide 43 text

inode lock contention in xfs was the root cause! •There's a discussion on the linux-fsdevel ML: "xfs: reduce ilock contention on buffered randrw workload" •https://lore.kernel.org/linux-fsdevel/20190404165737.30889-1-amir73il@gmail.com

Slide 44

Slide 44 text

Overall flow of the phenomenon •1. A consumer started to read old data, which caused a slow sendfile(2) because the data had to be read from the slow device (HDD)
[diagram: network threads serving the consumer’s fetch request with sendfile(2) from the log segment file to the socket]

Slide 45

Slide 45 text

Overall flow •2. The slow sendfile(2) blocked produce-request handling on write(2) due to the inode lock contention in xfs
[diagram: a request handler thread holding the Log lock, blocked in write(2) to the same log segment file while serving a producer’s request]

Slide 46

Slide 46 text

Overall flow •3. Meanwhile, other produce requests were also blocked due to the Log lock contention inside Kafka
[diagram: other request handler threads serving producers, waiting on the Log lock]

Slide 47

Slide 47 text

Overall flow •4. The request handlers became busy, so dequeuing from the request queue was delayed => resulting in the 99th percentile produce response time degradation

Slide 48

Slide 48 text

Solution •Root cause in short: a lagging consumer (which requires old data that is not in the page cache) and the producer read/write the same file (inode)
[diagram: producer writing to and consumer reading from the same log segment file]

Slide 49

Slide 49 text

Solution •The producer always writes to the last (active) log segment file •=> Set a smaller maximum segment file size so that segment files are rotated before their data is evicted from the page cache (see the example below)
[diagram: producer writing to the active segment file while the consumer reads an old segment file]
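For example, the topic-level segment.bytes setting (the per-topic counterpart of the broker’s log.segment.bytes, default 1 GiB) can be lowered with kafka-configs.sh; the topic name and the 256 MiB value here are hypothetical, and the right size depends on the inflow rate and the available page cache:

kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config segment.bytes=268435456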

Slide 50

Slide 50 text

Conclusion •We can perform time travel stack trace analysis using async-profiler’s JFR output in wall-clock mode •It greatly helps performance issue investigations •Some system calls may cause confusing flame graphs •While executing certain system calls (e.g. write(2), sendfile(2)),
 signal handlers cannot be executed •=> async-profiler cannot take samples in the meantime •=> May not appear in flame graphs even when actual duration is long