Slide 1

Investigating Kafka performance issue caused by lock contention in xfs
Haruki Okada, 2022.12.16

Slide 2

Speaker
•Haruki Okada (Twitter/GitHub: ocadaruma)
•Lead of the LINE IMF team
 •The team is responsible for developing and maintaining the company-wide Kafka platform
•Kafka contributor
 •KIP-764: Configurable backlog size for creating Acceptor
 •Patch for KAFKA-13572

Slide 3

Kafka at LINE
•One of the most popular pieces of middleware at LINE
•We provide a multi-tenant shared Kafka cluster
•Our largest cluster:
 •Daily data in/out: 2.1 PB
 •Daily message inflow: 1.2 trillion messages

Slide 4

Performance engineering
•We have put a lot of engineering effort into making our cluster highly performant and reliable
•e.g.
 •"Reliability Engineering Behind The Most Trusted Kafka Platform" (LINE DEVELOPER DAY 2019)
 •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP" (LINE DEVELOPER DAY 2021)

Slide 5

Agenda
•Overview of the phenomenon we encountered
•Investigation process
•Solution
•Conclusion

Slide 6

Phenomenon
•One of our brokers experienced high produce response time (over 600 ms at the 99th percentile)
•The usual produce response time is ~50 ms at the 99th percentile
•This violates our cluster's SLO: a serious performance issue

Slide 7

How requests are processed in Kafka
[Diagram: Producer -> Network Threads -> Request Queue -> Request Handler Threads -> Response Queue -> Network Threads -> Producer]
•An increase in RequestQueue time implies that request handler threads were busy and delayed in dequeuing from the request queue (see the metric sketch below)
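
A minimal sketch of checking this metric over JMX, assuming JMX is enabled on the broker; the host, port, and topic of interest are illustrative, while the MBean name is the standard kafka.network RequestMetrics one.

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ProduceQueueTime {
      public static void main(String[] args) throws Exception {
          // Hypothetical JMX endpoint of a broker; adjust host/port for your environment
          JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
          try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
              MBeanServerConnection conn = connector.getMBeanServerConnection();
              // Time Produce requests spend waiting in the request queue before a handler dequeues them
              ObjectName queueTime = new ObjectName(
                      "kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request=Produce");
              System.out.println("Produce RequestQueueTimeMs p99 = "
                      + conn.getAttribute(queueTime, "99thPercentile"));
          }
      }
  }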

Slide 8

Actually, request handlers were busy
[Diagram: the same request flow (Producer -> Network Threads -> Request Queue -> Request Handler Threads -> Response Queue), with the Request Handler Threads busy]

Slide 9

What makes request handlers busy?
•Suspect 1: Too many requests were sent
 •=> No. We saw no spike in request count
•Suspect 2: It took a long time to process requests
 •=> Possible. At the time of the incident, we observed high utilization from a specific client, which implies it took a long time to process requests from that client

Slide 10

Identifying the hot code path: async-profiler
•A famous low-overhead sampling profiler
 •https://github.com/jvm-profiling-tools/async-profiler
•We succeeded in taking a profile when the phenomenon recurred
 •Used wall-clock profiling, which samples all threads regardless of whether a thread consumed CPU time or not
 •Profiled each thread separately
 •profiler.sh -d 30 -e wall -t -f profile.html $KAFKA_PID

Slide 11

Some request handlers spent significant time waiting to acquire the lock when appending records to the log
[Diagram: request-handler-0, request-handler-1, request-handler-2, ... contending for the Log's lock on append]

Slide 12

Identifying the cause of the lock contention
•The profile implies that there was a thread executing a time-consuming frame while holding the lock
•We want to identify that thread and that frame
•We need a timeline of thread dumps
[Diagram: request handlers blocked in pthread_cond_wait while another thread executes some_blocking_method]

Slide 13

Async profiler's JFR output
•Async-profiler supports outputting the profile in JFR (Java Flight Recorder) format
•A series of profile samples, each consisting of:
 •Timestamp
 •Thread
 •Thread state
 •Stack trace
•=> We can construct a timeline of thread dumps from the JFR output (see the sketch below)
•profiler.sh -d 30 -e wall -t -f profile.jfr $KAFKA_PID
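
For example, those samples can be read back with the JDK's own JFR parser; this is a minimal sketch assuming the wall-clock samples appear as jdk.ExecutionSample events, which is how async-profiler's JFR output is commonly consumed.

  import java.nio.file.Path;
  import jdk.jfr.consumer.RecordedEvent;
  import jdk.jfr.consumer.RecordedFrame;
  import jdk.jfr.consumer.RecordingFile;

  public class JfrTimeline {
      public static void main(String[] args) throws Exception {
          try (RecordingFile jfr = new RecordingFile(Path.of("profile.jfr"))) {
              while (jfr.hasMoreEvents()) {
                  RecordedEvent e = jfr.readEvent();
                  // Each sample carries a timestamp, the sampled thread, its state, and a stack trace
                  if (!"jdk.ExecutionSample".equals(e.getEventType().getName())) continue;
                  System.out.printf("%s thread=%s state=%s%n",
                          e.getStartTime(),
                          e.getThread("sampledThread").getJavaName(),
                          e.getValue("state"));
                  for (RecordedFrame f : e.getStackTrace().getFrames()) {
                      System.out.println("  at " + f.getMethod().getType().getName()
                              + "." + f.getMethod().getName());
                  }
              }
          }
      }
  }

Grouping the samples by thread and sorting them by timestamp gives exactly the per-thread timeline of thread dumps described above.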

Slide 14

Visualizing the JFR output
•jfrv: a tool to visualize async-profiler's JFR output
 •https://ocadaruma.github.io/jfrv/#/execution-sample
[Screenshot: thread list, a sample timeline per thread, and the thread dump for the clicked sample]

Slide 15

Observation: blocking `write` when appending records
@__write
sun/nio/ch/FileDispatcherImpl@write0
sun/nio/ch/FileDispatcherImpl@write
sun/nio/ch/IOUtil@writeFromNativeBuffer
sun/nio/ch/IOUtil@write
sun/nio/ch/FileChannelImpl@write
org/apache/kafka/common/record/MemoryRecords@writeFullyTo
org/apache/kafka/common/record/FileRecords@append
kafka/log/LogSegment@append

Slide 16

Observation: blocking `write`
•We found that the `write` system call sometimes blocked while other request handlers were waiting to acquire the lock
•This could be the direct cause of the busy request handlers, i.e.:
 •One request handler thread acquired the lock to write produced data to the log segment =>
 •the `write` blocked somehow =>
 •meanwhile, there were produce requests to the same partition, but all of them got blocked

Slide 17

Does `write` block?
•In the first place, `write` is not supposed to block in general
•In Linux, what `write` does is just copy the data into a `page` in memory (the page cache)
•The kernel's writeback task writes the data to the actual disk asynchronously (see the sketch below)
[Diagram: a request handler `write`s produced data into a page; the writeback task asynchronously writes it to the HDD]
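
A minimal sketch of that asymmetry (the path is a placeholder and the numbers depend on the machine): a buffered write returns as soon as the data sits in the page cache, whereas an explicit flush has to wait for the device.

  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  public class PageCacheWrite {
      public static void main(String[] args) throws Exception {
          Path path = Path.of("/path/to/test.bin"); // placeholder path
          try (FileChannel ch = FileChannel.open(path,
                  StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
              ByteBuffer buf = ByteBuffer.allocate(4096);
              long t0 = System.nanoTime();
              ch.write(buf);   // normally fast: the data is only copied into the page cache (a dirty page)
              long t1 = System.nanoTime();
              ch.force(false); // flushes the dirty pages to the device, so this waits for real I/O
              long t2 = System.nanoTime();
              System.out.printf("write: %d us, force: %d us%n", (t1 - t0) / 1_000, (t2 - t1) / 1_000);
          }
      }
  }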

Slide 18

Observation: simultaneous blocking `sendfile`
Network thread (sending records to a consumer):
@__GI_sendfile64
sun/nio/ch/FileChannelImpl@transferTo0
sun/nio/ch/FileChannelImpl@transferToDirectlyInternal
sun/nio/ch/FileChannelImpl@transferToDirectly
sun/nio/ch/FileChannelImpl@transferTo
org/apache/kafka/common/network/PlaintextTransportLayer@transferFrom

Request handler thread (appending produced records):
@__write
sun/nio/ch/FileDispatcherImpl@write0
sun/nio/ch/FileDispatcherImpl@write
sun/nio/ch/IOUtil@writeFromNativeBuffer
sun/nio/ch/IOUtil@write
sun/nio/ch/FileChannelImpl@write
org/apache/kafka/common/record/MemoryRecords@writeFullyTo
org/apache/kafka/common/record/FileRecords@append
kafka/log/LogSegment@append

Slide 19

How does Kafka use `sendfile`?
•`sendfile` is a system call that copies data from one file descriptor to another
•Kafka uses `sendfile` to send records directly to the consumer, avoiding a copy of the data into user space (see the sketch below)
[Diagram: a Network Thread serves a fetch Request by `sendfile`ing data from the filesystem to the consumer's Socket as the Response]
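
Inside Kafka this path goes through FileRecords and PlaintextTransportLayer (as seen in the stack trace earlier), but the underlying JDK call is FileChannel.transferTo, roughly as in this sketch; the segment file name and consumer address are hypothetical.

  import java.net.InetSocketAddress;
  import java.nio.channels.FileChannel;
  import java.nio.channels.SocketChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  public class SendfileSketch {
      public static void main(String[] args) throws Exception {
          try (FileChannel segment = FileChannel.open(
                       Path.of("00000000000000000000.log"), StandardOpenOption.READ);
               SocketChannel consumer = SocketChannel.open(new InetSocketAddress("consumer-host", 9092))) {
              long position = 0;
              long remaining = segment.size();
              while (remaining > 0) {
                  // transferTo() maps to sendfile(2) on Linux: the kernel copies file pages
                  // straight to the socket without going through user-space buffers
                  long sent = segment.transferTo(position, remaining, consumer);
                  position += sent;
                  remaining -= sent;
              }
          }
      }
  }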

Slide 20

Does `sendfile` block?
•Kafka is a streaming middleware, so consumers usually read the latest data from the tail of the log
•Such data is "hot" enough that it likely resides in the page cache (memory) => very fast
[Diagram: the producer writes to the tail pages of the log segment file and the consumer reads the same pages from the page cache]

Slide 21

`sendfile` could block
•However, consumers sometimes try to read old data for catch-up
•If the data about to be copied by `sendfile` is not in the page cache, it has to be read from the device (HDD), which is significantly slower
•refs: https://logmi.jp/tech/articles/320330
 "Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages"
[Diagram: the producer writes to the tail pages while the lagging consumer reads old pages that are no longer in the page cache]

Slide 22

Actually, there was a lagging consumer
•This is highly likely the cause of blocking `sendfile`

Slide 23

How do blocking `sendfile` and blocking `write` relate?
•Hypothesis: some lock contention inside the kernel...?
[Diagram: a request-handler thread (`write`) and a network thread (`sendfile`) both access the same log segment file; is there a lock between them?]

Slide 24

Experiment
•Wrote a small program to check whether the hypothesis is plausible (a simplified sketch follows below)
 •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416
[Diagram: 1. evict the file from the page cache, 2. the reader calls `sendfile`, 3. the writer calls `write` on the same file]
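
The actual code is in the gist above; what follows is only a simplified sketch of the idea, assuming the file's pages were evicted from the page cache beforehand and reads from the device are slow: one thread drives a sendfile-style transfer while another measures how long a plain `write` to the same file takes.

  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  public class WriteVsSendfile {
      public static void main(String[] args) throws Exception {
          Path path = Path.of(args[0]); // a file on the xfs volume, evicted from the page cache beforehand
          Thread reader = new Thread(() -> {
              try (FileChannel src = FileChannel.open(path, StandardOpenOption.READ);
                   FileChannel sink = FileChannel.open(Path.of("/dev/null"), StandardOpenOption.WRITE)) {
                  long t0 = System.nanoTime();
                  // transferTo() uses sendfile(2) where possible; cold pages force slow device reads
                  src.transferTo(0, src.size(), sink);
                  System.out.printf("Took %d ms to transfer%n", (System.nanoTime() - t0) / 1_000_000);
              } catch (Exception e) {
                  throw new RuntimeException(e);
              }
          });
          reader.start();
          Thread.sleep(50); // give the transfer a head start
          try (FileChannel dst = FileChannel.open(path, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
              long t0 = System.nanoTime();
              // On xfs this write can block until the concurrent sendfile releases the inode lock
              dst.write(ByteBuffer.wrap("x".getBytes(StandardCharsets.UTF_8)));
              System.out.printf("Took %d ms to write%n", (System.nanoTime() - t0) / 1_000_000);
          }
          reader.join();
      }
  }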

Slide 25

Result: `write` indeed blocked during `sendfile`
•Injected a read delay into the device (using dm-delay)
•Writing to the file blocked until the `sendfile` returned

$ java App3 /path/to/test.bin
[2022-08-08T15:38:07.702003] Took 500 ms to transfer
[2022-08-08T15:38:07.702033] Took 450 ms to write
...

Slide 26

Reading the Linux kernel source code
•`write` and `sendfile` could actually contend
•Next, we needed to identify the root cause of the contention
•After reading the Linux kernel source code, we found that both require a lock on the inode
 •xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336
 •xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723

Slide 27

inode lock contention in xfs
•There is a discussion about this on the linux-fsdevel ML
 •https://lore.kernel.org/linux-fsdevel/[email protected]
 "xfs: reduce ilock contention on buffered randrw workload"

Slide 28

Overall flow
•1. A consumer started to read old data, which caused a blocking `sendfile` due to reading data from the slow device (HDD)
[Diagram: the fetch response to the consumer is sent with `sendfile` from the log segment file through the socket]

Slide 29

Overall flow
•2. The `sendfile` blocked produce-request handling on `write` (for appending data) due to lock contention in xfs
[Diagram: a Request Handler Thread holding the log lock is blocked on `write` to the same log segment file]

Slide 30

Overall flow
•3. Meanwhile, other produce requests also blocked due to lock contention in Kafka (the log lock)
[Diagram: other Request Handler Threads wait on the log lock held by the thread blocked in `write`]

Slide 31

Overall flow
•4. Request handlers got busy, so they were delayed in dequeuing from the request queue
 => Resulted in 99th percentile produce response time degradation

Slide 32

Solution
•In short, the problem was:
 •Some parts of the active (last) log segment were evicted from the page cache (produced data is appended to the active segment)
 •Hence blocking `sendfile` calls (for catch-up consumers) happened while there were `write` calls (for producers)
•=> Set a smaller log segment size to ensure the active segment stays in the page cache
 •`log.segment.bytes` config (an illustrative sketch follows below)
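
For illustration, the topic-level counterpart of the broker-wide `log.segment.bytes` is `segment.bytes`; here is a sketch of lowering it per topic with the AdminClient (bootstrap server, topic name, and the 256 MiB value are hypothetical).

  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.AlterConfigOp;
  import org.apache.kafka.clients.admin.ConfigEntry;
  import org.apache.kafka.common.config.ConfigResource;

  public class ShrinkSegmentSize {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.put("bootstrap.servers", "broker-host:9092"); // hypothetical broker
          try (Admin admin = Admin.create(props)) {
              ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "busy-topic");
              // segment.bytes is the per-topic override of the broker-wide log.segment.bytes
              AlterConfigOp op = new AlterConfigOp(
                      new ConfigEntry("segment.bytes", String.valueOf(256 * 1024 * 1024)),
                      AlterConfigOp.OpType.SET);
              admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
          }
      }
  }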

Slide 33

Solution
•Side effect:
 •Results in many log segment files =>
 •Kafka opens file descriptors for all of them
 •It also potentially creates memory maps for them
 •So we adjusted `nofile` and `vm.max_map_count` accordingly (example below)
•Result:
 •We have rarely observed performance degradation from the same cause since
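
For example (the values here are purely illustrative, not the ones actually used), the open-file limit for the broker user and the per-process memory map limit can be raised roughly like this:

  # /etc/security/limits.d/kafka.conf : raise the open-file limit for the broker user
  kafka  soft  nofile  262144
  kafka  hard  nofile  262144

  # /etc/sysctl.d/99-kafka.conf : allow more memory-mapped areas per process
  vm.max_map_count = 262144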

Slide 34

Conclusion
•Lock contention in xfs could cause a subtle performance issue in Kafka
•Async-profiler's JFR output is quite useful in investigations for seeing the interactions between threads