•Our team is responsible for developing and maintaining the company-wide Kafka platform •Kafka contributor •KIP-764: Configurable backlog size for creating Acceptor •Patch for KAFKA-13572
•We put continuous engineering effort into making our cluster highly performant and reliable •e.g. •"Reliability Engineering Behind The Most Trusted Kafka Platform" (LINE DEVELOPER DAY 2019) •"Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP" (LINE DEVELOPER DAY 2021)
[Diagram: Producer → Request Queue → Request Handler Threads → Response Queue] •An increase in RequestQueue time implies that the request handler threads were busy and slow to dequeue requests from the request queue
•Suspect 1: Too many requests were sent •=> No. We saw no spike in the request count •Suspect 2: Requests took a long time to process •=> Possible. At the time of the incident, we observed high utilization from a specific client, which implies that requests from that client took a long time to process
•async-profiler: a low-overhead sampling profiler •https://github.com/jvm-profiling-tools/async-profiler •We managed to take a profile when the phenomenon recurred •Use wall-clock profiling, which samples all threads regardless of whether they consume CPU time •Profile each thread separately •profiler.sh -d 30 -e wall -t -f profile.html $KAFKA_PID
•This implies that some thread was executing a time-consuming frame while holding the lock •We want to identify that thread and that frame •We need a timeline of thread dumps [Diagram: per-thread sample timelines for request-handler-0, request-handler-1, ... showing frames such as pthread_cond_wait and some_blocking_method]
•async-profiler can also dump the profile in JFR (Java Flight Recorder) format •A series of profile samples, each consisting of •Timestamp •Thread •Thread state •Stack trace •=> We can construct a timeline of thread dumps from the JFR output •profiler.sh -d 30 -e wall -t -f profile.jfr $KAFKA_PID
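As an illustration, here is a minimal sketch of reading such a JFR file with the JDK's `jdk.jfr.consumer` API. It assumes the wall-clock samples are recorded as `jdk.ExecutionSample` events with the sampled thread in the "sampledThread" field (event and field names may differ across async-profiler versions); the class name `JfrTimeline` is made up. It prints timestamped per-thread stack traces that can then be grouped into a thread-dump timeline.

```java
import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedThread;
import jdk.jfr.consumer.RecordingFile;

// Reads a profile.jfr and prints each sample's timestamp, thread, and stack
// trace, which can be grouped by thread to build a timeline of thread dumps.
public class JfrTimeline {
    public static void main(String[] args) throws Exception {
        try (RecordingFile recording = new RecordingFile(Path.of(args[0]))) {
            while (recording.hasMoreEvents()) {
                RecordedEvent event = recording.readEvent();
                // Assumption: wall-clock samples show up as jdk.ExecutionSample
                if (!"jdk.ExecutionSample".equals(event.getEventType().getName())
                        || event.getStackTrace() == null) {
                    continue;
                }
                RecordedThread thread = event.getThread("sampledThread");
                System.out.printf("%s [%s]%n",
                        event.getStartTime(),
                        thread == null ? "unknown" : thread.getJavaName());
                event.getStackTrace().getFrames().forEach(frame ->
                        System.out.printf("  at %s.%s%n",
                                frame.getMethod().getType().getName(),
                                frame.getMethod().getName()));
            }
        }
    }
}
```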
•jfrv: a viewer for async-profiler's JFR output •https://ocadaruma.github.io/jfrv/#/execution-sample [Screenshot: thread list, sample timeline per thread, and the thread dump for the clicked sample]
•We found that `write` sometimes blocked while other request handlers were waiting to acquire the lock •This could be the direct cause of the busy request handlers, i.e. •One request handler thread acquired the lock to write produced data to the log segment => •`write` blocked somehow •Meanwhile, produce requests kept arriving for the same partition, but all of them got blocked
•However, `write` is not supposed to block in general •In Linux, `write` just copies data to a `page` in memory (the page cache) •The kernel's writeback task writes the data to the actual disk asynchronously [Diagram: the request-handler writes produced data to a page; the writeback task writes it to the HDD asynchronously]
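To illustrate the point, here is a small sketch (not from the talk; class name and sizes are arbitrary) that times buffered writes versus an explicit flush: the `FileChannel.write` calls normally return as soon as the data is copied into the page cache, while `force` has to wait for the data to reach the device.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// write() returns once data is copied into the page cache; force() is what
// actually waits for the kernel to flush the data to the device.
public class WriteVsFlush {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of(args[0]),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer data = ByteBuffer.wrap(new byte[8 * 1024 * 1024]); // 8 MiB
            long t0 = System.nanoTime();
            while (data.hasRemaining()) {
                ch.write(data); // usually fast: copies into the page cache only
            }
            long t1 = System.nanoTime();
            ch.force(false); // waits for writeback to the device
            long t2 = System.nanoTime();
            System.out.printf("write: %d ms, force: %d ms%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        }
    }
}
```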
•`sendfile` is a system call that copies data from one file descriptor to another •Kafka uses `sendfile` to send records directly to the consumer, avoiding a copy into user space [Diagram: network threads serve a consumer's fetch request by `sendfile`-ing data from the filesystem to the socket]
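For illustration, a sketch of the call shape: on Linux, `FileChannel.transferTo` to a socket channel is backed by `sendfile(2)`, which is how log data can flow from the page cache to the socket without an intermediate copy in user space. The class and helper names below are made up, not Kafka's actual code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    // Transfers `count` bytes of the log file starting at `position` to the
    // consumer's socket. On Linux this uses sendfile(2) under the hood, so the
    // data never passes through a user-space buffer.
    static long sendLogSlice(FileChannel log, long position, long count, SocketChannel socket)
            throws IOException {
        long sent = 0;
        while (sent < count) {
            sent += log.transferTo(position + sent, count - sent, socket);
        }
        return sent;
    }
}
```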
•Usually, consumers read the latest data from the tail of the log •That data is "hot", so it likely resides in the page cache (memory) => very fast [Diagram: the producer writes to pages at the tail of the log segment file; the consumer reads the same pages]
•However, some consumers read old data to catch up •If the data about to be copied by `sendfile` is not in the page cache, it has to be read from the device (HDD), which is significantly slower •refs: https://logmi.jp/tech/articles/320330 "Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages" [Diagram: the producer writes to pages at the tail of the log segment file; a catch-up consumer reads older pages that may no longer be cached]
•We wrote an experiment to check whether this hypothesis is possible •https://gist.github.com/ocadaruma/691a3527fe65bf38bc01aeca0dbd3416 [Diagram: 1. evict the file from the page cache, 2. the reader calls `sendfile`, 3. the writer calls `write` on the same file]
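A rough sketch of the same idea (not the actual gist; class name and timings are placeholders): one thread `sendfile`s the file to a local socket via `transferTo` while another thread times an appending `write` to the same file. The page-cache eviction (step 1) and the device delay (dm-delay) from the experiment have to be set up outside this code.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Reader thread: sendfile (transferTo) the file to a local socket.
// Writer thread: time an append to the same file while the transfer runs.
public class SendfileWriteContention {
    public static void main(String[] args) throws Exception {
        Path path = Path.of(args[0]);

        // Local socket pair so transferTo takes the sendfile path.
        ServerSocketChannel server = ServerSocketChannel.open()
                .bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        SocketChannel sink = SocketChannel.open(server.getLocalAddress());
        SocketChannel drainSide = server.accept();
        Thread drain = new Thread(() -> drainForever(drainSide));
        drain.setDaemon(true);
        drain.start();

        Thread reader = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
                long start = System.currentTimeMillis();
                long pos = 0, size = ch.size();
                while (pos < size) {
                    pos += ch.transferTo(pos, size - pos, sink); // sendfile(2)
                }
                System.out.printf("Took %d ms to transfer%n", System.currentTimeMillis() - start);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });

        Thread writer = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
                long start = System.currentTimeMillis();
                ch.write(ByteBuffer.wrap(new byte[4096])); // may block on the inode lock
                System.out.printf("Took %d ms to write%n", System.currentTimeMillis() - start);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });

        reader.start();
        Thread.sleep(10); // let the transfer start first
        writer.start();
        reader.join();
        writer.join();
    }

    // Keeps reading from the socket so the transfer can make progress.
    private static void drainForever(SocketChannel ch) {
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        try {
            while (ch.read(buf) >= 0) {
                buf.clear();
            }
        } catch (IOException ignored) {
        }
    }
}
```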
•We ran the experiment with delay injected into the device (using dm-delay) •Writing to the file blocked until the `sendfile` returned
$ java App3 /path/to/test.bin
[2022-08-08T15:38:07.702003] Took 500 ms to transfer
[2022-08-08T15:38:07.702033] Took 450 ms to write
...
•We confirmed that `sendfile` and `write` can actually contend •Next, we needed to identify the root cause of the contention •After reading the Linux kernel source code, we found that both paths require a lock on the inode • xfs_file_splice_read - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L336 • xfs_file_buffered_aio_write - https://github.com/torvalds/linux/blob/v3.10/fs/xfs/xfs_file.c#L723
•The problem occurred when the active (last) log segment was evicted from the page cache (produced data is appended to the active segment) •Hence, blocking `sendfile` calls (for catch-up consumers) happened while there were `write` calls (for producers) •=> Set a smaller log segment size to keep the active segment in the page cache •`log.segment.bytes` config
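If the change is applied per topic rather than broker-wide, it can also be done through Kafka's Admin API; here is a minimal sketch. The topic name, bootstrap server, and the 256 MiB value are placeholders, and the topic-level counterpart of the broker config `log.segment.bytes` is `segment.bytes`.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Sets a smaller segment size for one topic so its active segment stays small
// enough to remain in the page cache.
public class ShrinkSegment {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("segment.bytes", String.valueOf(256 * 1024 * 1024)), // 256 MiB
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```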
•Note that a smaller segment size means more segment files •Kafka opens file descriptors for all of them •It also potentially creates memory maps for them •So we adjusted `nofile` and `vm.max_map_count` accordingly •Result: •We have rarely observed performance degradations from the same cause since