Kafka Broker performance degradation by mysterious JVM pause

Kafka broker performance degradation by mysterious JVM pause

About speaker — Name: Yuto Kawamura — Role: Lead of
team providing company-wide Kafka platform @ LINE — Reliability/Performance engineering + Development for Kafka clusters receiving over 260 billion messages /day — Pubs: — Kafka Summit 2017 6 — Kafka Summit 2018 7 — LINE DEVELOPER DAY 2018 8 8 https://gist.github.com/kawamuray/ a27aaa2d0668d0050537651978735321#file-kprobe_delay-c 7 https://kafka-summit.org/sessions/kafka-multi-tenancy-160-billion- daily-messages-one-shared-cluster-line/ 6 https://kafka-summit.org/sessions/single-data-hub-services-feed-100- billion-messages-per-day/

Phenomenon — Response time degradation for 99th %ile Produce —
Drop out from cluster expiring ZK session timeout — Violates our SLO : 99th %ile Produce < 40ms

Observation: Threads utilization increase — For both network threads and
IO threads — For all of their threads

Observation: GC time (STW) increase 2019-05-13T01:42:44.936+0900: 1423318.149: Total time for
which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds

"Stopping threads" part rarely increases 1 — "safepoint" = Moments
in JVM that it stops all app threads to safely execute unsafe ops (GC, jstack, etc...) — "safepoint sync" = Moments that JVM is waiting all threads to voluntarily enter pause 1 http://psy-lob-saw.blogspot.com/2015/12/safepoints.html

Observation: Disk write time increase

Guess what's going on Long-running safepoint sync + Slow response
time from disks = JVM performs slow IO inside safepoint sync/ operation ...? — At this point we were aware of that our machine's RAID controller likely has a performance problem

Suspecting GC logging - Does (append) write(2) blocks? It's true
sometimes but false for the most cases. — write(2) just enables "dirty flag" of the page — Kernel's background task periodically "writeback" dirty pages into block devices => Normally calling write(2) is just about copying memory

Measure write(2) system call's duration Using SystemTap script2. Sun May
19 00:32:03 2019,724 SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so] 2 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp

So our premise had broken... Even write(2) for appending write
can blocks. Then are we ready to start considering solution?

Not YET — We don't know why write(2) can block
occasionally — Without knowing what's going on underneath, we might take wrong option => Try stack sampling when experiencing slow write(2)

However... 1. We had to service out the broker from
problematic machine because performance was getting worse daily 2. The problematic machine's partially-broken HDD has been replaced accidentally by maintenance — I made a kernel crash => Ask hard reboot => "Replaced broken HDD at that same time :)"

Lost the reproducible environment... but never give up! Reproduce write
workload: tiny C program3 with measuring write(2) call duration. (snip) while (1) { if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1) { perror("clock_gettime"); continue; } written = write(fd, content, sizeof(content)); if (clock_gettime(CLOCK_MONOTONIC, &te) == -1) { perror("clock_gettime"); continue; } (snip) 3 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-justwrite-c

Reproduce slow response from disk — Disk slowing down !=
write(2) slowing down — We're now investigating why write(2) slows down! — Attempted to hook generic_make_request4 and inject delay to simulate slow disk with SystemTap — but failed... well SystemTap is designed for guaranteeing safe-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 (snip) ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5] 4 A generic gateway function in Linux source code for issueing IO request for block devices

! Use kprobe directly writing kernel module 8 kprobe =
Linux kernel API that provides capability to hook arbitrary function's enter/return and insert code (SystemTap uses this as a backend) static struct kprobe kp = { .symbol_name = "generic_make_request", }; (snip) static int handler_pre(struct kprobe *p, struct pt_regs *regs) { (snip) if (biop->bi_rw & WRITE && bio_rw_flagged(biop, BIO_RW_SYNCIO) && MAJOR(devn) == TARGET_MAJOR && MINOR(devn) == TARGET_MINOR) { if (sampling++ % 3 == 0) { printk(KERN_INFO "throttling\n"); mdelay(300); (snip) static int __init kprobe_init(void) { (snip) kp.pre_handler = handler_pre; ret = register_kprobe(&kp); 8 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c

Successfully reproduced and gotcha! — Found that [jbd2/sda3-8] kernel task
(ext4's journaling daemon) is blocked — but couldn't explain why it affects write(2)'s processing time... Pid: 4390, comm: jbd2/sda3-8 Not tainted ... Call Trace: (snip) [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>> [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120 [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0 [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2] (snip) [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2] [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0

Try approaching from another side Breakdown source of latency in
write(2)'s callgraph. Using another SystemTap script5. probe %(probepoints_entry)s { if (in_rec()) { frmid = alloc_frame_id() entry_times[tid(), frmid] = gettimeofday_us() (snip) probe %(probepoints_return)s { if (in_rec()) { frmid = dealloc_frame_id() tin = entry_times[tid(), frmid] tout = gettimeofday_us() elapsed = tout - tin durations[tid()] .= sprintf(",%%s=%%d(in: %%d, out: %%d)", frmid, elapsed, tin, tout) 5 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py

Adding suspicious part one-by-one (reading kernel source)... ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai
o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4_write_begin' ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4/ext4_write_begin' (snip) ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4/ext4_da_write_beg in,iov_iter_copy_from_user_atomic,ext4/ext4_da_write_end,ext4/ext4_journal_start_sb,grab_cache_page_ write_begin,block_write_begin,ext4/__ext4_journal_stop,jbd2/jbd2_journal_start,jbd2/start_this_handl e,find_lock_page,find_get_page' ./ddown..py ... 'vfs_write,grab_cache_page_write_begin,find_lock_page,find_ge t_page,lock_page' ./ddown..py ... 'vfs_write,grab_cache_page_write_begin,find_lock_page,find_ge t_page,__lock_page' ... 14 lines in total.

Found it's waiting on lock sys_write took 3890322 us in
total:, find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317

Analyzing lock contention Hooking lock acquisition/release points: __lock_page and unlock_page.
Figured out that it's cascading issue: [processing calling write(2)] -> [flush-8:0] writeback task (waiting page lock) -> [jbd2/sda3-8] journaling daemon (waiting journal_head's buffer state)

Overview of the issue

Overview - GC logging inside safepoint

Overview - Kernel's writeback task blocked by journaling

Overview - Ext4's journaling task blocked long by slow disk
response

Solution? — Use other fs or disable journaling — Use
tmpfs as a 1st storage for GC logs with real-time persistent backup java ... -Xgclog=/path/to/tmpfs # in background tail -F /path/to/tmpfs/gclog-latest > /path/to/persistent/gclog-current # also in scheduled job jcmd to roll gclog in reasonably short moment (like 1 hour) to prevent tmpfs filled up

Summary — When (at least) below conditions are met, write(2)
can (very rarely) blocks — Using ext4 (maybe other journaling fs too) as FS — Underlying disk partially broken and responding slowly — Assuming appending-write(2) to be non-blocking is, — still safe for normal applications — not safe for performance critical part of application affects widely (e.g, JVM's safepoint)

End of presentation.

Kafka Broker performance degradation by mysteri...

Kafka Broker performance degradation by mysterious JVM pause

LINE Developers

More Decks by LINE Developers

Other Decks in Technology

Featured

Transcript