Kafka Broker performance degradation by mysterious JVM pause

Kafka Broker performance degradation by mysterious JVM pause

"Kafka Broker performance degradation by mysterious JVM pause"
Yuto Kawamura (LINE Corporation)

Presentation at Apache Kafka Meetup Japan #7 @LINE (17th July 2019)


LINE Developers

July 17, 2019


  1. Kafka broker performance degradation by mysterious JVM pause

  2. About speaker — Name: Yuto Kawamura — Role: Lead of

    team providing company-wide Kafka platform @ LINE — Reliability/Performance engineering + Development for Kafka clusters receiving over 260 billion messages /day — Pubs: — Kafka Summit 2017 6 — Kafka Summit 2018 7 — LINE DEVELOPER DAY 2018 8 8 https://gist.github.com/kawamuray/ a27aaa2d0668d0050537651978735321#file-kprobe_delay-c 7 https://kafka-summit.org/sessions/kafka-multi-tenancy-160-billion- daily-messages-one-shared-cluster-line/ 6 https://kafka-summit.org/sessions/single-data-hub-services-feed-100- billion-messages-per-day/
  3. Phenomenon — Response time degradation for 99th %ile Produce —

    Drop out from cluster expiring ZK session timeout — Violates our SLO : 99th %ile Produce < 40ms
  4. Observation: Threads utilization increase — For both network threads and

    IO threads — For all of their threads
  5. Observation: GC time (STW) increase 2019-05-13T01:42:44.936+0900: 1423318.149: Total time for

    which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds
  6. "Stopping threads" part rarely increases 1 — "safepoint" = Moments

    in JVM that it stops all app threads to safely execute unsafe ops (GC, jstack, etc...) — "safepoint sync" = Moments that JVM is waiting all threads to voluntarily enter pause 1 http://psy-lob-saw.blogspot.com/2015/12/safepoints.html
  7. Observation: Disk write time increase

  8. Guess what's going on Long-running safepoint sync + Slow response

    time from disks = JVM performs slow IO inside safepoint sync/ operation ...? — At this point we were aware of that our machine's RAID controller likely has a performance problem
  9. Suspecting GC logging - Does (append) write(2) blocks? It's true

    sometimes but false for the most cases. — write(2) just enables "dirty flag" of the page — Kernel's background task periodically "writeback" dirty pages into block devices => Normally calling write(2) is just about copying memory
  10. Measure write(2) system call's duration Using SystemTap script2. Sun May

    19 00:32:03 2019,724 SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so] 2 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp
  11. So our premise had broken... Even write(2) for appending write

    can blocks. Then are we ready to start considering solution?
  12. Not YET — We don't know why write(2) can block

    occasionally — Without knowing what's going on underneath, we might take wrong option => Try stack sampling when experiencing slow write(2)
  13. However... 1. We had to service out the broker from

    problematic machine because performance was getting worse daily 2. The problematic machine's partially-broken HDD has been replaced accidentally by maintenance — I made a kernel crash => Ask hard reboot => "Replaced broken HDD at that same time :)"
  14. Lost the reproducible environment... but never give up! Reproduce write

    workload: tiny C program3 with measuring write(2) call duration. (snip) while (1) { if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1) { perror("clock_gettime"); continue; } written = write(fd, content, sizeof(content)); if (clock_gettime(CLOCK_MONOTONIC, &te) == -1) { perror("clock_gettime"); continue; } (snip) 3 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-justwrite-c
  15. Reproduce slow response from disk — Disk slowing down !=

    write(2) slowing down — We're now investigating why write(2) slows down! — Attempted to hook generic_make_request4 and inject delay to simulate slow disk with SystemTap — but failed... well SystemTap is designed for guaranteeing safe-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 (snip) ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5] 4 A generic gateway function in Linux source code for issueing IO request for block devices
  16. ! Use kprobe directly writing kernel module 8 kprobe =

    Linux kernel API that provides capability to hook arbitrary function's enter/return and insert code (SystemTap uses this as a backend) static struct kprobe kp = { .symbol_name = "generic_make_request", }; (snip) static int handler_pre(struct kprobe *p, struct pt_regs *regs) { (snip) if (biop->bi_rw & WRITE && bio_rw_flagged(biop, BIO_RW_SYNCIO) && MAJOR(devn) == TARGET_MAJOR && MINOR(devn) == TARGET_MINOR) { if (sampling++ % 3 == 0) { printk(KERN_INFO "throttling\n"); mdelay(300); (snip) static int __init kprobe_init(void) { (snip) kp.pre_handler = handler_pre; ret = register_kprobe(&kp); 8 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
  17. Successfully reproduced and gotcha! — Found that [jbd2/sda3-8] kernel task

    (ext4's journaling daemon) is blocked — but couldn't explain why it affects write(2)'s processing time... Pid: 4390, comm: jbd2/sda3-8 Not tainted ... Call Trace: (snip) [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>> [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120 [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0 [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2] (snip) [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2] [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0
  18. Try approaching from another side Breakdown source of latency in

    write(2)'s callgraph. Using another SystemTap script5. probe %(probepoints_entry)s { if (in_rec()) { frmid = alloc_frame_id() entry_times[tid(), frmid] = gettimeofday_us() (snip) probe %(probepoints_return)s { if (in_rec()) { frmid = dealloc_frame_id() tin = entry_times[tid(), frmid] tout = gettimeofday_us() elapsed = tout - tin durations[tid()] .= sprintf(",%%s=%%d(in: %%d, out: %%d)", frmid, elapsed, tin, tout) 5 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py
  19. Adding suspicious part one-by-one (reading kernel source)... ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai

    o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4_write_begin' ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4/ext4_write_begin' (snip) ./ddown..py ... 'vfs_write,do_sync_write,ext4/ext4_file_write,generic_file_ai o_write,__generic_file_aio_write,file_update_time,generic_file_buffered_write,ext4/ext4_da_write_beg in,iov_iter_copy_from_user_atomic,ext4/ext4_da_write_end,ext4/ext4_journal_start_sb,grab_cache_page_ write_begin,block_write_begin,ext4/__ext4_journal_stop,jbd2/jbd2_journal_start,jbd2/start_this_handl e,find_lock_page,find_get_page' ./ddown..py ... 'vfs_write,grab_cache_page_write_begin,find_lock_page,find_ge t_page,lock_page' ./ddown..py ... 'vfs_write,grab_cache_page_write_begin,find_lock_page,find_ge t_page,__lock_page' ... 14 lines in total.
  20. Found it's waiting on lock sys_write took 3890322 us in

    total:, find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317
  21. Analyzing lock contention Hooking lock acquisition/release points: __lock_page and unlock_page.

    Figured out that it's cascading issue: [processing calling write(2)] -> [flush-8:0] writeback task (waiting page lock) -> [jbd2/sda3-8] journaling daemon (waiting journal_head's buffer state)
  22. Overview of the issue

  23. Overview - GC logging inside safepoint

  24. Overview - Kernel's writeback task blocked by journaling

  25. Overview - Ext4's journaling task blocked long by slow disk

  26. Solution? — Use other fs or disable journaling — Use

    tmpfs as a 1st storage for GC logs with real-time persistent backup java ... -Xgclog=/path/to/tmpfs # in background tail -F /path/to/tmpfs/gclog-latest > /path/to/persistent/gclog-current # also in scheduled job jcmd to roll gclog in reasonably short moment (like 1 hour) to prevent tmpfs filled up
  27. Summary — When (at least) below conditions are met, write(2)

    can (very rarely) blocks — Using ext4 (maybe other journaling fs too) as FS — Underlying disk partially broken and responding slowly — Assuming appending-write(2) to be non-blocking is, — still safe for normal applications — not safe for performance critical part of application affects widely (e.g, JVM's safepoint)
  28. End of presentation.