Reliability Engineering Behind The Most Trusted Kafka Platform

Reliability Engineering Behind The Most Trusted Kafka Platform

Yuto Kawamura
LINE Z Part Team Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/A1-4

Be4518b119b8eb017625e0ead20f8fe7?s=128

LINE DevDay 2019

November 20, 2019
Tweet

Transcript

  1. 2019 DevDay Reliability Engineering Behind the Most Trusted Kafka Platform

    > Yuto Kawamura > LINE Z Part Team Software Engineer
  2. Speaker > Yuto Kawamura > Senior Software Engineer @ Z

    Part > Lead of Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past speaking • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018
  3. Apache Kafka At LINE

  4. > ❌ How we use Kafka (or How great Kafka

    is) Today’s Message > ❌ Best Practices for Using Kafka > ✅ Our Reliability Engineering Philosophy, Applicable Techniques of Reliability Engineering
  5. > High-scalability/High-performance distributed system > Default data persistency > Streaming

    Data Middleware Apache Kafka > Pub-Sub model for distributing data > Written in Scala, run on JVM
  6. > Background asynchronous task processing > Event propagation among services

    Typical Usage Service A Service B Service C Service A WebApp Service A
 Processor Queueing Pub-Sub
  7. Daily Inflow Messages 360 Billion User Services Over 60 Daily

    IN+OUT Data 550 TB Kafka Is One of Most Major Middleware at LINE
  8. Service Level Objective (SLO) Cluster Availability 99.999% 99th Percentile Response

    Time < 40ms
  9. Challenges Combined 360 Billion / Day Traffic Scale < 40ms

    Response Time SLO Extreme Challenge
  10. > SLO Visualization Reliability Engineering Kinds > Automation • Workflow

    consistency • Free us from bothering work to concentrate on advanced works > Architecture Proposal • Provide library • Demonstrate Architecture for higher system availability/performance > Troubleshooting and Preventive Engineering
  11. SLO Visualization

  12. > API response time • For latency > API error

    rate • For availability SLI To Measure SLO > SLI = Service Level Indicator • A measurable value which represents the service performance directly > Normally measured in server (broker) side
  13. > Metrics from servers are not enough SLI Measurement in

    Distributed Systems > Message Delivery Latency through Kafka = • Response time for Producer request • + Time taken for fail-over, retries, metadata update • + Response time for Consumer request • + Time taken for Consumer group coordination > Client library works cooperatively to server clusters for higher availability • retry • failover
  14. > We can’t use metrics from actual service clients •

    Business logic affects measurements • We (platform provider) must define the right model (config, usage) of “clients” Measure SLI From Client’s Viewpoint > Measure performance of entire system/flow • What clients actually see after all attempts inside client library
  15. > Measure Performance of cluster from client (user)’s viewpoint Prober

    > A dedicated Kafka clients to measure performance of entire stack > Run Distributed with multiple instances to • Simulate realistic usage of Kafka clients • Latency by fail-over • Distributed partitions assignment • To not lost metrics during restart (rolling restart)
  16. Prober - Architecture Redis Prober Prober Kafka Cluster Broker 1

    Broker 1 Broker 1 Broker 2 Broker 1 Broker 3 Writer Reader Reporter 1. Produce 2. Report Write 3. Consume 4. Report Read Failover Duration
  17. Redis Prober - Architecture Prober Writer Reader Reporter W2P W1P

    W3P … W2C W1C W3C … Write Buckets Read Buckets W1P - W2C
 ————— = Loss Rate W1P Produce Consume Prometheus Metric Time
  18. Prober - Scenario Configuration > Configure Multiple “Scenarios” • Scenario

    = Client Impl x Client Config x Throughput/Latency Config • Measure Consistent Metrics for different setup / different requirement > Add 15 lines of YAML, get all measurements
  19. SLI Dashboard

  20. Troubleshooting

  21. Despite of our Engineering Effort… > Trouble happens • Again,

    large scale + high service level requirement = Trouble Factory > What makes the last difference is attitude against troubles difficult to solve • Try to clear out essential root cause? • Or just workaround and forget?
  22. Fundamental Solution Making > Our Team’s philosophy is to always

    persuade “Fundamental Solution” • E.g, given GC program, • Try changing JVM options around and see what happens? • Or do full-inspection to clear out why GC’s taking so long? > Fundamental Solution Making is better than workaround • Accumulates deep knowledge inside us • Applicable for similar but different issues in the same domain • Leaves improvements permanently
  23. Troubleshooting - an Example

  24. Symptom > API Response time degradation in 99th %ile •

    Violates SLO! > Drop out from cluster by Zookeeper session timeout • If *2 then OUTAGE!
  25. Observation - Thread Utilization Increase > Graph is showing threads

    "idle ratio" so util is reverse > Equally for all threads • Request handler threads • Network threads
  26. Observation - JVM Pause Time Increase 2019-05-13T01:42:44.936+0900: 1423318.149: Total Time

    for Which Application Threads Were Stopped: 1.1904424 Seconds, Stopping Threads Took: 1.1883390 Seconds
  27. Study - “Stopping Threads” Rarely Increase http://psy-lob-saw.blogspot.com/2015/12/safepoints.html > "safepoint" :

    Moments in JVM that it stops all app threads to safely execute unsafe ops (GC, jstack, etc…) > "safepoint sync" : Moments that JVM is waiting all threads to voluntarily enter pause • => This is the increasing part
  28. Observation - Disk Access Time Increase > At the same

    time we saw the symptom
  29. Guess What’s Going On > Long-running safepoint sync > +

    Slow response time from disks > = JVM performs slow IO inside safepoint sync/operation ...? > In fact JVM writes logs on every safepoint occurrence
  30. Study - Does write(2) System Call Blocks? > False for

    the most of the cases > write(2) just enables "dirty flag" of the page > Kernel's background task periodically "writeback" dirty pages into block devices > In case calling write(2) is just about copying memory write(2) System Call Page Page Page Writeback Task Disk
  31. Measure write(2) System Call’s Duration > Using SystemTap script *1

    *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp > SystemTap : dynamic tracing tool • Enables to inspect events inside the Linux kernel • APIs for scripting probe kernel.function("sys_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function(“sys_write").return { duration = gettimeofday_us() - time_entry[tid()] ...
  32. Measure write(2) System Call’s Duration Sun May 19 00:32:03 2019,724

    SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
  33. So the Premise Has Broken > Even write(2) system calls

    for appending file content blocks in some situation > Then are we ready to start considering solution?
  34. > We don't know yet why write(2) can block occasionally

    > Without knowing what's going on underneath, we might take wrong option > => Try stack sampling when experiencing slow write(2) Fundamental Solution Born From Fundamental Understanding
  35. However… > We had to service out the broker from

    problematic machine because performance was getting worse daily > The problematic machine's partially-broken HDD has been replaced accidentally by maintenance • I made a kernel crash => Ask hard reboot => "Replaced broken HDD at that same time :)" #
  36. Lost Reproducible Environment but… > Don’t give up! > Two

    things to reproduce : • Write workload by Kafka broker • Slowing down disk device
  37. Write Workload by Kafka Broker > Tiny C program with

    measuring write(2) call duration
  38. Slowing Down Disk Device > Disk slowing down != write(2)

    slowing down • We're now investigating why write(2) slows down > Attempted to hook `generic_make_request` and inject delay to simulate slow disk with SystemTap • but failed... well SystemTap is designed for safe-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 … ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5]
  39. $Use kprobe Writing Kernel Module > kprobe : Linux kernel

    API that provides capability to hook arbitrary function's enter/return and insert code (SystemTap uses it as the backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
  40. Successfully Reproduced! > Found that `[jbd2/sda3-8]` kernel task (ext4's journaling

    daemon) is blocked > but couldn't explain why it affects `write(2)`'s processing time… Pid: 4390, comm: jbd2/sda3-8 Not tainted ... Call Trace: … [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>> [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120 [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0 [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2] … [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2] [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0
  41. Approaching From Another Side > Breakdown source of latency in

    write(2)'s call-graph using another SystemTap script *1 vfs_write() do_sync_write() ext4_file_write() rw_verify_area() probe kernel.function("vfs_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function("vfs_write").return { duration = gettimeofday_us() - time_entry[tid()] ... *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py
  42. Adding Suspicious Part One-by-One > … 14 attempts in total.

  43. Found It’s Waiting on Lock sys_write took 3890322 us in

    total:, find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317
  44. Analyze Lock Contention > Hooking lock acquisition/release points: `__lock_page` and

    `unlock_page` > Figured out that it's cascading issue Application Calling write(2) Writeback Task ext4’s Journaling Daemon Block Waiting To Lock Page Waiting Buffer to
 Become Writable
  45. Issue Overview

  46. Overview - GC Logging Inside Safepoint

  47. Overview - Writeback Task Blocked by Journaling

  48. Overview - Journaling Task Blocked by Slow Response From Disk

  49. Now Ready To Think Solution > Use filesystem other than

    ext4 or disable journaling > Use tmpfs as 1st storage to write GC log with realtime persistent backup > => Both could solved the problem
  50. Summary > Our definition of Reliability Engineering covers wide range

    of responsibility > SLI measurement • Client’s (user’s) viewpoint for SLI measurement in distributed systems > Troubleshooting / Optimization • Fundamental solution from Fundamental problem understanding • Accumulating knowledge • KAFKA-4614, KAFKA-7504 => IO in JVM pause, read/write block
  51. End of Presentation