
Reliability Engineering Behind The Most Trusted Kafka Platform


Yuto Kawamura
LINE Z Part Team Software Engineer

LINE DevDay 2019

November 20, 2019



  1. 2019 DevDay Reliability Engineering Behind the Most Trusted Kafka Platform > Yuto Kawamura > LINE Z Part Team Software Engineer
  2. Speaker > Yuto Kawamura > Senior Software Engineer @ Z Part > Lead of Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past speaking • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018
  3. Apache Kafka At LINE

  4. Today's Message > ❌ How we use Kafka (or how great Kafka is) > ❌ Best practices for using Kafka > ✅ Our reliability engineering philosophy and applicable techniques of reliability engineering
  5. Apache Kafka > Streaming data middleware > High-scalability/high-performance distributed system > Data persistency by default > Pub-Sub model for distributing data > Written in Scala, runs on the JVM
  6. Typical Usage > Background asynchronous task processing (Queueing: Service A WebApp → Service A Processor) > Event propagation among services (Pub-Sub: Service A → Service B, Service C)
  7. Kafka Is One of the Most Major Middleware at LINE > Daily inflow messages: 360 billion > User services: over 60 > Daily IN+OUT data: 550 TB
  8. Service Level Objective (SLO) > Cluster availability: 99.999% > 99th percentile response time: < 40 ms
  9. Challenges Combined > 360 billion/day traffic scale + < 40 ms response-time SLO = an extreme challenge
  10. Kinds of Reliability Engineering > SLO visualization > Automation • workflow consistency • frees us from tedious work so we can concentrate on advanced work > Architecture proposal • provide libraries • demonstrate architectures for higher system availability/performance > Troubleshooting and preventive engineering
  11. SLO Visualization

  12. SLI To Measure SLO > SLI = Service Level Indicator • a measurable value that directly represents service performance > API response time • for latency > API error rate • for availability > Normally measured on the server (broker) side
  13. SLI Measurement in Distributed Systems > Metrics from servers are not enough > Message delivery latency through Kafka = • response time of the Producer request • + time taken for fail-over, retries, and metadata updates • + response time of the Consumer request • + time taken for Consumer group coordination > The client library works cooperatively with the server cluster for higher availability • retry • failover
  14. Measure SLI From the Client's Viewpoint > We can't use metrics from actual service clients • business logic affects the measurements • we (the platform provider) must define the right model (config, usage) of "clients" > Measure the performance of the entire system/flow • what clients actually see after all attempts inside the client library
  15. Prober > Measures cluster performance from the client's (user's) viewpoint > A dedicated Kafka client that measures the performance of the entire stack > Runs distributed across multiple instances to • simulate realistic usage of Kafka clients • capture latency caused by fail-over • get distributed partition assignment • avoid losing metrics during restarts (rolling restart)
  16. Prober - Architecture > Components: Redis, Prober instances (Writer / Reader / Reporter), Kafka cluster (Broker 1, Broker 2, Broker 3) > Flow: 1. Produce 2. Report write 3. Consume 4. Report read > Also measures failover duration
  17. Prober - Architecture > The Reporter matches write buckets (W1P, W2P, W3P, …) against read buckets (W1C, W2C, W3C, …) over time, via Redis, and exports Prometheus metrics > Per-bucket loss rate: (W1P - W1C) / W1P = Loss Rate
  18. Prober - Scenario Configuration > Configure multiple "scenarios" • scenario = client impl x client config x throughput/latency config • measures consistent metrics across different setups / different requirements > Add 15 lines of YAML, get all measurements
  19. SLI Dashboard

  20. Troubleshooting

  21. Despite Our Engineering Effort… > Trouble happens • again, large scale + high service level requirements = trouble factory > What makes the difference in the end is our attitude toward troubles that are difficult to solve • do we dig out the essential root cause? • or just work around it and forget?
  22. Fundamental Solution Making > Our team's philosophy is to always pursue the "fundamental solution" • e.g., given a GC problem, • do we try changing JVM options around and see what happens? • or do a full inspection to find out why GC is taking so long? > Fundamental solution making is better than workarounds • it accumulates deep knowledge inside us • it applies to similar-but-different issues in the same domain • it leaves permanent improvements
  23. Troubleshooting - an Example

  24. Symptom > API response time degradation at the 99th percentile • violates the SLO! > Brokers drop out of the cluster due to ZooKeeper session timeouts • if it happens to two brokers at once, it's an OUTAGE!
  25. Observation - Thread Utilization Increase > The graph shows each thread's "idle ratio", so utilization is the inverse > Equally across all threads • request handler threads • network threads
  26. Observation - JVM Pause Time Increase > 2019-05-13T01:42:44.936+0900: 1423318.149: Total time for which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds
  27. Study - "Stopping Threads" Rarely Increases http://psy-lob-saw.blogspot.com/2015/12/safepoints.html > "safepoint": a moment when the JVM stops all application threads to safely execute unsafe operations (GC, jstack, etc…) > "safepoint sync": the period in which the JVM waits for all threads to voluntarily enter the pause • ⇒ this is the part that is increasing
  28. Observation - Disk Access Time Increase > At the same time we saw the symptom
  29. Guess What's Going On > Long-running safepoint sync > + slow response times from disks > = the JVM performs slow IO inside the safepoint sync/operation...? > In fact, the JVM writes a log entry on every safepoint occurrence
  30. Study - Does the write(2) System Call Block? > Not in most cases > write(2) just sets the "dirty" flag on the page-cache page > The kernel's background writeback task periodically writes dirty pages back to the block device > In that case, calling write(2) is just a memory copy (diagram: write(2) system call → pages → writeback task → disk)
  31. Measure write(2) System Call's Duration > Using a SystemTap script *1 > SystemTap: a dynamic tracing tool • enables inspecting events inside the Linux kernel • provides APIs for scripting probes
    probe kernel.function("sys_write") { time_entry[tid()] = gettimeofday_us() ... }
    probe kernel.function("sys_write").return { duration = gettimeofday_us() - time_entry[tid()] ... }
    *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp
  32. Measure write(2) System Call's Duration
    Sun May 19 00:32:03 2019,724 SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms
    Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel]
    Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel]
    0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so]
    0x331fc71943 : [email protected]@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so]
    0x331fc72ef5 : [email protected]@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so]
    0x331fc720f8 : [email protected]@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so]
    0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so]
    0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosureP10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    (snip)
    0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
    0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
  33. So the Premise Has Broken > Even a write(2) system call that appends file content can block in some situations > So, are we ready to start considering a solution?
  34. Fundamental Solutions Are Born From Fundamental Understanding > We don't yet know why write(2) can block occasionally > Without knowing what's going on underneath, we might take the wrong option > ⇒ Try stack sampling when experiencing a slow write(2)
  35. However… > We had to service out the broker on the problematic machine because performance was getting worse daily > The problematic machine's partially-broken HDD was then replaced accidentally during maintenance • I caused a kernel crash ⇒ asked for a hard reboot ⇒ "We replaced the broken HDD at the same time :)"
  36. Lost the Reproduction Environment, but… > Don't give up! > Two things to reproduce: • the write workload of the Kafka broker • the slowed-down disk device
  37. Write Workload by Kafka Broker > A tiny C program that measures each write(2) call's duration
  38. Slowing Down the Disk Device > Disk slowing down != write(2) slowing down • we're investigating exactly why write(2) slows down > Attempted to hook `generic_make_request` with SystemTap and inject a delay to simulate a slow disk • but failed... SystemTap is designed for safe, bounded-overhead execution
    $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 …
    ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10
    Pass 5: run completed in 0usr/30sys/914real ms.
    Pass 5: run failed. [man error::pass5]
  39. Writing a Kernel Module Using kprobes > kprobe: a Linux kernel API that can hook an arbitrary function's entry/return and insert code (SystemTap uses it as its backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
  40. Successfully Reproduced! > Found that the `[jbd2/sda3-8]` kernel task (ext4's journaling daemon) is blocked > but couldn't yet explain why it affects `write(2)`'s processing time…
    Pid: 4390, comm: jbd2/sda3-8 Not tainted ...
    Call Trace:
    … [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>>
    [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120
    [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0
    [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2]
    …
    [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2]
    [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0
  41. Approaching From Another Side > Break down the sources of latency in write(2)'s call graph (vfs_write(), do_sync_write(), ext4_file_write(), rw_verify_area(), …) using another SystemTap script *1
    probe kernel.function("vfs_write") { time_entry[tid()] = gettimeofday_us() ... }
    probe kernel.function("vfs_write").return { duration = gettimeofday_us() - time_entry[tid()] ... }
    *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py
  42. Adding Suspicious Part One-by-One > … 14 attempts in total.

  43. Found It's Waiting on a Lock > sys_write took 3890322 us in total: find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317
  44. Analyze Lock Contention > Hooked the lock acquisition/release points: `__lock_page` and `unlock_page` > Figured out that it's a cascading issue: the application calling write(2) waits to lock a page held by the writeback task, which in turn waits for the buffer to become writable, blocked behind ext4's journaling daemon
  45. Issue Overview

  46. Overview - GC Logging Inside Safepoint

  47. Overview - Writeback Task Blocked by Journaling

  48. Overview - Journaling Task Blocked by Slow Response From Disk

  49. Now Ready To Think About Solutions > Use a filesystem other than ext4, or disable journaling > Use tmpfs as the first-level storage for the GC log, with a real-time persistent backup > ⇒ Either could solve the problem
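As an illustration of the tmpfs option (the paths and the backup mechanism here are hypothetical, not LINE's actual setup), the idea is roughly:

```
# Write the GC log to a tmpfs mount so that logging inside a safepoint
# never blocks on the block device:
-Xloggc:/dev/shm/kafka/kafkaServer-gc.log

# ...and periodically copy it to persistent storage, e.g. from cron:
* * * * * rsync -a /dev/shm/kafka/ /var/log/kafka/gc-backup/
```

Since /dev/shm is a tmpfs mount on most Linux distributions, the safepoint-time write(2) becomes a pure memory operation and never touches ext4's journal; the trade-off is that the most recent log lines can be lost on power failure, hence the periodic backup.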
  50. Summary > Our definition of reliability engineering covers a wide range of responsibilities > SLI measurement • take the client's (user's) viewpoint for SLI measurement in distributed systems > Troubleshooting / optimization • fundamental solutions from fundamental problem understanding • accumulating knowledge • KAFKA-4614, KAFKA-7504 ⇒ IO in JVM pause, read/write blocking
  51. End of Presentation