Reliability Engineering Behind The Most Trusted Kafka Platform

Yuto Kawamura
LINE Z Part Team Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/A1-4

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay Reliability Engineering Behind the Most Trusted Kafka Platform

    > Yuto Kawamura > LINE Z Part Team Software Engineer
  2. Speaker > Yuto Kawamura > Senior Software Engineer @ Z Part > Lead of the Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past talks • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018
  3. Today's Message > ❌ How we use Kafka (or how great Kafka is) > ❌ Best practices for using Kafka > ✅ Our reliability engineering philosophy and applicable reliability engineering techniques
  4. Apache Kafka > High-scalability / high-performance distributed system > Data persistence by default > Streaming data middleware > Pub-Sub model for distributing data > Written in Scala, runs on the JVM
  5. Typical Usage > Background asynchronous task processing • Queueing: Service A WebApp → Kafka → Service A Processor > Event propagation among services • Pub-Sub: Service A → Kafka → Service B, Service C
  6. Kafka Is One of the Most Widely Used Pieces of Middleware at LINE > Daily inflow: 360 billion messages > User services: over 60 > Daily IN+OUT data: 550 TB
  7. Kinds of Reliability Engineering > SLO Visualization > Automation • Workflow consistency • Frees us from tedious work so we can concentrate on more advanced work > Architecture Proposal • Provide libraries • Demonstrate architectures for higher system availability/performance > Troubleshooting and Preventive Engineering
  8. SLI To Measure SLO > SLI = Service Level Indicator • A measurable value that directly represents service performance • Normally measured on the server (broker) side > API response time • For latency > API error rate • For availability
  9. SLI Measurement in Distributed Systems > Metrics from servers are not enough > Message delivery latency through Kafka = • Response time for the Producer request • + Time taken for fail-over, retries, and metadata updates • + Response time for the Consumer request • + Time taken for Consumer group coordination > The client library works cooperatively with the server cluster for higher availability • retry • failover
  10. Measure SLI From the Client's Viewpoint > We can't use metrics from actual service clients • Business logic affects measurements • We (the platform provider) must define the right model (config, usage) of "clients" > Measure the performance of the entire system/flow • What clients actually see after all attempts inside the client library
  11. Prober > Measures the cluster's performance from the client's (user's) viewpoint > A dedicated Kafka client that measures the performance of the entire stack > Runs distributed across multiple instances to • Simulate realistic usage of Kafka clients • Capture latency caused by fail-over • Cover distributed partition assignment • Avoid losing metrics during restarts (rolling restarts)
  12. Prober - Architecture (diagram: Prober instances containing Writer, Reader, and Reporter components, a Redis instance, and a Kafka cluster of Brokers 1-3; flow: 1. Produce → 2. Report write → 3. Consume → 4. Report read; failover duration is also measured)
  13. Prober - Architecture (Redis buckets) (diagram: the Writer records produce counts into time-ordered write buckets W1P, W2P, W3P, … and the Reader records consume counts into read buckets W1C, W2C, W3C, … in Redis; the Reporter compares paired buckets over time and exports the result as a Prometheus metric) > Loss Rate = (W1P − W1C) / W1P (a minimal code sketch of this bucket arithmetic follows below)
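
The bucket arithmetic can be made concrete with a short sketch. This is not Prober's actual implementation; the bucket width, array size, and function names below are assumptions chosen for illustration: the writer counts produced messages per time bucket, the reader counts consumed ones, and a per-bucket loss rate is derived once a bucket is closed.

    /* Minimal sketch of per-bucket loss-rate bookkeeping (hypothetical names). */
    #include <stdio.h>
    #include <time.h>

    #define NBUCKETS    64
    #define BUCKET_SECS 60                 /* assumed bucket width: 1 minute */

    static long produced[NBUCKETS];        /* WnP: produce counts per bucket */
    static long consumed[NBUCKETS];        /* WnC: consume counts per bucket */

    static int bucket_of(time_t ts) { return (int)((ts / BUCKET_SECS) % NBUCKETS); }

    static void on_produce(time_t ts) { produced[bucket_of(ts)]++; }
    static void on_consume(time_t ts) { consumed[bucket_of(ts)]++; }

    /* Loss rate = (WnP - WnC) / WnP, reported once a bucket is closed. */
    static double loss_rate(int b)
    {
        return produced[b] == 0 ? 0.0
             : (double)(produced[b] - consumed[b]) / (double)produced[b];
    }

    int main(void)
    {
        time_t now = time(NULL);
        for (int i = 0; i < 1000; i++) on_produce(now);
        for (int i = 0; i <  998; i++) on_consume(now);
        printf("loss rate = %.6f\n", loss_rate(bucket_of(now)));  /* 0.002000 */
        return 0;
    }

In the real Prober, per the slide, these counters live in Redis and the Reporter exports the resulting rate as a Prometheus metric.
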
  14. Prober - Scenario Configuration > Configure multiple "scenarios" • Scenario = Client Impl × Client Config × Throughput/Latency Config • Measure consistent metrics across different setups / different requirements > Add 15 lines of YAML, get all measurements
  15. Despite Our Engineering Efforts… > Trouble happens • Again, large scale + high service level requirements = trouble factory > What makes the real difference is our attitude toward troubles that are difficult to solve • Dig out the essential root cause? • Or just work around it and forget?
  16. Fundamental Solution Making > Our team's philosophy is to always pursue a "Fundamental Solution" • E.g., given a GC problem, • Try changing JVM options around and see what happens? • Or do a full inspection to figure out why GC is taking so long? > Making a fundamental solution is better than a workaround • It accumulates deep knowledge inside us • It is applicable to similar but different issues in the same domain • It leaves permanent improvements
  17. Symptom > API response time degradation at the 99th percentile • Violates the SLO! > A broker drops out of the cluster due to ZooKeeper session timeout • If it happens to two brokers, it's an OUTAGE!
  18. Observation - Thread Utilization Increase > The graph shows the threads' "idle ratio", so utilization is the inverse > Increased equally for all threads • Request handler threads • Network threads
  19. Observation - JVM Pause Time Increase 2019-05-13T01:42:44.936+0900: 1423318.149: Total time for which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds
  20. Study - "Stopping Threads" Rarely Increases http://psy-lob-saw.blogspot.com/2015/12/safepoints.html > "safepoint": moments when the JVM stops all application threads to safely execute unsafe operations (GC, jstack, etc.) > "safepoint sync": the period during which the JVM waits for all threads to voluntarily enter the pause • => This is the part that is increasing
  21. Guess What's Going On > Long-running safepoint sync > + Slow response times from disks > = The JVM performs slow IO inside the safepoint sync/operation...? > In fact, the JVM writes a log line on every safepoint occurrence
  22. Study - Does the write(2) System Call Block? > Not in most cases > write(2) just sets the "dirty" flag on the page > The kernel's background writeback task periodically writes dirty pages back to the block device > In that case, calling write(2) is just a memory copy (diagram: write(2) system call → pages in the page cache → writeback task → disk) — a small C sketch illustrating this follows below
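
To illustrate the point on this slide, here is a minimal, self-contained C sketch (file name and buffer size are arbitrary): a plain write(2) normally returns after copying into the page cache and dirtying the page, while fsync(2) is the call that actually waits for the block device.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    static long elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
    }

    int main(void)
    {
        int fd = open("dirty-page-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        struct timespec t0, t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        write(fd, buf, sizeof(buf));   /* usually just a memory copy + dirty flag */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fsync(fd);                     /* explicitly waits for the disk */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("write(2): %ld us, fsync(2): %ld us\n",
               elapsed_us(t0, t1), elapsed_us(t1, t2));
        close(fd);
        return 0;
    }
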
  23. Measure write(2) System Call's Duration > Using a SystemTap script *1 *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp > SystemTap: a dynamic tracing tool • Lets you inspect events inside the Linux kernel • Provides APIs for scripting probe kernel.function("sys_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function("sys_write").return { duration = gettimeofday_us() - time_entry[tid()] ...
  24. Measure write(2) System Call’s Duration Sun May 19 00:32:03 2019,724

    SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
  25. So the Premise Has Broken > Even a write(2) system call that appends file content can block in some situations > Then, are we ready to start considering a solution?
  26. Fundamental Solution Born From Fundamental Understanding > We don't know yet why write(2) can occasionally block > Without knowing what's going on underneath, we might pick the wrong option > => Try stack sampling when experiencing a slow write(2)
  27. However… > We had to take the broker on the problematic machine out of service because performance was getting worse daily > The problematic machine's partially broken HDD was accidentally replaced during maintenance • I caused a kernel crash => asked for a hard reboot => "We replaced the broken HDD at the same time :)"
  28. Lost the Reproduction Environment, but… > Don't give up! > Two things to reproduce: • The write workload of a Kafka broker • A slowed-down disk device
  29. Write Workload by Kafka Broker > A tiny C program that measures each write(2) call's duration (a sketch of such a program follows below)
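
The transcript does not include the program itself; below is a minimal sketch of the same idea under stated assumptions (append-only fixed-size writes at a modest rate, a 100 ms reporting threshold, an arbitrary file name): time every write(2) with clock_gettime() and report the slow calls.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file name; the real workload appends to the GC log. */
        int fd = open("write-workload.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char line[256];
        memset(line, 'x', sizeof(line) - 1);
        line[sizeof(line) - 1] = '\n';

        for (;;) {            /* runs until interrupted */
            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            if (write(fd, line, sizeof(line)) < 0) { perror("write"); return 1; }
            clock_gettime(CLOCK_MONOTONIC, &end);

            long us = (end.tv_sec - start.tv_sec) * 1000000L
                    + (end.tv_nsec - start.tv_nsec) / 1000L;
            if (us > 100000)  /* report writes slower than 100 ms */
                printf("write(2) took %ld us\n", us);

            usleep(10000);    /* ~100 appends/sec, roughly log-like pacing */
        }
        return 0;
    }
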
  30. Slowing Down the Disk Device > Slowing down the disk != slowing down write(2) • We're investigating why write(2) slows down > Attempted to hook `generic_make_request` and inject a delay with SystemTap to simulate a slow disk • but it failed... SystemTap is designed for safe, bounded-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 … ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5]
  31. Writing a Kernel Module Using kprobe > kprobe: a Linux kernel API that provides the capability to hook an arbitrary function's entry/return and insert code (SystemTap uses it as its backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c (a minimal sketch of this technique follows below)
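
The module the speaker used is the one in the linked gist; below is a minimal, hedged sketch of the same technique, assuming a kernel that still has `generic_make_request`: a kprobe whose pre-handler busy-waits so every submitted block I/O request appears slow. The 200 ms delay and the lack of per-device filtering are simplifications.

    /* Minimal sketch (not the actual kprobe_delay.c from the gist): delay every
     * block I/O submission to simulate a slow disk. Build as an out-of-tree
     * module; load with insmod, remove with rmmod. */
    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <linux/delay.h>

    static int delay_ms = 200;              /* illustrative delay per request */
    module_param(delay_ms, int, 0644);

    /* Runs just before generic_make_request(); busy-waits to add latency. */
    static int delay_pre(struct kprobe *p, struct pt_regs *regs)
    {
        mdelay(delay_ms);
        return 0;
    }

    static struct kprobe kp = {
        .symbol_name = "generic_make_request",
        .pre_handler = delay_pre,
    };

    static int __init kprobe_delay_init(void)
    {
        return register_kprobe(&kp);
    }

    static void __exit kprobe_delay_exit(void)
    {
        unregister_kprobe(&kp);
    }

    module_init(kprobe_delay_init);
    module_exit(kprobe_delay_exit);
    MODULE_LICENSE("GPL");

Unlike the SystemTap attempt on the previous slide, a raw kprobe handler has no overload protection, which is exactly what makes it usable for injecting an arbitrary delay.
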
  32. Successfully Reproduced! > Found that `[jbd2/sda3-8]` kernel task (ext4's journaling

    daemon) is blocked > but couldn't explain why it affects `write(2)`'s processing time… Pid: 4390, comm: jbd2/sda3-8 Not tainted ... Call Trace: … [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>> [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120 [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0 [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2] … [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2] [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0
  33. Approaching From Another Side > Break down the source of latency in write(2)'s call graph using another SystemTap script *1 • call-graph nodes shown: vfs_write(), do_sync_write(), ext4_file_write(), rw_verify_area() probe kernel.function("vfs_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function("vfs_write").return { duration = gettimeofday_us() - time_entry[tid()] ... *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py
  34. Found It's Waiting on a Lock sys_write took 3890322 us in total: find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317
  35. Analyze Lock Contention > Hook the lock acquisition/release points: `__lock_page` and `unlock_page` > Figured out that it's a cascading issue (diagram): the application calling write(2) is waiting to lock the page; the page is held by the writeback task, which is waiting for the buffer to become writable; the buffer is held up by ext4's journaling daemon, which is blocked on the slow disk
  36. Now Ready To Think About a Solution > Use a filesystem other than ext4, or disable journaling > Use tmpfs as the first-level storage for the GC log, with real-time backup to persistent storage > => Either would solve the problem
  37. Summary > Our definition of Reliability Engineering covers a wide range of responsibilities > SLI measurement • Measure SLIs from the client's (user's) viewpoint in distributed systems > Troubleshooting / Optimization • Fundamental solutions come from fundamental understanding of the problem • Accumulate knowledge • KAFKA-4614, KAFKA-7504 => IO in JVM pause, blocking reads/writes