Reliability Engineering Behind The Most Trusted Kafka Platform

Yuto Kawamura
LINE Z Part Team Software Engineer
https://linedevday.linecorp.com/jp/2019/sessions/A1-4

LINE DevDay 2019

November 20, 2019

Transcript

  1. 2019 DevDay
    Reliability Engineering Behind the
    Most Trusted Kafka Platform
    > Yuto Kawamura
    > LINE Z Part Team Software Engineer

  2. Speaker
    > Yuto Kawamura
    > Senior Software Engineer @ Z Part
    > Lead of Kafka Platform team
    > ❤ Performance Engineering
    > Apache Kafka Contributor
    > Past talks
    • Kafka Summit 2017/2018
    • LINE DEVELOPER DAY 2018

  3. Apache Kafka At LINE

  4. Today’s Message
    > ❌ How we use Kafka (or how great Kafka is)
    > ❌ Best practices for using Kafka
    > ✅ Our reliability engineering philosophy and applicable reliability
    engineering techniques

  5. Apache Kafka
    > Streaming data middleware
    > High-scalability / high-performance distributed system
    > Data persistency by default
    > Pub-Sub model for distributing data
    > Written in Scala, runs on the JVM

  6. Typical Usage
    > Background asynchronous task processing (queueing)
    > Event propagation among services (pub-sub)
    (Diagram: a WebApp produces to Kafka and a Processor consumes, as a
    queue; Service A produces and Services A, B, and C consume, as pub-sub)

  7. Kafka Is One of the Most Widely Used Middleware Systems at LINE
    > Daily inflow messages: 360 billion
    > User services: over 60
    > Daily IN+OUT data: 550 TB

  8. Service Level Objective (SLO)
    > Cluster availability: 99.999%
    > 99th percentile response time: < 40 ms

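As a sanity check on how strict this target is, the downtime budget implied by a 99.999% availability SLO can be computed directly. A minimal sketch (the helper name is ours, not from the talk):

```python
def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Seconds of allowed downtime for a given availability over a period."""
    return (1.0 - availability) * period_seconds

YEAR_SECONDS = 365 * 24 * 3600

# Five nines leave roughly five minutes of downtime per year.
budget = downtime_budget_seconds(0.99999, YEAR_SECONDS)
print(f"{budget:.0f} s (~{budget / 60:.1f} min) per year")  # → 315 s (~5.3 min) per year
```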
  9. Challenges Combined
    > Traffic scale: 360 billion messages / day
    > SLO: < 40 ms response time
    > Combined: an extreme challenge

  10. Kinds of Reliability Engineering
    > SLO visualization
    > Automation
    • Workflow consistency
    • Frees us from toil so we can concentrate on more advanced work
    > Architecture proposals
    • Provide libraries
    • Demonstrate architectures for higher system availability/performance
    > Troubleshooting and preventive engineering

  11. SLO Visualization

  12. SLI To Measure SLO
    > SLI = Service Level Indicator
    • A measurable value that directly represents the service’s performance
    > API response time (for latency)
    > API error rate (for availability)
    > Normally measured on the server (broker) side

  13. SLI Measurement in Distributed Systems
    > Metrics from servers are not enough
    > The client library works cooperatively with the server cluster for higher availability
    • retry
    • failover
    > Message delivery latency through Kafka =
    • Response time of the Producer request
    • + Time taken for fail-over, retries, and metadata updates
    • + Response time of the Consumer request
    • + Time taken for Consumer group coordination

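The decomposition above is just a sum over every stage a message passes through. A toy sketch (the numbers below are illustrative, not real measurements):

```python
def delivery_latency_ms(produce_ms: float,
                        failover_retry_metadata_ms: float,
                        consume_ms: float,
                        group_coordination_ms: float) -> float:
    """End-to-end delivery latency is the sum of all client- and
    server-side stages, not just the broker's response time."""
    return (produce_ms + failover_retry_metadata_ms
            + consume_ms + group_coordination_ms)

# Broker-side response times can look healthy (5 ms each) while
# client-side fail-over and coordination dominate what users see.
print(delivery_latency_ms(5, 120, 5, 30))  # → 160
```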
  14. Measure SLI From the Client’s Viewpoint
    > We can’t use metrics from actual service clients
    • Business logic affects measurements
    • We (the platform provider) must define the right model (config, usage) of
    “clients”
    > Measure the performance of the entire system/flow
    • What clients actually see after all attempts inside the client library

  15. Prober
    > Measures cluster performance from the client (user)’s viewpoint
    > A dedicated set of Kafka clients that measures the performance of the entire stack
    > Runs distributed, with multiple instances, to
    • Simulate realistic usage of Kafka clients
    • Capture latency caused by fail-over
    • Exercise distributed partition assignment
    • Avoid losing metrics during restarts (rolling restarts)

  16. Prober - Architecture
    (Diagram: multiple Prober instances, each with Writer, Reader, and
    Reporter components, run against the Kafka cluster (Broker 1, 2, 3) and
    Redis: 1. Produce, 2. Report write, 3. Consume, 4. Report read; the
    failover duration is measured across this flow)

  17. Prober - Architecture
    (Diagram: the Writer records produce counts into per-time write buckets
    (W1P, W2P, W3P, …) in Redis, and the Reader records consume counts into
    read buckets (W1C, W2C, W3C, …); the Reporter compares them over time and
    exposes the result as a Prometheus metric)

    (W1P - W2C) / W1P = Loss Rate

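A minimal sketch of the bucket comparison the Reporter performs. Assumed here: one produced counter and one consumed counter per time bucket; the Redis layer and bucket bookkeeping are simplified away:

```python
def loss_rate(produced: int, consumed: int) -> float:
    """Fraction of messages produced into a time bucket that the
    consumer side never observed."""
    if produced == 0:
        return 0.0
    return (produced - consumed) / produced

# One bucket: 1000 messages produced, 990 consumed back -> 1% loss.
print(loss_rate(1000, 990))  # → 0.01
```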
  18. Prober - Scenario Configuration
    > Configure multiple “scenarios”
    • Scenario = client impl x client config x throughput/latency config
    • Measure consistent metrics across different setups / different requirements
    > Add 15 lines of YAML, get all the measurements

  19. SLI Dashboard

  20. Troubleshooting

  21. Despite Our Engineering Efforts…
    > Trouble happens
    • Again, large scale + high service level requirements = a trouble factory
    > What makes the difference in the end is our attitude toward troubles that are hard to solve
    • Try to clear out the essential root cause?
    • Or just work around it and forget?

  22. Fundamental Solution Making
    > Our team’s philosophy is to always pursue the “fundamental solution”
    • E.g., given a GC problem,
    • Try changing JVM options around and see what happens?
    • Or do a full inspection to clear out why GC is taking so long?
    > Making a fundamental solution is better than a workaround
    • It accumulates deep knowledge inside us
    • It applies to similar but different issues in the same domain
    • It leaves improvements permanently

  23. Troubleshooting - an Example

  24. Symptom
    > API response time degradation at the 99th percentile
    • Violates the SLO!
    > Brokers dropping out of the cluster due to ZooKeeper session timeout
    • If it happens on 2 brokers at once, it becomes an OUTAGE!

  25. Observation - Thread Utilization Increase
    > The graph shows the threads’ “idle ratio”, so utilization is its inverse
    > Affects all threads equally
    • Request handler threads
    • Network threads

  26. Observation - JVM Pause Time Increase
    2019-05-13T01:42:44.936+0900: 1423318.149: Total time for which
    application threads were stopped: 1.1904424 seconds, Stopping threads
    took: 1.1883390 seconds

  27. Study - “Stopping Threads” Rarely Increases
    http://psy-lob-saw.blogspot.com/2015/12/safepoints.html
    > “safepoint”: a moment at which the JVM stops all
    application threads to safely execute unsafe operations
    (GC, jstack, etc.)
    > “safepoint sync”: the period in which the JVM
    waits for all threads to voluntarily enter the pause
    • => This is the part that is increasing

  28. Observation - Disk Access Time Increase
    > At the same time we saw the symptom

  29. Guess What’s Going On
    > Long-running safepoint sync
    > + Slow response times from disks
    > = The JVM performs slow IO inside the safepoint sync/operation…?
    > In fact, the JVM writes a log line on every safepoint occurrence

  30. Study - Does the write(2) System Call Block?
    > False in most cases
    > write(2) just sets the page’s “dirty flag”
    > A kernel background task periodically “writes back”
    dirty pages to block devices
    > So in the common case, calling write(2) is just a memory copy
    (Diagram: the write(2) system call marks pages dirty; the writeback task
    flushes them to disk)

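The mechanism described above can be captured in a toy model. This only illustrates the dirty-flag / writeback idea; it does not touch the real kernel:

```python
# Toy page-cache model: write(2) only marks pages dirty in memory;
# a separate writeback pass is what actually reaches the disk.
class PageCache:
    def __init__(self) -> None:
        self.dirty: set[int] = set()
        self.on_disk: set[int] = set()

    def write(self, page: int) -> None:
        # Fast path: flag the page dirty; no device IO happens here.
        self.dirty.add(page)

    def writeback(self) -> None:
        # Kernel background task: flush dirty pages to the block device.
        self.on_disk |= self.dirty
        self.dirty.clear()

cache = PageCache()
cache.write(1)
cache.write(2)
print(sorted(cache.dirty))    # → [1, 2]
cache.writeback()
print(sorted(cache.on_disk))  # → [1, 2]
```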
  31. Measure write(2) System Call’s Duration
    > Using a SystemTap script *1
    *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp
    > SystemTap: a dynamic tracing tool
    • Lets you inspect events inside the Linux kernel
    • Provides APIs for scripting
    global time_entry
    probe kernel.function("sys_write") {
      time_entry[tid()] = gettimeofday_us()
      ...
    }
    probe kernel.function("sys_write").return {
      duration = gettimeofday_us() - time_entry[tid()]
      ...
    }

  32. Measure write(2) System Call’s Duration
    Sun May 19 00:32:03 2019,724 SYSCALL write from PID/TID 136422/136489 for FILE
    (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms
    Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel]
    Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel]
    0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so]
    0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so]
    0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so]
    0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so]
    0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so]
    0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure
    P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0
    [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750
    [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    (snip)
    0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so]
    0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
    0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]

  33. So the Premise Is Broken
    > Even a write(2) system call that merely appends file content can block in
    some situations
    > So, are we ready to start considering a solution?

  34. Fundamental Solutions Are Born From
    Fundamental Understanding
    > We don’t yet know why write(2) can block occasionally
    > Without knowing what’s going on underneath, we might take the wrong option
    > => Try stack sampling while experiencing a slow write(2)

  35. However…
    > We had to move the broker off the problematic machine because
    performance was getting worse daily
    > The problematic machine’s partially-broken HDD then got replaced,
    accidentally, during maintenance
    • I caused a kernel crash => asked for a hard reboot => “We replaced the
    broken HDD at the same time :)”

  36. Lost the Reproduction Environment, but…
    > Don’t give up!
    > Two things are needed to reproduce it:
    • The write workload of a Kafka broker
    • A slowed-down disk device

  37. Write Workload by Kafka Broker
    > A tiny C program that measures each write(2) call’s duration

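The original tool was a tiny C program; a rough Python equivalent of the measurement loop looks like this (timings will differ from C, but the shape of the experiment is the same):

```python
import os
import tempfile
import time

def timed_write(fd: int, data: bytes) -> float:
    """Duration of a single write(2) call, in microseconds."""
    start = time.monotonic()
    os.write(fd, data)  # os.write maps directly onto the write(2) syscall
    return (time.monotonic() - start) * 1e6

with tempfile.NamedTemporaryFile() as f:
    for _ in range(5):
        us = timed_write(f.fileno(), b"x" * 4096)
        # On a healthy system this is just a memory copy into the page
        # cache, so each call normally completes in microseconds; the
        # interesting events are the rare multi-second outliers.
        print(f"write(2) took {us:.0f} us")
```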
  38. Slowing Down the Disk Device
    > Disk slowing down != write(2) slowing down
    • We’re investigating why write(2) itself slows down
    > Attempted to hook `generic_make_request` with SystemTap and inject a
    delay to simulate a slow disk
    • …but failed: SystemTap is designed for safe, bounded-overhead execution
    $ stap ... -e 'probe ioblock.request {
        for (i = 0; i < 2147483647; i++) {
          for (j = 0; j < 2147483647; j++) {
            pl |= 1 ^ 255
          }
        }
      }' -D STP_NO_OVERLOAD -D MAXACTION=-1
    ERROR: MAXACTION exceeded near identifier 'target' at
    /usr/share/systemtap/tapset/linux/context.stp:380:10
    Pass 5: run completed in 0usr/30sys/914real ms.
    Pass 5: run failed. [man error::pass5]

  39. Use kprobe, Writing a Kernel Module
    > kprobe: a Linux kernel API that provides the capability to hook an arbitrary
    function’s entry/return and insert code (SystemTap uses it as its backend)
    * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c

  40. Successfully Reproduced!
    > Found that the `[jbd2/sda3-8]` kernel task (ext4’s journaling daemon) is blocked
    > …but couldn’t yet explain why it affects `write(2)`’s processing time
    Pid: 4390, comm: jbd2/sda3-8 Not tainted ...
    Call Trace:
    [] ? generic_make_request+0x1/0x5a0
    [] ? submit_bio+0x70/0x120
    [] ? submit_bh+0x11d/0x1f0
    [] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2]
    [] ? kjournald2+0x0/0x220 [jbd2]
    [] ? kthread+0x9e/0xc0

  41. Approaching From Another Side
    > Break down the source of latency in write(2)’s call graph using another
    SystemTap script *1
    vfs_write()
    do_sync_write()
    ext4_file_write()
    rw_verify_area()
    global time_entry
    probe kernel.function("vfs_write") {
      time_entry[tid()] = gettimeofday_us()
      ...
    }
    probe kernel.function("vfs_write").return {
      duration = gettimeofday_us() - time_entry[tid()]
      ...
    }
    *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py

  42. Adding Suspicious Part One-by-One
    > … 14 attempts in total.

  43. Found It’s Waiting on a Lock
    sys_write took 3890322 us in total:
    find_get_page=1,
    __lock_page=3890280,
    find_lock_page=3890290,
    grab_cache_page_write_begin=3890294,
    vfs_write=3890317

  44. Analyze Lock Contention
    > Hooked the lock acquisition/release points: `__lock_page` and `unlock_page`
    > Figured out that it’s a cascading issue:
    (Diagram: the application calling write(2) waits to lock a page; the
    writeback task holding that page waits for the buffer to become writable;
    ext4’s journaling daemon, blocked on the disk, is what holds it up)

  45. Issue Overview

  46. Overview - GC Logging Inside Safepoint

  47. Overview - Writeback Task Blocked by
    Journaling

  48. Overview - Journaling Task Blocked by Slow
    Response From Disk

  49. Now Ready To Think About a Solution
    > Use a filesystem other than ext4, or disable journaling
    > Use tmpfs as the first-tier storage for the GC log, with a real-time persistent backup
    > => Both would solve the problem

  50. Summary
    > Our definition of reliability engineering covers a wide range of responsibilities
    > SLI measurement
    • Take the client’s (user’s) viewpoint for SLI measurement in distributed systems
    > Troubleshooting / optimization
    • Fundamental solutions come from fundamental problem understanding
    • Accumulate knowledge
    • KAFKA-4614, KAFKA-7504 => IO in JVM pauses, read/write blocking

  51. End of Presentation