2019 DevDay
Reliability Engineering Behind the Most Trusted Kafka Platform
> Yuto Kawamura
> LINE Z Part Team Software Engineer
Slide 2
Speaker
> Yuto Kawamura
> Senior Software Engineer @ Z Part
> Lead of Kafka Platform team
> ❤ Performance Engineering
> Apache Kafka Contributor
> Past talks
• Kafka Summit 2017/2018
• LINE DEVELOPER DAY 2018
Slide 3
Apache Kafka At LINE
Slide 4
Today’s Message
> ❌ How we use Kafka (or how great Kafka is)
> ❌ Best practices for using Kafka
> ✅ Our reliability engineering philosophy and applicable techniques of reliability engineering
Slide 5
Apache Kafka
> Streaming data middleware
> High-scalability / high-performance distributed system
> Data persistence by default
> Pub-Sub model for distributing data
> Written in Scala, runs on the JVM
Slide 6
Typical Usage
> Background asynchronous task processing
> Event propagation among services
(Diagram: Queueing: Service A WebApp -> Service A Processor; Pub-Sub: Service A -> Service B, Service C)
Slide 7
Kafka Is One of the Most Important Pieces of Middleware at LINE
> Daily inflow messages: 360 billion
> User services: over 60
> Daily IN+OUT data: 550 TB
Slide 8
Service Level Objective (SLO)
> Cluster availability: 99.999%
> 99th percentile response time: < 40 ms
Slide 9
Challenges Combined
> 360 billion messages / day traffic scale
> + < 40 ms response time SLO
> = An extreme challenge
Slide 10
Kinds of Reliability Engineering
> SLO Visualization
> Automation
• Workflow consistency
• Frees us from tedious work so we can concentrate on more advanced work
> Architecture Proposal
• Provide libraries
• Demonstrate architectures for higher system availability/performance
> Troubleshooting and Preventive Engineering
Slide 11
SLO Visualization
Slide 12
SLIs To Measure the SLO
> SLI = Service Level Indicator
• A measurable value which directly represents the service’s performance
> API response time
• For latency
> API error rate
• For availability
> Normally measured on the server (broker) side
Slide 13
SLI Measurement in Distributed Systems
> Metrics from servers are not enough
> Message delivery latency through Kafka =
• Response time of the Producer request
• + Time taken for fail-over, retries, and metadata updates
• + Response time of the Consumer request
• + Time taken for Consumer group coordination
> The client library works cooperatively with the server cluster for higher availability
• Retry
• Failover
Slide 14
Measure SLIs From the Client’s Viewpoint
> We can’t use metrics from actual service clients
• Business logic affects the measurements
• We (the platform provider) must define the right model (config, usage) of “clients”
> Measure the performance of the entire system/flow
• What clients actually see after all attempts inside the client library
Slide 15
Prober
> Measures the performance of the cluster from the client (user)’s viewpoint
> A dedicated Kafka client that measures the performance of the entire stack
> Runs distributed across multiple instances to
• Simulate realistic usage of Kafka clients
• Latency caused by fail-over
• Distributed partition assignment
• Avoid losing metrics during restarts (rolling restart)
Prober - Scenario Configuration
> Configure multiple “scenarios”
• Scenario = Client Impl x Client Config x Throughput/Latency Config
• Measure consistent metrics for different setups / different requirements
> Add 15 lines of YAML, get all the measurements
Slide 19
SLI Dashboard
Slide 20
Troubleshooting
Slide 21
Despite Our Engineering Effort…
> Trouble happens
• Again, large scale + high service level requirements = a trouble factory
> What makes the final difference is our attitude toward troubles that are difficult to solve
• Try to dig out the essential root cause?
• Or just work around it and forget?
Slide 22
Fundamental Solution Making
> Our team’s philosophy is to always pursue the “fundamental solution”
• E.g., given a GC problem,
• Try changing JVM options around and see what happens?
• Or do a full inspection to find out why GC is taking so long?
> Making a fundamental solution is better than a workaround
• It accumulates deep knowledge inside us
• It is applicable to similar but different issues in the same domain
• It leaves improvements permanently
Slide 23
Troubleshooting - an Example
Slide 24
Symptom
> API response time degradation at the 99th percentile
• Violates the SLO!
> Brokers dropping out of the cluster due to ZooKeeper session timeouts
• If two brokers drop out at once, it’s an OUTAGE!
Slide 25
Observation - Thread Utilization Increase
> The graph shows the threads’ “idle ratio”, so utilization is the inverse
> Increased equally for all threads
• Request handler threads
• Network threads
Slide 26
Observation - JVM Pause Time Increase
2019-05-13T01:42:44.936+0900: 1423318.149: Total time for which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds
Slide 27
Study - “Stopping Threads” Rarely Increases
http://psy-lob-saw.blogspot.com/2015/12/safepoints.html
> “safepoint”: a moment when the JVM stops all app threads to safely execute unsafe ops (GC, jstack, etc…)
> “safepoint sync”: the moment when the JVM is waiting for all threads to voluntarily enter the pause
• => This is the part that is increasing
Slide 28
Observation - Disk Access Time Increase
> Increased at the same time we saw the symptom
Slide 29
Guess What’s Going On
> Long-running safepoint sync
> + Slow response time from disks
> = The JVM performs slow IO inside the safepoint sync/operation ...?
> In fact, the JVM writes a log line on every safepoint occurrence
Slide 30
Study - Does the write(2) System Call Block?
> Not in most cases
> write(2) just sets the “dirty flag” on the page
> The kernel’s background task periodically “writes back” dirty pages to the block device
> In that case, calling write(2) is just about copying memory
(Diagram: the write(2) system call dirties pages in the page cache; the writeback task later flushes them to disk)
Slide 31
Measure the write(2) System Call’s Duration
> Using a SystemTap script *1
*1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp
> SystemTap: a dynamic tracing tool
• Enables inspecting events inside the Linux kernel
• Provides APIs for scripting
global time_entry
probe kernel.function("sys_write") {
    time_entry[tid()] = gettimeofday_us()
}
probe kernel.function("sys_write").return {
    duration = gettimeofday_us() - time_entry[tid()]
    if (duration > 100000)  # report slow calls; threshold illustrative, full script in the gist above
        printf("slow sys_write: %d us (tid=%d)\n", duration, tid())
    delete time_entry[tid()]
}
So the Premise Has Been Broken
> Even write(2) system calls that append file content block in some situations
> Then are we ready to start considering a solution?
Slide 34
Fundamental Solutions Are Born From Fundamental Understanding
> We don't yet know why write(2) can occasionally block
> Without knowing what's going on underneath, we might pick the wrong option
> => Try stack sampling when experiencing a slow write(2)
Slide 35
However…
> We had to take the broker on the problematic machine out of service because performance was getting worse daily
> The problematic machine's partially-broken HDD was accidentally replaced during maintenance
• I caused a kernel crash => asked for a hard reboot => "We replaced the broken HDD at the same time :)"
Slide 36
Lost the Reproducible Environment, but…
> Don’t give up!
> Two things to reproduce:
• The write workload of a Kafka broker
• A slowed-down disk device
Slide 37
Write Workload by Kafka Broker
> A tiny C program that measures each write(2) call's duration
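The program itself is only referenced on the slide; as a minimal sketch, assuming a plain append loop over a test file with a 4 KB record size and an illustrative reporting threshold, it might look like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Append fixed-size records to a file forever, timing each write(2) call. */
int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "testfile";
    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &e);

        long us = (e.tv_sec - s.tv_sec) * 1000000L + (e.tv_nsec - s.tv_nsec) / 1000;
        if (us > 100000)  /* report writes slower than 100 ms (illustrative threshold) */
            printf("slow write(2): %ld us\n", us);
    }
}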
Slide 38
Slowing Down the Disk Device
> Slowing down the disk != slowing down write(2)
• Because why write(2) slows down is exactly what we're now investigating
> Attempted to hook `generic_make_request` and inject a delay with SystemTap, to simulate a slow disk
• but failed... SystemTap is deliberately designed for safe, bounded-overhead execution
$ stap ... -e 'probe ioblock.request {
    for (i = 0; i < 2147483647; i++) {
      for (j = 0; j < 2147483647; j++)
        { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1
…
ERROR: MAXACTION exceeded near identifier 'target' at
/usr/share/systemtap/tapset/linux/context.stp:380:10
Pass 5: run completed in 0usr/30sys/914real ms.
Pass 5: run failed. [man error::pass5]
Slide 39
Use kprobes by Writing a Kernel Module
> kprobe: a Linux kernel API that provides the capability to hook an arbitrary function's entry/return and insert code (SystemTap uses it as its backend)
* https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
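The actual module is in the gist above; purely as an illustration of the approach, a minimal kprobe module that busy-delays every block I/O submission (assuming `generic_make_request` as the hook point, a fixed delay, and no per-device filtering) might look like this:

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/delay.h>

/* Busy-delay every block I/O submission to simulate a slow disk. */
static int delay_pre(struct kprobe *p, struct pt_regs *regs)
{
    mdelay(100);  /* illustrative 100 ms; mdelay busy-waits, so keep it modest */
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "generic_make_request",
    .pre_handler = delay_pre,
};

static int __init kprobe_delay_init(void)
{
    return register_kprobe(&kp);
}

static void __exit kprobe_delay_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kprobe_delay_init);
module_exit(kprobe_delay_exit);
MODULE_LICENSE("GPL");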
Slide 40
Successfully Reproduced!
> Found that the `[jbd2/sda3-8]` kernel task (ext4's journaling daemon) is blocked
> but we couldn't explain why it affects `write(2)`'s processing time…
Pid: 4390, comm: jbd2/sda3-8 Not tainted ...
Call Trace:
…
[] ? generic_make_request+0x1/0x5a0
[] ? submit_bio+0x70/0x120
[] ? submit_bh+0x11d/0x1f0
[] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2]
…
[] ? kjournald2+0x0/0x220 [jbd2]
[] ? kthread+0x9e/0xc0
Slide 41
Approaching From Another Side
> Break down the source of latency within write(2)'s call graph using another SystemTap script *1
> Functions probed along the call graph: vfs_write(), do_sync_write(), ext4_file_write(), rw_verify_area()
probe kernel.function("vfs_write") {
    time_entry[tid()] = gettimeofday_us()
    ...
probe kernel.function("vfs_write").return {
    duration = gettimeofday_us() - time_entry[tid()]
    ...
*1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py
Slide 42
Adding Suspicious Parts One by One
> … 14 attempts in total.
Slide 43
Found It’s Waiting on a Lock
sys_write took 3890322 us in total:
find_get_page=1,
__lock_page=3890280,
find_lock_page=3890290,
grab_cache_page_write_begin=3890294,
vfs_write=3890317
Slide 44
Analyze Lock Contention
> Hooking the lock acquisition/release points: `__lock_page` and `unlock_page`
> Figured out that it's a cascading issue
(Diagram: the application calling write(2) is waiting to lock a page; the writeback task holding it is waiting for the buffer to become writable; that wait is in turn blocked on ext4's journaling daemon)
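As a rough sketch of the waiting-side half of such an analysis (an assumption, not the actual tooling used on the slide: it only times how long tasks block inside `__lock_page`, while identifying the holder would additionally require hooking `unlock_page`), a kretprobe-based module could look like this:

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ktime.h>
#include <linux/sched.h>

/* Time how long each task blocks waiting in __lock_page (the contended path). */
struct lock_wait { ktime_t start; };

static int wait_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    ((struct lock_wait *)ri->data)->start = ktime_get();
    return 0;
}

static int wait_return(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    s64 us = ktime_to_us(ktime_sub(ktime_get(), ((struct lock_wait *)ri->data)->start));
    if (us > 100000)  /* report waits longer than 100 ms (illustrative threshold) */
        pr_info("__lock_page: %s waited %lld us\n", current->comm, us);
    return 0;
}

static struct kretprobe lock_page_rp = {
    .kp.symbol_name = "__lock_page",
    .entry_handler  = wait_entry,
    .handler        = wait_return,
    .data_size      = sizeof(struct lock_wait),
    .maxactive      = 64,
};

static int __init lockwait_init(void) { return register_kretprobe(&lock_page_rp); }
static void __exit lockwait_exit(void) { unregister_kretprobe(&lock_page_rp); }

module_init(lockwait_init);
module_exit(lockwait_exit);
MODULE_LICENSE("GPL");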
Slide 45
Issue Overview
Slide 46
Overview - GC Logging Inside Safepoint
Slide 47
Overview - Writeback Task Blocked by Journaling
Slide 48
Overview - Journaling Task Blocked by Slow Response From Disk
Slide 49
Now Ready To Think About Solutions
> Use a filesystem other than ext4, or disable journaling
> Use tmpfs as the first-level storage for the GC log, with a real-time persistent backup
> => Both could solve the problem
Slide 50
Summary
> Our definition of Reliability Engineering covers a wide range of responsibilities
> SLI measurement
• Take the client's (user's) viewpoint for SLI measurement in distributed systems
> Troubleshooting / Optimization
• Fundamental solutions come from a fundamental understanding of the problem
• Accumulating knowledge
• KAFKA-4614, KAFKA-7504 => IO during JVM pauses, blocking reads/writes