Reliability Engineering Behind The Most Trusted Kafka Platform

2019 DevDay Reliability Engineering Behind the Most Trusted Kafka Platform
> Yuto Kawamura > LINE Z Part Team Software Engineer

Speaker > Yuto Kawamura > Senior Software Engineer @ Z
Part > Lead of Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past speaking • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018

Apache Kafka At LINE

> ❌ How we use Kafka (or How great Kafka
is) Today’s Message > ❌ Best Practices for Using Kafka > ✅ Our Reliability Engineering Philosophy, Applicable Techniques of Reliability Engineering

> High-scalability/High-performance distributed system > Default data persistency > Streaming
Data Middleware Apache Kafka > Pub-Sub model for distributing data > Written in Scala, run on JVM

> Background asynchronous task processing > Event propagation among services
Typical Usage Service A Service B Service C Service A WebApp Service A  Processor Queueing Pub-Sub

Daily Inflow Messages 360 Billion User Services Over 60 Daily
IN+OUT Data 550 TB Kafka Is One of Most Major Middleware at LINE

Service Level Objective (SLO) Cluster Availability 99.999% 99th Percentile Response
Time < 40ms

Challenges Combined 360 Billion / Day Traffic Scale < 40ms
Response Time SLO Extreme Challenge

> SLO Visualization Reliability Engineering Kinds > Automation • Workflow
consistency • Free us from bothering work to concentrate on advanced works > Architecture Proposal • Provide library • Demonstrate Architecture for higher system availability/performance > Troubleshooting and Preventive Engineering

SLO Visualization

> API response time • For latency > API error
rate • For availability SLI To Measure SLO > SLI = Service Level Indicator • A measurable value which represents the service performance directly > Normally measured in server (broker) side

> Metrics from servers are not enough SLI Measurement in
Distributed Systems > Message Delivery Latency through Kafka = • Response time for Producer request • + Time taken for fail-over, retries, metadata update • + Response time for Consumer request • + Time taken for Consumer group coordination > Client library works cooperatively to server clusters for higher availability • retry • failover

> We can’t use metrics from actual service clients •
Business logic affects measurements • We (platform provider) must define the right model (config, usage) of “clients” Measure SLI From Client’s Viewpoint > Measure performance of entire system/flow • What clients actually see after all attempts inside client library

> Measure Performance of cluster from client (user)’s viewpoint Prober
> A dedicated Kafka clients to measure performance of entire stack > Run Distributed with multiple instances to • Simulate realistic usage of Kafka clients • Latency by fail-over • Distributed partitions assignment • To not lost metrics during restart (rolling restart)

Prober - Architecture Redis Prober Prober Kafka Cluster Broker 1
Broker 1 Broker 1 Broker 2 Broker 1 Broker 3 Writer Reader Reporter 1. Produce 2. Report Write 3. Consume 4. Report Read Failover Duration

Redis Prober - Architecture Prober Writer Reader Reporter W2P W1P
W3P … W2C W1C W3C … Write Buckets Read Buckets W1P - W2C  ————— = Loss Rate W1P Produce Consume Prometheus Metric Time

Prober - Scenario Configuration > Configure Multiple “Scenarios” • Scenario
= Client Impl x Client Config x Throughput/Latency Config • Measure Consistent Metrics for different setup / different requirement > Add 15 lines of YAML, get all measurements

SLI Dashboard

Troubleshooting

Despite of our Engineering Effort… > Trouble happens • Again,
large scale + high service level requirement = Trouble Factory > What makes the last difference is attitude against troubles difficult to solve • Try to clear out essential root cause? • Or just workaround and forget?

Fundamental Solution Making > Our Team’s philosophy is to always
persuade “Fundamental Solution” • E.g, given GC program, • Try changing JVM options around and see what happens? • Or do full-inspection to clear out why GC’s taking so long? > Fundamental Solution Making is better than workaround • Accumulates deep knowledge inside us • Applicable for similar but different issues in the same domain • Leaves improvements permanently

Troubleshooting - an Example

Symptom > API Response time degradation in 99th %ile •
Violates SLO! > Drop out from cluster by Zookeeper session timeout • If *2 then OUTAGE!

Observation - Thread Utilization Increase > Graph is showing threads
"idle ratio" so util is reverse > Equally for all threads • Request handler threads • Network threads

Observation - JVM Pause Time Increase 2019-05-13T01:42:44.936+0900: 1423318.149: Total Time
for Which Application Threads Were Stopped: 1.1904424 Seconds, Stopping Threads Took: 1.1883390 Seconds

Study - “Stopping Threads” Rarely Increase http://psy-lob-saw.blogspot.com/2015/12/safepoints.html > "safepoint" :
Moments in JVM that it stops all app threads to safely execute unsafe ops (GC, jstack, etc…) > "safepoint sync" : Moments that JVM is waiting all threads to voluntarily enter pause • => This is the increasing part

Observation - Disk Access Time Increase > At the same
time we saw the symptom

Guess What’s Going On > Long-running safepoint sync > +
Slow response time from disks > = JVM performs slow IO inside safepoint sync/operation ...? > In fact JVM writes logs on every safepoint occurrence

Study - Does write(2) System Call Blocks? > False for
the most of the cases > write(2) just enables "dirty flag" of the page > Kernel's background task periodically "writeback" dirty pages into block devices > In case calling write(2) is just about copying memory write(2) System Call Page Page Page Writeback Task Disk

Measure write(2) System Call’s Duration > Using SystemTap script *1
*1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp > SystemTap : dynamic tracing tool • Enables to inspect events inside the Linux kernel • APIs for scripting probe kernel.function("sys_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function(“sys_write").return { duration = gettimeofday_us() - time_entry[tid()] ...

Measure write(2) System Call’s Duration Sun May 19 00:32:03 2019,724
SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]

So the Premise Has Broken > Even write(2) system calls
for appending file content blocks in some situation > Then are we ready to start considering solution?

> We don't know yet why write(2) can block occasionally
> Without knowing what's going on underneath, we might take wrong option > => Try stack sampling when experiencing slow write(2) Fundamental Solution Born From Fundamental Understanding

However… > We had to service out the broker from
problematic machine because performance was getting worse daily > The problematic machine's partially-broken HDD has been replaced accidentally by maintenance • I made a kernel crash => Ask hard reboot => "Replaced broken HDD at that same time :)" #

Lost Reproducible Environment but… > Don’t give up! > Two
things to reproduce : • Write workload by Kafka broker • Slowing down disk device

Write Workload by Kafka Broker > Tiny C program with
measuring write(2) call duration

Slowing Down Disk Device > Disk slowing down != write(2)
slowing down • We're now investigating why write(2) slows down > Attempted to hook `generic_make_request` and inject delay to simulate slow disk with SystemTap • but failed... well SystemTap is designed for safe-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 … ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5]

$Use kprobe Writing Kernel Module > kprobe : Linux kernel
API that provides capability to hook arbitrary function's enter/return and insert code (SystemTap uses it as the backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c

Successfully Reproduced! > Found that `[jbd2/sda3-8]` kernel task (ext4's journaling
daemon) is blocked > but couldn't explain why it affects `write(2)`'s processing time… Pid: 4390, comm: jbd2/sda3-8 Not tainted ... Call Trace: … [<ffffffffxxxxxxxx>] ? generic_make_request+0x1/0x5a0 <<EOE>> [<ffffffffxxxxxxxx>] ? submit_bio+0x70/0x120 [<ffffffffxxxxxxxx>] ? submit_bh+0x11d/0x1f0 [<ffffffffxxxxxxxx>] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2] … [<ffffffffxxxxxxxx>] ? kjournald2+0x0/0x220 [jbd2] [<ffffffffxxxxxxxx>] ? kthread+0x9e/0xc0

Approaching From Another Side > Breakdown source of latency in
write(2)'s call-graph using another SystemTap script *1 vfs_write() do_sync_write() ext4_file_write() rw_verify_area() probe kernel.function("vfs_write") { time_entry[tid()] = gettimeofday_us() ... probe kernel.function("vfs_write").return { duration = gettimeofday_us() - time_entry[tid()] ... *1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py

Adding Suspicious Part One-by-One > … 14 attempts in total.

Found It’s Waiting on Lock sys_write took 3890322 us in
total:, find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317

Analyze Lock Contention > Hooking lock acquisition/release points: `__lock_page` and
`unlock_page` > Figured out that it's cascading issue Application Calling write(2) Writeback Task ext4’s Journaling Daemon Block Waiting To Lock Page Waiting Buffer to  Become Writable

Issue Overview

Overview - GC Logging Inside Safepoint

Overview - Writeback Task Blocked by Journaling

Overview - Journaling Task Blocked by Slow Response From Disk

Now Ready To Think Solution > Use filesystem other than
ext4 or disable journaling > Use tmpfs as 1st storage to write GC log with realtime persistent backup > => Both could solved the problem

Summary > Our definition of Reliability Engineering covers wide range
of responsibility > SLI measurement • Client’s (user’s) viewpoint for SLI measurement in distributed systems > Troubleshooting / Optimization • Fundamental solution from Fundamental problem understanding • Accumulating knowledge • KAFKA-4614, KAFKA-7504 => IO in JVM pause, read/write block

End of Presentation

Reliability Engineering Behind The Most Trusted...

Reliability Engineering Behind The Most Trusted Kafka Platform

More Decks by LINE DevDay 2019

Other Decks in Technology

Featured

Transcript