Slide 1

Slide 1 text

2019 DevDay > Reliability Engineering Behind the Most Trusted Kafka Platform > Yuto Kawamura > LINE Z Part Team, Software Engineer

Slide 2

Slide 2 text

Speaker > Yuto Kawamura > Senior Software Engineer @ Z Part > Lead of Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past speaking • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018

Slide 3

Slide 3 text

Apache Kafka At LINE

Slide 4

Slide 4 text

Today’s Message > ❌ How we use Kafka (or How great Kafka is) > ❌ Best Practices for Using Kafka > ✅ Our Reliability Engineering Philosophy, Applicable Techniques of Reliability Engineering

Slide 5

Slide 5 text

Apache Kafka > Streaming data middleware > High-scalability / high-performance distributed system > Data persistency by default > Pub-Sub model for distributing data > Written in Scala, runs on the JVM

Slide 6

Slide 6 text

Typical Usage > Background asynchronous task processing (queueing) > Event propagation among services (Pub-Sub) [Diagram: queueing (Service A WebApp → Service A Processor) and Pub-Sub (Service A → Services B and C)]

Slide 7

Slide 7 text

Kafka Is One of the Most Widely Used Middleware at LINE > Daily inflow messages: 360 billion > User services: over 60 > Daily IN+OUT data: 550 TB

Slide 8

Slide 8 text

Service Level Objective (SLO) > Cluster availability: 99.999% > 99th percentile response time: < 40 ms
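For a sense of how strict this is (a back-of-the-envelope figure, not stated on the slide): a 99.999% availability target leaves a downtime budget of roughly (1 - 0.99999) × 365 × 24 × 60 ≈ 5.3 minutes per year, so a single prolonged incident can consume the entire annual budget.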

Slide 9

Slide 9 text

Challenges > Traffic scale of 360 billion messages / day > Response time SLO of < 40 ms > Combined, an extreme challenge

Slide 10

Slide 10 text

Kinds of Reliability Engineering > SLO Visualization > Automation • Workflow consistency • Frees us from tedious work so we can concentrate on advanced work > Architecture Proposal • Provide libraries • Demonstrate architectures for higher system availability/performance > Troubleshooting and Preventive Engineering

Slide 11

Slide 11 text

SLO Visualization

Slide 12

Slide 12 text

SLI To Measure SLO > SLI = Service Level Indicator • A measurable value which directly represents the service's performance > API response time • For latency > API error rate • For availability > Normally measured on the server (broker) side

Slide 13

Slide 13 text

SLI Measurement in Distributed Systems > Metrics from servers are not enough > Message delivery latency through Kafka = • Response time for the Producer request • + Time taken for fail-over, retries, and metadata updates • + Response time for the Consumer request • + Time taken for Consumer group coordination > The client library works cooperatively with server clusters for higher availability • retry • failover

Slide 14

Slide 14 text

Measure SLI From the Client’s Viewpoint > We can’t use metrics from actual service clients • Business logic affects measurements • We (the platform provider) must define the right model (config, usage) of “clients” > Measure performance of the entire system/flow • What clients actually see after all attempts inside the client library

Slide 15

Slide 15 text

Prober > Measures performance of the cluster from the client (user)'s viewpoint > A dedicated Kafka client to measure performance of the entire stack > Runs distributed across multiple instances to • Simulate realistic usage of Kafka clients • Capture latency caused by fail-over • Cover distributed partition assignment • Not lose metrics during restarts (rolling restart)

Slide 16

Slide 16 text

Prober - Architecture [Diagram: Prober instances (Writer, Reader, Reporter) with a Kafka cluster (Brokers 1-3) and Redis; flow: 1. Produce → 2. Report write → 3. Consume → 4. Report read, measuring failover duration]

Slide 17

Slide 17 text

Prober - Architecture [Diagram: inside the Prober, the Writer, Reader, and Reporter accumulate produced and consumed counts per time window into write buckets (W1P, W2P, W3P, …) and read buckets (W1C, W2C, W3C, …), reported as Prometheus metrics over time]

Loss Rate = (W1P - W2C) / W1P
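As a minimal illustration of the bucket bookkeeping above (hypothetical names and window pairing, not the actual Prober code), the loss rate can be derived per time window from its produced and consumed counters:

/* Hypothetical sketch: per-window loss rate as reported by a prober-like tool.
 * The pairing of a write bucket with its read bucket is an assumption here. */
#include <stdio.h>

struct window_bucket {
    long produced;  /* messages the Writer produced in this time window */
    long consumed;  /* messages the Reader consumed back for this window */
};

static double loss_rate(const struct window_bucket *w)
{
    if (w->produced == 0)
        return 0.0;  /* nothing produced, nothing lost */
    return (double)(w->produced - w->consumed) / (double)w->produced;
}

int main(void)
{
    struct window_bucket w1 = { .produced = 100000, .consumed = 99998 };
    printf("loss rate = %.6f\n", loss_rate(&w1));  /* prints 0.000020 */
    return 0;
}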

Slide 18

Slide 18 text

Prober - Scenario Configuration > Configure multiple “Scenarios” • Scenario = client impl x client config x throughput/latency config • Measures consistent metrics for different setups / different requirements > Add 15 lines of YAML, get all measurements

Slide 19

Slide 19 text

SLI Dashboard

Slide 20

Slide 20 text

Troubleshooting

Slide 21

Slide 21 text

Despite Our Engineering Effort… > Trouble happens • Again, large scale + high service-level requirements = trouble factory > What makes the real difference is the attitude toward troubles that are difficult to solve • Try to dig out the essential root cause? • Or just work around it and forget?

Slide 22

Slide 22 text

Fundamental Solution Making > Our team’s philosophy is to always pursue the “Fundamental Solution” • E.g., given a GC problem, • Try changing JVM options around and see what happens? • Or do a full inspection to find out why GC is taking so long? > Fundamental solution making is better than a workaround • It accumulates deep knowledge inside us • It is applicable to similar but different issues in the same domain • It leaves permanent improvements

Slide 23

Slide 23 text

Troubleshooting - an Example

Slide 24

Slide 24 text

Symptom > API response time degradation at the 99th percentile • Violates the SLO! > Broker drops out of the cluster due to ZooKeeper session timeout • If it happens to 2 brokers, OUTAGE!

Slide 25

Slide 25 text

Observation - Thread Utilization Increase > The graph shows the threads' "idle ratio", so utilization is the inverse > Increased equally for all threads • Request handler threads • Network threads

Slide 26

Slide 26 text

Observation - JVM Pause Time Increase

2019-05-13T01:42:44.936+0900: 1423318.149: Total time for which application threads were stopped: 1.1904424 seconds, Stopping threads took: 1.1883390 seconds

Slide 27

Slide 27 text

Study - “Stopping Threads” Rarely Increases http://psy-lob-saw.blogspot.com/2015/12/safepoints.html > "safepoint": moments when the JVM stops all application threads to safely execute unsafe operations (GC, jstack, etc.) > "safepoint sync": moments when the JVM is waiting for all threads to voluntarily enter the pause • => This is the part that is increasing

Slide 28

Slide 28 text

Observation - Disk Access Time Increase > At the same time we saw the symptom

Slide 29

Slide 29 text

Guess What’s Going On > Long-running safepoint sync > + Slow response time from disks > = The JVM performs slow IO inside the safepoint sync/operation ...? > In fact, the JVM writes a log line on every safepoint occurrence

Slide 30

Slide 30 text

Study - Does the write(2) System Call Block? > Not in most cases > write(2) just sets the page's "dirty" flag > The kernel's background task periodically "writes back" dirty pages to block devices > In that case, calling write(2) is just about copying memory [Diagram: the write(2) system call marks pages dirty; the writeback task later flushes them to disk]

Slide 31

Slide 31 text

Measure write(2) System Call’s Duration > Using a SystemTap script *1 > SystemTap: dynamic tracing tool • Enables inspecting events inside the Linux kernel • Provides APIs for scripting

probe kernel.function("sys_write") { time_entry[tid()] = gettimeofday_us() ...
probe kernel.function("sys_write").return { duration = gettimeofday_us() - time_entry[tid()] ...

*1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-slow-syswrite-stp

Slide 32

Slide 32 text

Measure write(2) System Call’s Duration Sun May 19 00:32:03 2019,724 SYSCALL write from PID/TID 136422/136489 for FILE (snip)/kafka/kafkaServer-gc_2019-04-26_14-20-46.log.1.current took 2843 ms Returning from: 0xffffffffxxxxxxxx : sys_write+0x0/0xb0 [kernel] Returning to : 0xffffffffxxxxxxxx : tracesys+0xd9/0xde [kernel] 0x331fcdb7ed : __write_nocancel+0x24/0x57 [/lib64/libc-2.12.so] 0x331fc71943 : _IO_file_write@@GLIBC_2.2.5+0x43/0xa0 [/lib64/libc-2.12.so] 0x331fc72ef5 : _IO_do_write@@GLIBC_2.2.5+0x95/0x12a [/lib64/libc-2.12.so] 0x331fc720f8 : _IO_file_sync@@GLIBC_2.2.5+0xb8/0xc0 [/lib64/libc-2.12.so] 0x331fc6680a : _IO_fflush+0x6a/0xf7 [/lib64/libc-2.12.so] 0x7f6848c4fee8 : _ZN18ReferenceProcessor29process_discovered_referencesEP17BoolObjectClosure P10OopClosureP11VoidClosureP27AbstractRefProcTaskExecutorP7GCTimer4GCId+0x88/0x2c0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f684886443d : _ZN15G1CollectedHeap29process_discovered_referencesEj+0x28d/0x750 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] (snip) 0x7f6848d835be : _ZN8VMThread4loopEv+0x1ce/0x5d0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848d83a30 : _ZN8VMThread3runEv+0x70/0x1b0 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x7f6848bdec68 : _ZL10java_startP6Thread+0x108/0x210 [/usr/jdk1.8.0_66/jre/lib/amd64/server/libjvm.so] 0x3320007aa1 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x331fce8bcd : __clone+0x6d/0x90 [/lib64/libc-2.12.so]

Slide 33

Slide 33 text

So the Premise Was Broken > Even a write(2) system call for appending file content can block in some situations > So, are we ready to start considering a solution?

Slide 34

Slide 34 text

Fundamental Solution Born From Fundamental Understanding > We don't yet know why write(2) can block occasionally > Without knowing what's going on underneath, we might take the wrong option > => Try stack sampling when experiencing a slow write(2)

Slide 35

Slide 35 text

However… > We had to service out the broker from the problematic machine because performance was getting worse daily > The problematic machine's partially-broken HDD was then replaced accidentally during maintenance • I caused a kernel crash => asked for a hard reboot => "We replaced the broken HDD at the same time :)"

Slide 36

Slide 36 text

Lost the Reproduction Environment, but… > Don’t give up! > Two things to reproduce: • The write workload of a Kafka broker • A slowed-down disk device

Slide 37

Slide 37 text

Write Workload by a Kafka Broker > A tiny C program that measures write(2) call duration
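A minimal sketch of such a workload generator (my own illustration, not the actual program from the talk): append small records to a file in a loop, roughly the way the JVM appends GC log lines, and report any write(2) call that exceeds a threshold.

/* Sketch: append to a log file continuously and flag slow write(2) calls.
 * File name, record size, and thresholds are arbitrary choices. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static long elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
}

int main(void)
{
    /* Open in append mode, similar to how the JVM appends its GC log */
    int fd = open("write-workload.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[256];
    memset(buf, 'x', sizeof(buf));

    for (;;) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); break; }
        clock_gettime(CLOCK_MONOTONIC, &end);

        long us = elapsed_us(start, end);
        if (us > 100000)  /* report write(2) calls slower than 100 ms */
            fprintf(stderr, "slow write(2): %ld us\n", us);

        usleep(10000);  /* ~100 writes/sec */
    }
    close(fd);
    return 0;
}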

Slide 38

Slide 38 text

Slowing Down the Disk Device > Disk slowing down != write(2) slowing down • We are investigating why write(2) slows down > Attempted to hook `generic_make_request` with SystemTap and inject a delay to simulate a slow disk • but failed... SystemTap is designed for safe, bounded-overhead execution

$ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 …
ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10
Pass 5: run completed in 0usr/30sys/914real ms.
Pass 5: run failed. [man error::pass5]

Slide 39

Slide 39 text

Use kprobe by Writing a Kernel Module > kprobe: a Linux kernel API that provides the capability to hook an arbitrary function's entry/return and insert code (SystemTap uses it as its backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
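A minimal sketch of such a delay-injecting module (assumptions: it hooks generic_make_request and busy-waits in the kprobe pre-handler; this is an illustration, not the linked gist):

/* Sketch of a kprobe-based kernel module that injects artificial delay
 * into block I/O submission to emulate a slow disk. */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/delay.h>

static struct kprobe kp = {
    .symbol_name = "generic_make_request",
};

/* Runs on entry to generic_make_request(); busy-wait to simulate slowness. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
    mdelay(200);
    return 0;
}

static int __init kprobe_delay_init(void)
{
    kp.pre_handler = handler_pre;
    return register_kprobe(&kp);
}

static void __exit kprobe_delay_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kprobe_delay_init);
module_exit(kprobe_delay_exit);
MODULE_LICENSE("GPL");

Built as an out-of-tree module and loaded with insmod, this makes every block I/O submission appear slow, which should be enough to reproduce the blocked jbd2 journaling task even on a healthy disk.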

Slide 40

Slide 40 text

Successfully Reproduced! > Found that the `[jbd2/sda3-8]` kernel task (ext4's journaling daemon) is blocked > but couldn't explain why it affects `write(2)`'s processing time…

Pid: 4390, comm: jbd2/sda3-8 Not tainted ...
Call Trace:
…
[] ? generic_make_request+0x1/0x5a0 <>
[] ? submit_bio+0x70/0x120
[] ? submit_bh+0x11d/0x1f0
[] ? jbd2_journal_commit_transaction+0x598/0x14f0 [jbd2]
…
[] ? kjournald2+0x0/0x220 [jbd2]
[] ? kthread+0x9e/0xc0

Slide 41

Slide 41 text

Approaching From Another Side > Break down the source of latency in write(2)'s call graph using another SystemTap script *1 • Call graph: vfs_write(), do_sync_write(), ext4_file_write(), rw_verify_area(), …

probe kernel.function("vfs_write") { time_entry[tid()] = gettimeofday_us() ...
probe kernel.function("vfs_write").return { duration = gettimeofday_us() - time_entry[tid()] ...

*1 https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-drilldown-latency-py

Slide 42

Slide 42 text

Adding Suspicious Parts One by One > … 14 attempts in total.

Slide 43

Slide 43 text

Found It’s Waiting on a Lock

sys_write took 3890322 us in total: find_get_page=1, __lock_page=3890280, find_lock_page=3890290, grab_cache_page_write_begin=3890294, vfs_write=3890317

Slide 44

Slide 44 text

Analyze Lock Contention > Hooking lock acquisition/release points: `__lock_page` and `unlock_page` > Figured out that it's a cascading issue [Diagram: the application calling write(2) waits to lock a page; the writeback task holding it waits for the buffer to become writable; that in turn is blocked behind ext4's journaling daemon]

Slide 45

Slide 45 text

Issue Overview

Slide 46

Slide 46 text

Overview - GC Logging Inside Safepoint

Slide 47

Slide 47 text

Overview - Writeback Task Blocked by Journaling

Slide 48

Slide 48 text

Overview - Journaling Task Blocked by Slow Response From Disk

Slide 49

Slide 49 text

Now Ready To Think About a Solution > Use a filesystem other than ext4, or disable journaling > Use tmpfs as the first-level storage for the GC log, with a real-time persistent backup > => Both could solve the problem

Slide 50

Slide 50 text

Summary > Our definition of Reliability Engineering covers a wide range of responsibilities > SLI measurement • Measure SLIs from the client's (user's) viewpoint in distributed systems > Troubleshooting / Optimization • Fundamental solutions come from fundamental problem understanding • Accumulating knowledge • KAFKA-4614, KAFKA-7504 => IO in JVM pause, read/write blocking

Slide 51

Slide 51 text

End of Presentation