Part > Lead of Kafka Platform team > ❤ Performance Engineering > Apache Kafka Contributor > Past speaking • Kafka Summit 2017/2018 • LINE DEVELOPER DAY 2018
consistency • Free us from bothering work to concentrate on advanced works > Architecture Proposal • Provide library • Demonstrate Architecture for higher system availability/performance > Troubleshooting and Preventive Engineering
rate • For availability SLI To Measure SLO > SLI = Service Level Indicator • A measurable value which represents the service performance directly > Normally measured in server (broker) side
Distributed Systems > Message Delivery Latency through Kafka = • Response time for Producer request • + Time taken for fail-over, retries, metadata update • + Response time for Consumer request • + Time taken for Consumer group coordination > Client library works cooperatively to server clusters for higher availability • retry • failover
Business logic affects measurements • We (platform provider) must define the right model (config, usage) of “clients” Measure SLI From Client’s Viewpoint > Measure performance of entire system/flow • What clients actually see after all attempts inside client library
> A dedicated Kafka clients to measure performance of entire stack > Run Distributed with multiple instances to • Simulate realistic usage of Kafka clients • Latency by fail-over • Distributed partitions assignment • To not lost metrics during restart (rolling restart)
= Client Impl x Client Config x Throughput/Latency Config • Measure Consistent Metrics for different setup / different requirement > Add 15 lines of YAML, get all measurements
large scale + high service level requirement = Trouble Factory > What makes the last difference is attitude against troubles difficult to solve • Try to clear out essential root cause? • Or just workaround and forget?
persuade “Fundamental Solution” • E.g, given GC program, • Try changing JVM options around and see what happens? • Or do full-inspection to clear out why GC’s taking so long? > Fundamental Solution Making is better than workaround • Accumulates deep knowledge inside us • Applicable for similar but different issues in the same domain • Leaves improvements permanently
Moments in JVM that it stops all app threads to safely execute unsafe ops (GC, jstack, etc…) > "safepoint sync" : Moments that JVM is waiting all threads to voluntarily enter pause • => This is the increasing part
the most of the cases > write(2) just enables "dirty flag" of the page > Kernel's background task periodically "writeback" dirty pages into block devices > In case calling write(2) is just about copying memory write(2) System Call Page Page Page Writeback Task Disk
> Without knowing what's going on underneath, we might take wrong option > => Try stack sampling when experiencing slow write(2) Fundamental Solution Born From Fundamental Understanding
problematic machine because performance was getting worse daily > The problematic machine's partially-broken HDD has been replaced accidentally by maintenance • I made a kernel crash => Ask hard reboot => "Replaced broken HDD at that same time :)" #
slowing down • We're now investigating why write(2) slows down > Attempted to hook `generic_make_request` and inject delay to simulate slow disk with SystemTap • but failed... well SystemTap is designed for safe-overhead execution $ stap ... -e 'probe ioblock.request { for (i = 0; i < 2147483647; i++) { for (j = 0; j < 2147483647; j++) { pl |= 1 ^ 255 } } }' -D STP_NO_OVERLOAD -D MAXACTION=-1 … ERROR: MAXACTION exceeded near identifier 'target' at /usr/share/systemtap/tapset/linux/context.stp:380:10 Pass 5: run completed in 0usr/30sys/914real ms. Pass 5: run failed. [man error::pass5]
API that provides capability to hook arbitrary function's enter/return and insert code (SystemTap uses it as the backend) * https://gist.github.com/kawamuray/a27aaa2d0668d0050537651978735321#file-kprobe_delay-c
of responsibility > SLI measurement • Client’s (user’s) viewpoint for SLI measurement in distributed systems > Troubleshooting / Optimization • Fundamental solution from Fundamental problem understanding • Accumulating knowledge • KAFKA-4614, KAFKA-7504 => IO in JVM pause, read/write block