
Monitorama PDX 2017 - Ian Bennett

Monitorama

May 24, 2017

Transcript

  1. Critical to Calm Debugging Distributed Systems @enbnt #talk-ian-bennett

  2. Case Study – Diving Instances: A Simple Fix

  3. Diving Instances
     • Stateless service run in Aurora
     • Dashboard says…
       • Host uptime avg < 1 hour
       • High GC
       • Aurora is failing healthchecks due to long GC pauses
     • Do some profiling
       • Attach a profiler
       • Look at CPU samples
     • It says…

  4. Diving Hosts – CPU Sampled Stacks
     This is determining the hash bucket for the Scala HashTrieMap key.

  5. Diving Hosts – Fix
     • The Map’s key type is a case class
     • Scala case classes auto-generate hashCode for you
     • The key’s fields contained collections
     • The auto-generated hashCode hashed over every element in every collection
     • Solution: provide a custom hashCode method (sketched below)
     • Measure & validate -> uptime improved to hours/days

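     A minimal Scala sketch of the fix, assuming a key along these lines (the
     field names are illustrative, not from the talk):

       // The auto-generated case-class hashCode re-hashes every element of
       // `tags` on every map operation. If the key is immutable, compute the
       // hash once from a cheap, stable field; equals stays auto-generated,
       // and equal keys still produce equal hash codes.
       case class MetricKey(name: String, tags: Seq[String]) {
         override val hashCode: Int = name.hashCode
       }

     Caching the hash as a val also means repeated HashTrieMap lookups pay the
     cost once per key rather than once per operation.
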
  6. Diving Hosts – Uptime: Fix Applied

  7. Methodology – How we approach problems
     Measure -> Analyze -> Adjust -> Verify

  8. Peeling the Onion
     Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom Instrumentation or Code Change
     (start outside, work in)

  9. When to Peel
     (Layers: Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom
     Instrumentation or Code Change; cycle: Measure -> Analyze -> Adjust -> Verify)
     • Make a single change at your current layer, then repeat the process
     • Avoid the urge to make too many changes – the +/- impact of each change needs to be measured!
     • Verify you haven’t changed behavior
     • Remember to re-rank and prioritize
       • You may uncover a hidden bottleneck
       • You may not have enough information to make the next change
     • If you can’t move the needle – peel

  10. Profiling! – What is it good for?
     Performance: IO, Memory, CPU

  11. Profiling! – But Distributed!?

  12. Profiling! – But Distributed!? – Probably This

  13. Common Issues
     • Logging
     • Auto-boxing (example below)
     • Caching/Memoization
     • Concurrency issues
     • Expensive RPC

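     Auto-boxing, for example, is easy to hit without noticing; a small
     illustrative Scala snippet:

       // Each Int stored in a generic (object-based) collection is boxed to
       // java.lang.Integer – one small allocation per element.
       val boxed = new java.util.ArrayList[Integer]()
       (0 until 1000000).foreach(i => boxed.add(i)) // ~1M boxes for the GC

       // A primitive array holds the values unboxed.
       val primitive: Array[Int] = Array.tabulate(1000000)(identity)
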
  14. Performance Fix – Pro Tip
     • Keep your code abstraction sane
     • Performance fixes should be as isolated as possible
     • Don’t expose these pain points to your consumers
     • WHY?
       • Your bitmap-xor, bit-shifting, super-efficient, String-de-duplicating, object-pooling concurrent Set is awesome
       • But your consumers shouldn’t need to know the internals of your algorithm
       • Exposed, this code becomes very brittle and harder to work around
     • Your micro-optimization might become obsolete
       • A new library implementation could remove the bottleneck
       • Compiler improvements could make your solution second best
       • Usage patterns of the application can change
       • Uncaught bugs

  15. Fire – Critical Issues, Pager Angry!
     (Same layers: Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom Instrumentation or Code Change)
     • Stick to the methodology
     • Don’t panic
     • Don’t create more problems by skipping steps or guessing
       • Your gut can/will be wrong
       • You will be tired
       • You will be frustrated
     • You may need to spend days iterating on changes to remove bottlenecks
     • Let the methodology guide you to a solution & reduce stress

  16. Case Study – Heavy GC: When you can’t tune your way out

  17. Heavy GC – String Formatting
     • Stateless service run on bare metal
     • High CPU, expensive GC, low throughput
     • Attached a profiler to the service with object-allocation recording enabled
     • String.format calls were causing excessive allocations
     • Let’s take a look at what String.format does under the covers…

  18. Heavy GC – String Formatting
     String.format is used for metric IDs: the fully qualified metric ID is
     generated on every call.

  19. Heavy GC – String Formatting (Bad?)

  20. Heavy GC – String Formatting (fix)
     Cache the most frequently used metrics so the fully qualified metric ID
     isn’t rebuilt every time a value is added to a metric (sketched below).

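     A hedged sketch of the idea; the names and ID layout are hypothetical,
     since the transcript doesn’t show the real code:

       import scala.collection.concurrent.TrieMap

       object MetricIds {
         private val cache = TrieMap.empty[(String, String), String]

         // String.format allocates a Formatter, a StringBuilder, and parsed
         // format tokens on every call. Memoizing the fully qualified ID
         // pays that cost once per metric, not once per value added.
         def fullyQualified(scope: String, name: String): String =
           cache.getOrElseUpdate((scope, name),
             String.format("%s/%s", scope, name))
       }
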
  21. Heavy GC – But Perf Still Sad :’(
     • Introducing the String-formatting fix helped GC
     • But our performance issue wasn’t resolved
     • Another bottleneck had been lurking behind String.format

  22. Heavy GC – Object Allocations: Eagerly initializing many collections

  23. Heavy GC – Second Fix
     • Analyzing the code found:
       • Many sub-collections were being eagerly created
       • Only a small fraction of these collections were used per request cycle
     • Solution – lazy initialization (sketched below)
       • Use a static, immutable, shared empty collection
       • Only on the first mutation is a real collection allocated
     • Applying these two fixes increased capacity for our distributed key/value service (Manhattan)

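     A minimal sketch of the pattern; the class and field names are made up
     for illustration:

       class RequestScope {
         // Map.empty is a shared immutable singleton, so an unused
         // sub-collection costs no allocation at all.
         private var tags: Map[String, String] = Map.empty

         // Only the first mutation allocates a real map.
         def addTag(key: String, value: String): Unit =
           tags = tags.updated(key, value)
       }
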
  24. Case Study – Logging: Know your logger!

  25. Logging: Can anyone tell me what’s happening here?

  26. Logging: Now what’s happening?

  27. Logging
     • Know your logging framework
     • Correctly scope your logging levels
     • Make sure debug/trace aren’t impacting load when disabled (example below)

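     For instance, with SLF4J (an assumption – the transcript doesn’t name the
     framework), a disabled DEBUG level can still burn CPU if messages are
     built eagerly:

       import org.slf4j.LoggerFactory

       object Handler {
         private val log = LoggerFactory.getLogger(getClass)

         def handle(request: String, payload: Array[Byte]): Unit = {
           // Bad: the string is concatenated even when DEBUG is disabled.
           log.debug("handling " + request + ": " + payload.mkString(","))

           // Better: a parameterized message defers formatting, but argument
           // expressions are still evaluated, so guard anything expensive.
           if (log.isDebugEnabled)
             log.debug("handling {}: {}", request, payload.mkString(","))
         }
       }
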
  28. Case Study – Craziness: When profiling isn’t enough

  29. Deep Dive – Craziness…
     • Nuthatch is our temporal indexing service
     • The logging fix we just went over was also part of this deep dive
     • Success rate was plummeting below 80% between 9 AM and 5 PM
     • Except during lunch hour!

  30. Deep Dive – SR Craziness

  32. Deep Dive – Use Your Toolbox
     • Dashboards
       • GC spirals out of control
       • System memory usage nearing 100%
       • Can’t increase heap; GC too intense for simple tuning
       • Narrowed down the specific hosts with the worst-case issues
     • Pulled logs – not enough information
     • Attached a profiler
       • Pinpointed hot code (CPU profiling, allocation profiling)
       • Numerous fixes in canary; improvement, but not enough

  33. Deep Dive – Engineers Required
     • Need more context and more logging
       • Zipkin and LogLens don’t have enough info
       • Logging needs to be bolstered
     • Finagle is awesome – here’s why:
       • Your service is a function
       • Filters are easy to create
     • Create a low-overhead filter to track the Top N worst offenders (sketched below)
       • Keep request/response context
       • Open up an HTTP endpoint to sample stats from this filter
     • Culprit found!

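     Finagle’s Service and SimpleFilter types are real; the “worst offenders”
     bookkeeping below is an illustrative sketch, not the actual filter:

       import com.twitter.finagle.{Service, SimpleFilter}
       import com.twitter.util.{Future, Stopwatch}

       class WorstOffendersFilter[Req, Rep](topN: Int)
           extends SimpleFilter[Req, Rep] {

         // Slowest requests seen so far, as (latency in microseconds, request).
         @volatile private var worst: List[(Long, Req)] = Nil

         def apply(request: Req, service: Service[Req, Rep]): Future[Rep] = {
           val elapsed = Stopwatch.start()
           service(request).ensure {
             val micros = elapsed().inMicroseconds
             synchronized {
               worst = ((micros, request) :: worst).sortBy(-_._1).take(topN)
             }
           }
         }

         // Snapshot for an HTTP debug endpoint to serve.
         def snapshot: Seq[(Long, Req)] = worst
       }

     Because a Finagle service is just a function, composing this in is one
     line: worstOffenders.andThen(service).
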
  34. Deep Dive – SR Craziness: Fix Applied!

  35. Deep Dive – Failures: Fix Applied! 800k failures per hour!

  36. Deep Dive – What Happened?
     • A high-usage service had a growing number of instances, and of unique metrics generated, over time
     • The temporal index lookup was VERY expensive
     • For a SINGLE dashboard open on a developer’s desktop!
     • It impacted SR across all of our alerting, from one user
     • Updated the user’s dashboard; impact mitigated

  37. Deep Dive – What Happened? There’s MORE
     • This isn’t an acceptable solution
     • Something else is wrong here – we have caching in place
     • The code was reviewed and tested, and had improved things in the past
     • This shouldn’t have been a big issue

  38. Deep Dive – In-Memory Cache: What’s off here?

  39. Deep Dive – In-Memory Cache: What’s off here?
     • Puts don’t block other puts
     • The calculation is expensive
     • The GC spiral continues if the cache doesn’t fill in time

  40. Bad Cache Behavior
     Every concurrent get(a) kicks off its own expensive “processing…” pass;
     the first cache retrieval and the first insert into the cache only happen
     after all that redundant work completes.

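     An illustrative reconstruction of the broken pattern (not the actual
     code): a check-then-act cache where concurrent misses all recompute and
     nothing blocks the other puts:

       import java.util.concurrent.ConcurrentHashMap

       class RacyCache[K, V <: AnyRef](compute: K => V) {
         private val cache = new ConcurrentHashMap[K, V]()

         def get(key: K): V = {
           val cached = cache.get(key)
           if (cached != null) cached
           else {
             val value = compute(key) // expensive; runs once per concurrent caller
             cache.put(key, value)    // doesn't coordinate with other callers
             value
           }
         }
       }
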
  41. Deep Dive – In-Memory Cache (fixed)
     Always go to the cache; load the new value if it isn’t present, and block
     if another caller is already loading it.

  42. Better Cache Behavior
     All concurrent get(a) calls receive the same Future[Value]; the load runs
     once, and every caller waits on that single future (sketched below).

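     A sketch of the fixed shape matching the Future[Value] idea on the slide
     (the class name and pool choice are illustrative):

       import scala.collection.concurrent.TrieMap
       import com.twitter.util.{Future, FuturePool}

       class SingleFlightCache[K, V](load: K => V) {
         private val cache = TrieMap.empty[K, Future[V]]

         // The first get(key) installs a Future and starts the expensive
         // load; every concurrent get(key) receives that same Future and
         // simply waits, so the computation runs at most once.
         def get(key: K): Future[V] =
           cache.getOrElseUpdate(key, FuturePool.unboundedPool(load(key)))
       }
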
  43. Deep Dive – System Memory % Used
     Fix rolled out! ~30% more headroom