
Monitorama PDX 2017 - Ian Bennett

Monitorama

May 24, 2017

Transcript

  1. Critical to Calm Debugging Distributed Systems @enbnt #talk-ian-bennett

  2. Case Study – Diving Instances: A Simple Fix

  3. Diving Instances
     • Stateless service run in Aurora
     • Dashboard says…
       • Host uptime avg < 1 hour
       • High GC
       • Aurora is failing healthchecks due to long GC pauses
     • Do some profiling
       • Attach a profiler
       • Look at CPU samples
     • It says…

  4. Diving Hosts – CPU Sampled Stacks
     This is determining the hash bucket for the Scala HashTrieMap key.

  5. Diving Hosts – Fix
     • The Map’s key type is a case class
     • Scala case classes auto-generate hashCode for you
     • The key’s fields contained collections
     • The auto-generated hashCode hashed over every element in every collection
     • Solution: provide a custom hashCode method (sketched below)
     • Measure & validate -> uptime improved to hours/days

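     A minimal Scala sketch of the fix, assuming a key along these lines (the
     field names are illustrative, not from the talk):

       // The auto-generated case-class hashCode re-hashes every element of
       // `tags` on every map operation. If the key is immutable, compute the
       // hash once from a cheap, stable field; equals stays auto-generated,
       // and equal keys still produce equal hash codes.
       case class MetricKey(name: String, tags: Seq[String]) {
         override val hashCode: Int = name.hashCode
       }

     Caching the hash as a val also means repeated HashTrieMap lookups pay the
     cost once per key rather than once per operation.
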
  6. Diving Hosts – Uptime: Fix Applied

  7. Methodology – How we approach problems
     Measure -> Analyze -> Adjust -> Verify

  8. Peeling the Onion
     Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom Instrumentation or Code Change
     (start outside, work in)

  9. When to Peel
     (Layers: Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom
     Instrumentation or Code Change; cycle: Measure -> Analyze -> Adjust -> Verify)
     • Make a single change at your current layer, then repeat the process
     • Avoid the urge to make too many changes – the +/- impact of each change needs to be measured!
     • Verify you haven’t changed behavior
     • Remember to re-rank and prioritize
       • You may uncover a hidden bottleneck
       • You may not have enough information to make the next change
     • If you can’t move the needle – peel

  10. Profiling! – What is it good for?
     Performance: IO, Memory, CPU

  11. Profiling! – But Distributed!?

  12. Profiling! – But Distributed!? – Probably This

  13. Common Issues
     • Logging
     • Auto-boxing (example below)
     • Caching/Memoization
     • Concurrency issues
     • Expensive RPC

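     Auto-boxing, for example, is easy to hit without noticing; a small
     illustrative Scala snippet:

       // Each Int stored in a generic (object-based) collection is boxed to
       // java.lang.Integer – one small allocation per element.
       val boxed = new java.util.ArrayList[Integer]()
       (0 until 1000000).foreach(i => boxed.add(i)) // ~1M boxes for the GC

       // A primitive array holds the values unboxed.
       val primitive: Array[Int] = Array.tabulate(1000000)(identity)
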
  14. Performance Fix – Pro Tip
     • Keep your code abstraction sane
     • Performance fixes should be as isolated as possible
     • Don’t expose these pain points to your consumers
     • WHY?
       • Your bitmap-xor, bit-shifting, super-efficient, String-de-duplicating, object-pooling concurrent Set is awesome
       • But your consumers shouldn’t need to know the internals of your algorithm
       • Exposed, this code becomes very brittle and harder to work around
     • Your micro-optimization might become obsolete
       • A new library implementation could remove the bottleneck
       • Compiler improvements could make your solution second best
       • Usage patterns of the application can change
       • Uncaught bugs

  15. Fire – Critical Issues, Pager Angry!
     (Same layers: Metrics -> Tuning -> Tracing/Logs -> Profiling -> Custom Instrumentation or Code Change)
     • Stick to the methodology
     • Don’t panic
     • Don’t create more problems by skipping steps or guessing
       • Your gut can/will be wrong
       • You will be tired
       • You will be frustrated
     • You may need to spend days iterating on changes to remove bottlenecks
     • Let the methodology guide you to a solution & reduce stress

  16. Case Study – Heavy GC: When you can’t tune your way out

  17. Heavy GC – String Formatting
     • Stateless service run on bare metal
     • High CPU, expensive GC, low throughput
     • Attached a profiler to the service with object-allocation recording enabled
     • String.format calls were causing excessive allocations
     • Let’s take a look at what String.format does under the covers…

  18. Heavy GC – String Formatting
     String.format is used for metric IDs: the fully qualified metric ID is
     generated on every call.

  19. Heavy GC – String Formatting (Bad?)

  20. Heavy GC – String Formatting (fix)
     Cache the most frequently used metrics so the fully qualified metric ID
     isn’t rebuilt every time a value is added to a metric (sketched below).

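     A hedged sketch of the idea; the names and ID layout are hypothetical,
     since the transcript doesn’t show the real code:

       import scala.collection.concurrent.TrieMap

       object MetricIds {
         private val cache = TrieMap.empty[(String, String), String]

         // String.format allocates a Formatter, a StringBuilder, and parsed
         // format tokens on every call. Memoizing the fully qualified ID
         // pays that cost once per metric, not once per value added.
         def fullyQualified(scope: String, name: String): String =
           cache.getOrElseUpdate((scope, name),
             String.format("%s/%s", scope, name))
       }
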
  21. Heavy GC – But Perf Still Sad :’(
     • Introducing the String-formatting fix helped GC
     • But our performance issue wasn’t resolved
     • Another bottleneck had been lurking behind String.format

  22. Heavy GC – Object Allocations: Eagerly initializing many collections

  23. Heavy GC – Second Fix
     • Analyzing the code found:
       • Many sub-collections were being eagerly created
       • Only a small fraction of these collections were used per request cycle
     • Solution – lazy initialization (sketched below)
       • Use a static, immutable, shared empty collection
       • Only on the first mutation is a real collection allocated
     • Applying these two fixes increased capacity for our distributed key/value service (Manhattan)

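     A minimal sketch of the pattern; the class and field names are made up
     for illustration:

       class RequestScope {
         // Map.empty is a shared immutable singleton, so an unused
         // sub-collection costs no allocation at all.
         private var tags: Map[String, String] = Map.empty

         // Only the first mutation allocates a real map.
         def addTag(key: String, value: String): Unit =
           tags = tags.updated(key, value)
       }
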
  24. Case Study – Logging: Know your logger!

  25. Logging: Can anyone tell me what’s happening here?

  26. Logging: Now what’s happening?

  27. Logging
     • Know your logging framework
     • Correctly scope your logging levels
     • Make sure debug/trace aren’t impacting load when disabled (example below)

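     For instance, with SLF4J (an assumption – the transcript doesn’t name the
     framework), a disabled DEBUG level can still burn CPU if messages are
     built eagerly:

       import org.slf4j.LoggerFactory

       object Handler {
         private val log = LoggerFactory.getLogger(getClass)

         def handle(request: String, payload: Array[Byte]): Unit = {
           // Bad: the string is concatenated even when DEBUG is disabled.
           log.debug("handling " + request + ": " + payload.mkString(","))

           // Better: a parameterized message defers formatting, but argument
           // expressions are still evaluated, so guard anything expensive.
           if (log.isDebugEnabled)
             log.debug("handling {}: {}", request, payload.mkString(","))
         }
       }
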
  28. Case Study – Craziness: When profiling isn’t enough

  29. Deep Dive – Craziness…
     • Nuthatch is our temporal indexing service
     • The logging fix we just went over was also part of this deep dive
     • Success rate was plummeting below 80% between 9 AM and 5 PM
     • Except during lunch hour!

  30. Deep Dive – SR Craziness

  32. Deep Dive – Use Your Toolbox
     • Dashboards
       • GC spirals out of control
       • System memory usage nearing 100%
       • Can’t increase heap; GC too intense for simple tuning
       • Narrowed down the specific hosts with the worst-case issues
     • Pulled logs – not enough information
     • Attached a profiler
       • Pinpointed hot code (CPU profiling, allocation profiling)
       • Numerous fixes in canary; improvement, but not enough

  33. Deep Dive – Engineers Required
     • Need more context and more logging
       • Zipkin and LogLens don’t have enough info
       • Logging needs to be bolstered
     • Finagle is awesome – here’s why:
       • Your service is a function
       • Filters are easy to create
     • Create a low-overhead filter to track the Top N worst offenders (sketched below)
       • Keep request/response context
       • Open up an HTTP endpoint to sample stats from this filter
     • Culprit found!

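     Finagle’s Service and SimpleFilter types are real; the “worst offenders”
     bookkeeping below is an illustrative sketch, not the actual filter:

       import com.twitter.finagle.{Service, SimpleFilter}
       import com.twitter.util.{Future, Stopwatch}

       class WorstOffendersFilter[Req, Rep](topN: Int)
           extends SimpleFilter[Req, Rep] {

         // Slowest requests seen so far, as (latency in microseconds, request).
         @volatile private var worst: List[(Long, Req)] = Nil

         def apply(request: Req, service: Service[Req, Rep]): Future[Rep] = {
           val elapsed = Stopwatch.start()
           service(request).ensure {
             val micros = elapsed().inMicroseconds
             synchronized {
               worst = ((micros, request) :: worst).sortBy(-_._1).take(topN)
             }
           }
         }

         // Snapshot for an HTTP debug endpoint to serve.
         def snapshot: Seq[(Long, Req)] = worst
       }

     Because a Finagle service is just a function, composing this in is one
     line: worstOffenders.andThen(service).
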
  34. Deep Dive – SR Craziness: Fix Applied!

  35. Deep Dive – Failures: Fix Applied! 800k failures per hour!

  36. Deep Dive – What Happened?
     • A high-usage service had a growing number of instances, and of unique metrics generated, over time
     • The temporal index lookup was VERY expensive
     • For a SINGLE dashboard open on a developer’s desktop!
     • It impacted SR across all of our alerting, from one user
     • Updated the user’s dashboard; impact mitigated

  37. Deep Dive – What Happened? There’s MORE
     • This isn’t an acceptable solution
     • Something else is wrong here – we have caching in place
     • The code was reviewed and tested, and had improved things in the past
     • This shouldn’t have been a big issue

  38. Deep Dive – In-Memory Cache: What’s off here?

  39. Deep Dive – In-Memory Cache: What’s off here?
     • Puts don’t block other puts
     • The calculation is expensive
     • The GC spiral continues if the cache doesn’t fill in time

  40. Bad Cache Behavior
     Every concurrent get(a) kicks off its own expensive “processing…” pass;
     the first cache retrieval and the first insert into the cache only happen
     after all that redundant work completes.

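     An illustrative reconstruction of the broken pattern (not the actual
     code): a check-then-act cache where concurrent misses all recompute and
     nothing blocks the other puts:

       import java.util.concurrent.ConcurrentHashMap

       class RacyCache[K, V <: AnyRef](compute: K => V) {
         private val cache = new ConcurrentHashMap[K, V]()

         def get(key: K): V = {
           val cached = cache.get(key)
           if (cached != null) cached
           else {
             val value = compute(key) // expensive; runs once per concurrent caller
             cache.put(key, value)    // doesn't coordinate with other callers
             value
           }
         }
       }
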
  41. Deep Dive – In-Memory Cache (fixed)
     Always go to the cache; load the new value if it isn’t present, and block
     if another caller is already loading it.

  42. Better Cache Behavior
     All concurrent get(a) calls receive the same Future[Value]; the load runs
     once, and every caller waits on that single future (sketched below).

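     A sketch of the fixed shape matching the Future[Value] idea on the slide
     (the class name and pool choice are illustrative):

       import scala.collection.concurrent.TrieMap
       import com.twitter.util.{Future, FuturePool}

       class SingleFlightCache[K, V](load: K => V) {
         private val cache = TrieMap.empty[K, Future[V]]

         // The first get(key) installs a Future and starts the expensive
         // load; every concurrent get(key) receives that same Future and
         // simply waits, so the computation runs at most once.
         def get(key: K): Future[V] =
           cache.getOrElseUpdate(key, FuturePool.unboundedPool(load(key)))
       }
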
  43. Deep Dive – System Memory % Used
     Fix rolled out! ~30% more headroom