Monitorama PDX 2017 - Ian Bennett

May 24, 2017

  1. Diving Instances
    • Stateless service run in Aurora
    • Dashboard says…
    • Host uptime avg < 1 hour
    • High GC
    • Aurora is failing health checks due to long GC pauses
    • Do some profiling
    • Attach profiler
    • Look at CPU samples
    • It says…
  2. Diving Hosts – CPU Sampled Stacks
    • This is determining the hash bucket for a Scala HashTrieMap key
  3. Diving Hosts – Fix
    • The Map's key type is a case class
    • A Scala case class auto-generates hashCode for you
    • The key's fields contained collections
    • The auto-generated hashCode hashed over every element in each collection
    • Solution: provide a custom hashCode method
    • Measure & validate: uptime went to hours/days
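
The slides do not show the custom hashCode itself. As a rough sketch of the idea, here is a Java analogue (the service in the talk used a Scala case class): the key is immutable, so its hash can be computed once in the constructor and reused on every map lookup. `RouteKey`, `name`, and `tags` are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical map key whose fields include a collection.
final class RouteKey {
    private final String name;
    private final List<String> tags;
    private final int hash; // computed once, reused on every map lookup

    RouteKey(String name, List<String> tags) {
        this.name = name;
        this.tags = List.copyOf(tags); // immutable copy, so the cached hash stays valid
        this.hash = 31 * name.hashCode() + this.tags.hashCode();
    }

    @Override
    public int hashCode() {
        return hash; // O(1) instead of walking the collection on each call
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RouteKey)) return false;
        RouteKey other = (RouteKey) o;
        // Cheap hash comparison first, full field comparison only if it matches.
        return hash == other.hash && name.equals(other.name) && tags.equals(other.tags);
    }
}
```

Caching only works because the fields cannot change after construction; a key with mutable collections would need a different approach.
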
  4. When to Peel
    (Diagram labels: Metrics, Tuning, Tracing/Logs, Profiling, Custom Instrumentation or Code Change; Measure, Analyze, Adjust, Verify)
    • Make a single change at your current layer
    • Repeat the process
    • Avoid the urge to make too many changes
    • +/- impact needs to be measured!
    • Verify you haven’t changed behavior
    • Remember to re-rank and prioritize
    • You may uncover a hidden bottleneck
    • You may not have enough information to make the next change
    • If you can’t move the needle, peel
  5. Performance Fix – Pro Tip
    • Keep your code abstraction sane
    • Performance fixes should be as isolated as possible
    • Don’t expose these pain points to your consumers
    • WHY?
    • Your bitmap xor bit shifting super efficient String de-duplicating object pooling concurrent Set is awesome
    • Your consumers shouldn’t need to know the internals of your algorithm
    • This code becomes very brittle and harder to work around
    • Your micro-optimization might become obsolete
    • New library implementation could remove bottleneck
    • Compiler improvements could make your solution second best
    • Usage patterns of the application can change
    • Uncaught bugs
  6. Fire – Critical Issues, Pager Angry!
    (Diagram labels: Metrics, Tuning, Tracing/Logs, Profiling, Custom Instrumentation or Code Change)
    • Stick to the methodology
    • Don’t panic
    • Don’t create more problems by skipping steps or guessing
    • Your gut can/will be wrong
    • You will be tired
    • You will be frustrated
    • You may need to spend days iterating on changes to remove bottlenecks
    • Let the methodology guide you to a solution & reduce stress
  7. Heavy GC – String Formatting
    • Stateless service run on bare metal
    • High CPU, expensive GC, low throughput
    • Attached profiler to service
    • Object allocation recording enabled
    • String.format calls causing excessive allocations
    • Let’s take a look at what String.format does under the covers…
  8. Heavy GC – String Formatting
    • String format for metric IDs
    • Fully qualified metric ID is generated
  9. Heavy GC – String Formatting (fix)
    • Cache most frequently used metrics to prevent building fully qualified metric ID
    • Add value to metric
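
The code on this slide is not part of the transcript. A minimal sketch of the caching idea, assuming a metric ID layout of `zone/service/metric` (the layout, the class name `MetricIds`, and its fields are all hypothetical): String.format runs only the first time a metric name is seen, and later lookups return the stored String.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical per-service registry; the "zone/service/metric" layout is an assumption.
final class MetricIds {
    private final String zone;
    private final String service;
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    MetricIds(String zone, String service) {
        this.zone = zone;
        this.service = service;
    }

    String fullyQualified(String metric) {
        // Fast path: a plain get, so a cache hit does no formatting work.
        String cached = cache.get(metric);
        if (cached != null) {
            return cached;
        }
        // Slow path: format once and remember the result.
        return cache.computeIfAbsent(metric, m -> String.format("%s/%s/%s", zone, service, m));
    }
}
```

This sketch caches every name it sees and never evicts. The slide speaks of caching the most frequently used metrics, so a real version would likely bound the cache.
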
  10. Heavy GC – But Perf Still Sad :’(
    • Introducing the String formatting fix helped GC
    • But our performance issue wasn’t resolved
    • Another bottleneck was lurking behind String.format
  11. Heavy GC – Second Fix
    • Analyzing the code found
    • Many sub-collections were being eagerly created
    • Only a small fraction of these collections were used per request cycle
    • Solution – Lazy Initialization
    • Use static immutable shared empty collection
    • Only on first mutation will the collection initialize
    • Applying these 2 fixes increased capacity for our distributed Key/Value service (Manhattan)
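
The lazy-initialization fix can be sketched as follows. This is an illustration only: `RequestScratch` and `warnings` are hypothetical names, and the sketch assumes each instance is used by a single thread for one request. Every instance starts out pointing at one shared immutable empty list, and a real list is allocated only on the first write.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical per-request holder; assumed to be confined to one thread.
final class RequestScratch {
    // One shared immutable empty list for all instances.
    private static final List<String> EMPTY = Collections.emptyList();

    private List<String> warnings = EMPTY;

    void addWarning(String warning) {
        if (warnings == EMPTY) {
            warnings = new ArrayList<>(); // allocate only on first mutation
        }
        warnings.add(warning);
    }

    List<String> warnings() {
        // Reads on the untouched path hand back the shared instance, no allocation.
        return warnings == EMPTY ? EMPTY : Collections.unmodifiableList(warnings);
    }

    boolean allocated() {
        return warnings != EMPTY;
    }
}
```

Requests that never add a warning never pay for a list, which matches the observation that only a small fraction of the sub-collections were used per request cycle.
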
  12. Logging
    • Know your logging framework
    • Correctly scope your logging levels
    • Make sure debug/trace aren’t impacting load when disabled
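
One common way a disabled debug statement still costs something is that its message is built before the logger checks the level. The sketch below uses java.util.logging purely as a stand-in (the talk does not name the logging framework in use); `LoggingCost`, `expensiveDump`, and the counter are hypothetical and exist only to make the difference observable.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggingCost {
    static final Logger LOG = Logger.getLogger("loggingcost.demo");
    // Counts how often the expensive message was actually built.
    static final AtomicInteger RENDERED = new AtomicInteger();

    static void configure() {
        LOG.setLevel(Level.INFO); // FINE (debug) is disabled
    }

    static String expensiveDump() {
        RENDERED.incrementAndGet();
        return "state=" + System.nanoTime();
    }

    static void handleEager() {
        // The argument is built before the level check, even though FINE is off.
        LOG.fine("request " + expensiveDump());
    }

    static void handleLazy() {
        // Supplier overload: the message is built only if FINE is enabled.
        LOG.fine(() -> "request " + expensiveDump());
    }
}
```

With FINE disabled, `handleEager()` still calls `expensiveDump()` while `handleLazy()` does not. An explicit `LOG.isLoggable(Level.FINE)` guard achieves the same effect.
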
  13. Deep Dive – Craziness…
    • Nuthatch is our temporal indexing service
    • The logging fix we went over was also part of this deep dive
    • Success rate plummeting below 80% between 9 AM – 5 PM
    • Except during lunch hour!
  14. Deep Dive – Use Your Toolbox
    • Dashboards
    • GC spirals out of control
    • System memory usage nearing 100%
    • Can’t increase heap, GC too intense for simple tuning
    • Narrowed down specific hosts with worst case issues
    • Pulled logs
    • Not enough information
    • Attached profiler
    • Pinpoint hot code (CPU profiling, allocation profiling)
    • Numerous fixes in canary: improvement, but not enough
  15. Deep Dive – Engineers Required
    • Need more context and more logging
    • Zipkin and LogLens don’t have enough info
    • Logging needs to be bolstered
    • Finagle is awesome – here’s why:
    • Your service is a function
    • Filters are easy to create
    • Create a low-overhead filter to track Top N worst offenders
    • Keep request/response context
    • Open up HTTP endpoint to sample stats from this filter
    • Culprit found!
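
The filter from the talk was written against Finagle and is not shown in the transcript. The sketch below is a plain-Java analogue of the idea, not Finagle code: it wraps a service modeled as a `Function`, times each call, and keeps only the N slowest requests in a bounded min-heap. `TopNFilter`, `Sample`, and `snapshot` are hypothetical names, and using the request's `toString` as its context is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Function;

final class TopNFilter<Req, Rep> implements Function<Req, Rep> {
    static final class Sample {
        final String request;
        final long nanos;
        Sample(String request, long nanos) {
            this.request = request;
            this.nanos = nanos;
        }
    }

    private static final Comparator<Sample> BY_LATENCY =
            Comparator.comparingLong((Sample s) -> s.nanos);

    private final Function<Req, Rep> service;
    private final int n;
    // Min-heap on latency: the head is the cheapest of the current worst offenders.
    private final PriorityQueue<Sample> worst = new PriorityQueue<>(BY_LATENCY);

    TopNFilter(int n, Function<Req, Rep> service) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        this.n = n;
        this.service = service;
    }

    @Override
    public Rep apply(Req req) {
        long start = System.nanoTime();
        try {
            return service.apply(req); // the wrapped service is just a function
        } finally {
            record(req, System.nanoTime() - start);
        }
    }

    // Package-private so the bookkeeping can be exercised directly.
    synchronized void record(Req req, long nanos) {
        if (worst.size() >= n) {
            if (worst.peek().nanos >= nanos) return; // not slow enough to make the cut
            worst.poll(); // evict the cheapest of the current top N
        }
        worst.add(new Sample(String.valueOf(req), nanos));
    }

    // What an admin HTTP endpoint could return: slowest first.
    synchronized List<Sample> snapshot() {
        List<Sample> out = new ArrayList<>(worst);
        out.sort(BY_LATENCY.reversed());
        return out;
    }
}
```

Most requests exit `record` after one comparison against the heap head, so the request context is only captured for calls that actually enter the top N. A single lock is used here for simplicity; whether that is cheap enough depends on the request rate.
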
  16. Deep Dive – What Happened?
    • High usage service had increasing number of instances and unique metrics generated over time
    • Temporal index lookup was VERY expensive
    • For a SINGLE dashboard open on a developer’s desktop!
    • Impacted SR (success rate) across all of our alerting, from one user
    • Updated user’s dashboard, impact mitigated
  17. Deep Dive – What Happened? There’s MORE
    • This isn’t an acceptable solution
    • Something else is wrong here, we have caching in place
    • The code was reviewed, tested, and had improved things in the past
    • This shouldn’t have been a big issue
  18. Deep Dive – In Memory Cache
    • What’s off here?
    • Not blocking other puts
    • Expensive calculation
    • GC spiral continues if cache doesn’t fill in time
  19. Bad Cache Behavior
    (Timeline diagram: five get(a) calls, each followed by "processing...", then "1st insert into cache!", then a further get(a) marked "1st cache retrieval!")
  20. Deep Dive – In Memory Cache (fixed)
    • Always go to the cache
    • Load new value if not present
    • Block if loading
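
The fixed cache code is not included in the transcript, and the slides do not say which cache implementation was used. As a minimal sketch of the three rules above using only the Java standard library: `ConcurrentHashMap.computeIfAbsent` runs the loader at most once per key, and concurrent callers for the same key wait for that result instead of each repeating the expensive calculation. `LoadingCacheSketch`, the `loads` counter, and the demo helper are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class LoadingCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    final AtomicInteger loads = new AtomicInteger(); // counts loader runs, for the demo only

    LoadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // Always go to the cache; load if absent; callers for the same key block while loading.
        return entries.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Demo helper: hit one key from several threads and report how many loads ran.
    static int loadsUnderContention(int threads) {
        LoadingCacheSketch<String, String> cache = new LoadingCacheSketch<>(k -> {
            try {
                Thread.sleep(50); // stand-in for the expensive calculation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return k.toUpperCase();
        });
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> cache.get("a"));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return cache.loads.get();
    }
}
```

Compared with the timeline on the previous slide, the concurrent get(a) calls now produce one load instead of one per caller. Note that `computeIfAbsent` can also briefly block updates to other keys that share a hash bin while the loader runs, and this sketch has no eviction or expiry.
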