
Monitorama PDX 2017 - Ian Bennett


Monitorama

May 24, 2017


Transcript

  1. Critical to Calm
    Debugging Distributed Systems
    @enbnt
    #talk-ian-bennett


  2. Case Study – Diving Instances
    A Simple Fix


  3. Diving Instances
    • Stateless service run in Aurora
    • Dashboard Says…
    • Host Uptime avg < 1 hour
    • High GC
    • Aurora is Failing Healthchecks due to long GC pauses
    • Do some profiling
    • Attach profiler
    • Look at CPU Samples
    • It says…


  4. Diving Hosts – CPU Sampled Stacks
    This is determining the hash bucket for Scala
    HashTrieMap Key


  5. Diving Hosts - Fix
    • Map’s Key type is a Case Class
    • Scala Case Class auto-generates hashCode for you
    • Fields contained collections
    • Auto-generated hash code hashed over every element in collection
    • Solution - Provide custom hashCode method
    • Measure & Validate -> Uptime to hours/days
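The fix can be sketched in a few lines. This is a hypothetical reconstruction, not the actual service code (the deck names no classes), but it shows the shape of the problem and the solution:

```scala
// Hypothetical reconstruction: a Map key that is a case class whose
// fields contain collections. The compiler-generated hashCode hashes
// every element of `shards`, which is what dominated CPU inside
// HashTrieMap lookups. A custom hashCode over cheap fields keeps
// equal keys hashing equal without walking the whole collection.
case class RouteKey(name: String, shards: Vector[Int]) {
  override val hashCode: Int = (name, shards.length).hashCode
}
```

Equality still compares the full collection, so correctness is preserved; only the hash computation got cheaper.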


  6. Diving Hosts - Uptime
    Fix Applied


  7. Methodology
    How we approach problems
    Measure
    Analyze
    Adjust
    Verify


  8. Peeling the Onion
    Metrics → Tuning → Tracing/Logs → Profiling → Custom Instrumentation or Code Change
    (start outside, work in)


  9. When to Peel
    (onion diagram repeated: Metrics → Tuning → Tracing/Logs → Profiling → Custom Instrumentation or Code Change; cycle: Measure, Analyze, Adjust, Verify)
    • Make a single change at your current layer
    • Repeat the process
    • Avoid the urge to make too many changes
    • +/- impact needs to be measured!
    • Verify you haven’t changed behavior
    • Remember to re-rank and prioritize
    • You may uncover a hidden bottleneck
    • You may not have enough information to make the next change
    • If you can’t move the needle - peel


  10. Profiling! – What is it good for?
    Performance: IO, Memory, CPU


  11. Profiling! – But Distributed!?


  12. Profiling! – But Distributed!? – Probably This


  13. Common Issues
    • Logging
    • Auto-boxing
    • Caching/Memoization
    • Concurrency Issues
    • Expensive RPC
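A quick illustration of the auto-boxing item (my example, not from the talk): generic collections of primitives box every element, while primitive arrays do not.

```scala
// Each Int in a List[Int] is boxed into a java.lang.Integer on the
// heap; an Array[Int] compiles to a primitive int[] with no
// per-element allocation, which matters on hot paths.
val boxed: List[Int] = (1 to 1000).toList
val unboxed: Array[Int] = (1 to 1000).toArray
// Same values either way; only the memory representation differs.
```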


  14. Performance Fix - Pro Tip
    • Keep your code abstraction sane
    • Performance fixes should be as isolated as possible
    • Don’t expose these pain points to your consumers
    • WHY?
    • Your bitmap xor bit shifting super efficient String de-duplicating object pooling concurrent Set is awesome
    • Your consumers shouldn’t need to know the internals of your algorithm
    • This code becomes very brittle and harder to work around
    • Your micro-optimization might become obsolete
    • New Library implementation could remove bottleneck
    • Compiler improvements could make your solution second best
    • Usage patterns of the application can change
    • Uncaught bugs



  15. Fire - Critical Issues, Pager Angry!
    (onion diagram repeated: Metrics → Tuning → Tracing/Logs → Profiling → Custom Instrumentation or Code Change)
    • Stick to the methodology
    • Don’t panic
    • Don’t create more problems by skipping steps or guessing
    • Your gut can/will be wrong
    • You will be tired
    • You will be frustrated
    • You may need to spend days iterating on changes to remove bottlenecks
    • Let the methodology guide you to a solution & reduce stress


  16. Case Study – Heavy GC
    When you can’t tune your way out


  17. Heavy GC – String Formatting
    • Stateless service run on bare metal
    • High CPU, Expensive GC, Low Throughput
    • Attached profiler to service
    • Object Allocation recording enabled
    • String.format calls causing excessive allocations
    • Let’s take a look at what String.format does under the covers…


  18. Heavy GC – String Formatting
    String format for metric IDs
    Fully qualified Metric ID is generated


  19. Heavy GC – String Formatting (Bad?)


  20. Heavy GC – String Formatting (fix)
    Cache the most frequently used metrics to prevent building the fully qualified metric ID
    Add value to metric
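A minimal sketch of that fix, assuming names and shapes (the slide shows a screenshot, not code): cache the formatted ID so the hot path skips String.format entirely.

```scala
import scala.collection.concurrent.TrieMap

object MetricIds {
  // (scope, name) -> fully qualified metric ID, built at most once.
  private val cache = TrieMap.empty[(String, String), String]

  // Hot path is a map lookup; String.format only runs on a miss.
  def fullyQualified(scope: String, name: String): String =
    cache.getOrElseUpdate((scope, name), String.format("%s/%s", scope, name))
}
```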


  21. Heavy GC – But Perf Still Sad :’(
    • Introducing the String Formatting fix helped GC
    • But our performance issue wasn’t resolved
    • Another bottleneck was lurking behind String.format


  22. Heavy GC – Object Allocations
    Eagerly initializing
    many collections


  23. Heavy GC – Second Fix
    • Analyzing the code found
    • Many sub-collections were being eagerly created
    • Only a small fraction of these collections were used per request cycle
    • Solution – Lazy Initialization
    • Use static immutable shared empty collection
    • Only on first mutation will the collection initialize
    • Applying these 2 fixes increased capacity for our distributed Key/Value service (Manhattan)
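The lazy-initialization idea, sketched in plain Scala (names are illustrative; the real service code isn't shown):

```scala
// No sub-collections allocated up front; Vector.empty is a shared
// immutable singleton, so untouched keys cost nothing.
final class Buckets {
  private var buckets = Map.empty[String, Vector[Long]]

  // The first mutation of a key is what finally allocates its collection.
  def add(key: String, value: Long): Unit = {
    val current = buckets.getOrElse(key, Vector.empty[Long])
    buckets = buckets.updated(key, current :+ value)
  }

  def get(key: String): Vector[Long] =
    buckets.getOrElse(key, Vector.empty)
}
```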


  24. Case Study – Logging
    Know your logger!


  25. Logging
    Can anyone tell me what’s happening here?


  26. Logging
    Now what’s happening?


  27. Logging
    • Know your logging framework
    • Correctly scope your logging levels
    • Make sure debug/trace aren’t impacting load when disabled
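The "disabled levels still cost you" trap can be shown with a toy logger (not the framework from the talk): a strict message parameter is built even when the level is off, while a by-name one is not.

```scala
final class ToyLogger(debugEnabled: Boolean) {
  // Strict parameter: caller builds the message before the check runs.
  def debugStrict(msg: String): Unit = if (debugEnabled) println(msg)
  // By-name parameter: message is only built if debug is enabled.
  def debugLazy(msg: => String): Unit = if (debugEnabled) println(msg)
}

var rendered = 0
def expensiveDump(): String = { rendered += 1; "full state dump" }

val log = new ToyLogger(debugEnabled = false)
log.debugStrict(expensiveDump()) // pays the formatting cost, logs nothing
log.debugLazy(expensiveDump())   // never evaluates the message
```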


  28. Case Study – Craziness
    When Profiling isn’t enough


  29. Deep Dive – Craziness…
    • Nuthatch is our temporal indexing service
    • The Logging Fix we went over was also part of this deep dive
    • Success Rate plummeting below 80% between 9 AM – 5 PM
    • Except during lunch hour!


  30. Deep Dive – SR Craziness



  32. Deep Dive – Use Your Toolbox
    • Dashboards
    • GC spirals out of control
    • System memory usage nearing 100%
    • Can’t increase heap, GC too intense for simple tuning
    • Narrowed down specific hosts with worst case issues
    • Pulled Logs
    • Not enough information
    • Attached Profiler
    • Pinpoint hot code (CPU Profiling, Allocation Profiling)
    • Numerous fixes in canary, improvement, but not enough


  33. Deep Dive – Engineers Required
    • Need more context and more logging
    • Zipkin and LogLens don’t have enough info
    • Logging needs to be bolstered
    • Finagle is awesome – here’s why:
    • Your service is a function
    • Filters are easy to create
    • Create a low-overhead filter to track Top N worst offenders
    • Keep request/response context
    • Open up HTTP endpoint to sample stats from this filter
    • Culprit Found!
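The "your service is a function, wrap it in a filter" idea, modeled without the real Finagle library (which composes services with com.twitter.util.Future): a synchronous sketch of a low-overhead top-N offender tracker.

```scala
import scala.collection.mutable

// Keeps the N slowest (elapsed nanos, request) pairs seen so far.
final class TopNFilter[Req, Rep](n: Int) {
  // Min-heap on elapsed time: the cheapest kept entry is evicted first.
  val worst = mutable.PriorityQueue.empty[(Long, Req)](
    Ordering.by[(Long, Req), Long](_._1).reverse)

  // A "service" here is just Req => Rep; the filter wraps it.
  def apply(service: Req => Rep): Req => Rep = { req =>
    val start = System.nanoTime()
    val rep = service(req)
    worst.enqueue((System.nanoTime() - start, req))
    if (worst.size > n) worst.dequeue() // drop the fastest offender
    rep
  }
}
```

In the real fix an HTTP endpoint exposed these samples; here they just sit in memory.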


  34. Deep Dive – SR Craziness
    Fix Applied!


  35. Deep Dive - Failures
    Fix Applied!
    800k failures per hour!


  36. Deep Dive – What Happened?
    • High usage service had an increasing number of instances and unique metrics generated over time
    • Temporal Index lookup was VERY expensive
    • For a SINGLE Dashboard open on a developer’s desktop!
    • Impacted SR across all of our alerting, from one user
    • Updated User’s Dashboard, Impact Mitigated


  37. Deep Dive – What Happened? There’s MORE
    • This isn’t an acceptable solution
    • Something else is wrong here, we have caching in place
    • The code was reviewed, tested, improved things in the past
    • This shouldn’t have been a big issue


  38. Deep Dive – In Memory Cache
    What’s off here?


  39. Deep Dive – In Memory Cache
    What’s off here?
    Not blocking other puts
    Expensive calculation
    GC spiral continues if the cache doesn’t fill in time


  40. Bad Cache Behavior
    Repeated concurrent get(a) calls each kick off their own expensive computation; the 1st insert into the cache only lands once the first computation finishes, so only the last get(a) is the 1st cache retrieval.


  41. Deep Dive – In Memory Cache (fixed)
    Always go to the cache; load a new value if not present; block if loading
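A sketch of the corrected cache (shape assumed; the slide shows a screenshot): storing a Future[Value] means the first caller installs a pending computation and every concurrent caller shares it, so the expensive load runs exactly once per key.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{Future, Promise}
import scala.util.Try

final class FutureCache[K, V](load: K => V) {
  private val cache = TrieMap.empty[K, Future[V]]

  def get(k: K): Future[V] = {
    val p = Promise[V]()
    cache.putIfAbsent(k, p.future) match {
      case Some(inFlight) => inFlight  // share the pending/completed load
      case None =>
        p.complete(Try(load(k)))       // we won the race: load once
        p.future
    }
  }
}
```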


  42. Better Cache Behavior
    Every concurrent get(a) call shares a single Future[Value]; the computation runs once and all callers are served its result.


  43. Deep Dive – System Memory % Used
    Fix Rolled Out!
    ~30% more headroom

