Consistency in Observability of Microservices at Lyft (Monitorama 2017 PDX)

Yann Ramin

May 24, 2017

  1. Consistency in Observability of Microservices @ Lyft
    Fighting Entropy and Delaying the Heat Death of the Universe
    Yann Ramin @theatrus - yramin@lyft.com
  2. Monitorama 2017 PDX / @theatrus
    Hi! I’m Yann
    Engineering Manager
    Observability Team @ Lyft
    Logging, TSDBs, performance profiling, low-level infrastructure
    Previously from the world of embedded systems
    “C is the best programming language”
  3. Approaches and techniques to avoid production incidents
    with hundreds of microservices and diverse teams
  4. What are we trying to solve?
    When developers are on-call with microservices
    or scaling operational mindfulness
  5. “I didn’t realize it was logging errors all weekend.”
  6. “I clicked production deploy and Jenkins went green! [closes laptop, goes home]”
    This is an opportunity for excellence
  7. “No one set up a PagerDuty rotation before going to production!”
  8. “We need alarms on thingX! Let’s copy and paste them from my last service!”
  9. “I have no idea what is broken.”
  10. We routinely approach monitoring as

  11. We need to abstract and

  12. Some backstory
    Consistency leads to consistency

  13. ~ Today
    Python, PHP, and Go
    Common base libraries for each language
    Hundreds of microservices, no monorepo*
    Deploys frequently, sometimes “never”
    Common “base” deploy, Salt (masterless), AWS
  14. DevOps?
    Teams on-call for their services
    No operations
    No SRE
    Infrastructure team enables, not operates
  15. System Metrics and Pipeline
    Get the basics right

  16. Make it easy to produce metrics (everywhere) (safely)
  17. Big Secret! We use the statsd protocol
  18. Comfortable Compartmentalization

  19. Lyft’s Pipeline Setup
    Cascaded github.com/lyft/statsrelay
    github.com/lyft/statsite ➡ TSDBs
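
The statsd protocol mentioned in the slides is a plain-text line format, usually sent over UDP to a local daemon. A minimal sketch of producing such lines follows; the function names, metric names, and the 127.0.0.1:8125 address are assumptions for illustration and are not taken from Lyft's libraries.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=1.0):
    """Render one metric as a statsd line: name:value|type[|@rate]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # The receiver scales sampled counters back up by 1/rate.
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local relay (address is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


counter_line = format_metric("api.requests", 1, "c")
timer_line = format_metric("api.latency", 12.5, "ms", sample_rate=0.1)
```

Sending to a relay on the same host keeps the UDP traffic off the network, which matches the cascaded layout the slide describes.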

  20. Data you get
    Service level aggregates centrally (correct histograms)
    Per host data processed locally
    Default 60 second period(!), option for 1 second
  21. Monitor yourself
    Send to central aggregation
    Local stats
  22. So many events

  23. So many events
    Billions per second, with aggregation/sampling
    This is only ~200k metrics per second, thanks to rollups
    Per-instance cardinality limits
    Opt-in mechanisms for per-host and per-second data
  24. System metrics
    CollectD
    Custom scripts
    Even some bash functions
    All to local statsrelay
  25. What do we get?
    Comprehensive system metrics
    All metric producers are sandboxed, rate limited, and monitored
    No UDP spam over the network
    LIFO queueing
  26. Instrument the core libraries

  27. Problem
    Developers add metrics after it already broke once
    Adding metrics on core functions is not DRY
    Not all developers think in terms of metrics
  28. class StatsdHandler(logging.Handler)
    every log call now produces metrics scoped by severity!
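
The slide only names the class, so the following is a sketch of how a logging handler could turn each log call into a counter scoped by severity. The `emit_counter` callable, the `logging` prefix, and the logger name are hypothetical stand-ins for a real statsd client; this is not Lyft's implementation.

```python
import logging


class StatsdHandler(logging.Handler):
    """Logging handler that counts every log record by severity."""

    def __init__(self, emit_counter, prefix="logging"):
        super().__init__()
        # Hypothetical callable (name, value) -> None, e.g. a statsd client's incr.
        self._emit_counter = emit_counter
        self._prefix = prefix

    def emit(self, record):
        try:
            self._emit_counter(f"{self._prefix}.{record.levelname.lower()}", 1)
        except Exception:
            # Never let metrics failures break the application's logging.
            self.handleError(record)


# Usage: collect counts in a dict instead of sending them anywhere.
counts = {}


def count(name, value):
    counts[name] = counts.get(name, 0) + value


logger = logging.getLogger("example_service")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the example quiet on stderr
logger.addHandler(StatsdHandler(count))
logger.error("payment failed")
logger.error("payment failed again")
logger.warning("slow response")
```

Installing a handler like this in a shared base library means a service that logs errors all weekend also produces an error-rate metric that can be alarmed on, without the service author adding any metrics code.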
  29. Instrument the RPC
    Record RPC inbound errors, successes, timings in core servers
    Gunicorn + https://pypi.python.org/pypi/blinker + statsd
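
The slides do not show how Gunicorn and blinker are wired together, so the sketch below swaps in a different, simpler technique: a plain WSGI middleware that records inbound successes, errors, and timings for every request. The `record` callable and the metric names are assumptions for illustration.

```python
import time


class StatsMiddleware:
    """WSGI middleware that records outcome and latency of each request."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # hypothetical callable (name, value) -> None

    def __call__(self, environ, start_response):
        start = time.monotonic()
        status_holder = {}

        def capturing_start_response(status, headers, exc_info=None):
            # The status string looks like "200 OK"; keep the numeric code.
            status_holder["code"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capturing_start_response)
        except Exception:
            status_holder["code"] = 500
            raise
        finally:
            # Measures time until the app returns its iterable,
            # not the time spent streaming the response body.
            elapsed_ms = (time.monotonic() - start) * 1000.0
            code = status_holder.get("code", 500)
            outcome = "error" if code >= 500 else "success"
            self.record(f"rpc.inbound.{outcome}", 1)
            self.record("rpc.inbound.time_ms", elapsed_ms)
```

Wrapping the application once in the shared server setup gives every service the same inbound metrics, which is the consistency the talk argues for.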
  30. Don’t be afraid of monkey

  31. gunicorn.conf

  32. It’s also the worst. (not

  35. What do we get?
    Consistent Measurement
    Consistent Visibility
    Point to Point Debugging
    Unified Tracing
  36. Deploy Time Standards

  37. orca
    Salt module for “Orchestration”
    Provisions all remote resources a service needs during deploy
    Interacts with PagerDuty, makes sure a PD service is created
    Makes sure an on-call schedule is associated
    Otherwise blocks production deploy
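
The deploy gate described for orca can be illustrated with a small check that refuses to proceed when paging is not configured. This is only a sketch of the idea: the `service` dictionary, its field names, and the `DeployBlocked` exception are hypothetical, and no real Salt or PagerDuty API is called.

```python
class DeployBlocked(Exception):
    """Raised when a service is not ready to be deployed to production."""


def check_oncall_ready(service):
    """Block the deploy unless paging is fully configured.

    `service` is a hypothetical dict with the keys "name",
    "pagerduty_service_id", and "oncall_schedule_id".
    """
    problems = []
    if not service.get("pagerduty_service_id"):
        problems.append("no PagerDuty service")
    if not service.get("oncall_schedule_id"):
        problems.append("no on-call schedule")
    if problems:
        raise DeployBlocked(f"{service['name']}: " + ", ".join(problems))
```

Running a check like this as part of the deploy itself removes the "no one set up a rotation" failure mode, because the deploy cannot succeed without one.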
  38. Application Metrics
    dtrt, easily

  39. Python
    from lyft_stats.stats import get_stats
    stat_handler = get_stats('my_prefix').scope('foo')
    stat_handler.incr('counter.example', 10, tags={'bar': 'baz'})
    stat_handler.timer('sample.timer', 10.0, per_host=True)
  40. Go: https://github.com/lyft/gostats

  41. We also don’t discuss PHP.

  42. Gripes
    statsd calls are not cheap (syscalls)
    Various workarounds (.pipeline())
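
The cost mentioned here is one `sendto` syscall per metric. A pipeline-style workaround buffers several statsd lines and sends them newline-separated in one datagram. The sketch below is a generic illustration of that idea; the `StatsPipeline` class, its `send` callable, and the 1400-byte limit are assumptions, not the API of Lyft's library.

```python
class StatsPipeline:
    """Buffer statsd lines and flush them in as few datagrams as possible."""

    def __init__(self, send, max_bytes=1400):
        self._send = send            # hypothetical callable taking one bytes payload
        self._max_bytes = max_bytes  # stay under a typical network MTU
        self._lines = []
        self._size = 0

    def incr(self, name, value=1):
        self._add(f"{name}:{value}|c")

    def timing(self, name, millis):
        self._add(f"{name}:{millis}|ms")

    def _add(self, line):
        encoded = line.encode("ascii")
        # Account for the newline separator when the buffer is not empty.
        extra = len(encoded) + (1 if self._lines else 0)
        if self._lines and self._size + extra > self._max_bytes:
            self.flush()
            extra = len(encoded)
        self._lines.append(encoded)
        self._size += extra

    def flush(self):
        if self._lines:
            self._send(b"\n".join(self._lines))
        self._lines = []
        self._size = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()
        return False


# Usage: three metrics, one payload.
sent = []
with StatsPipeline(sent.append) as pipe:
    pipe.incr("a")
    pipe.incr("b", 2)
    pipe.timing("c", 5)
```

In a real client `send` would wrap a UDP socket, so a request that emits dozens of metrics pays for a handful of syscalls instead of one per metric.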
  43. Further Improvements
    libtelem - native library to handle consistent instrumentation, shared memory views
    Unified tracing, logging and metrics
  44. Unified Data
    Lots of cron scripts and Grafana

  45. Multiple System Syndrome
    CloudWatch is the only source for some data
    CloudWatch: “Hold on, let me log in”
    “My MFA token doesn’t work, can someone else log in?”
    Using different systems is distracting, delays debugging
  46. Fewer tools
    The fewer tools for triage, the better
    Context switching is expensive
    Put everything up front, or one click away
  47. Either federate (Grafana plugins)
    Or just copy it
  48. dashboards dot git dot cat?

  49. Central Monitoring and Alarm Hub
    Git Monorepo
    Ties in with our Salt infrastructure
    Dashboards defined as Salt states, deploys like a service
    Iteration to staging Grafana
    Manages Grafana, Wavefront, other services
  50. Every service gets a default dashboard with base alarms
  51. Add
    Services define extra resources they use in a Salt pillar
    Can also define custom rows and targets
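
The idea of a default dashboard plus per-service additions can be sketched as a template that is expanded for each service. The talk says Lyft defines dashboards as Salt states written in Jinja2+YAML; the plain-Python version below is a simplified stand-in, and the row titles, metric names, and dictionary layout are all hypothetical.

```python
# Rows every service gets; "{service}" is filled in per service.
DEFAULT_ROWS = [
    {"title": "Requests",
     "targets": ["{service}.rpc.inbound.success", "{service}.rpc.inbound.error"]},
    {"title": "Logging",
     "targets": ["{service}.logging.error"]},
]


def build_dashboard(service, extra_rows=()):
    """Return a dashboard definition: default rows first, then custom rows."""
    rows = []
    for row in list(DEFAULT_ROWS) + list(extra_rows):
        rows.append({
            "title": row["title"],
            "targets": [t.format(service=service) for t in row["targets"]],
        })
    return {"title": f"{service} (default)", "rows": rows}


# Usage: a service that also depends on a queue adds one custom row.
dashboard = build_dashboard(
    "rides",
    extra_rows=[{"title": "Queue", "targets": ["{service}.sqs.depth"]}],
)
```

Because the default rows live in one place, changing them updates every service's dashboard at once, which is what makes global refactoring practical.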
  53. Infrastructure or Dependent Teams Own the Monitoring
  54. Consistent Look and Feel
    Same rows, same metrics, approachable
  55. Code Dashboard/Alarm Review

  56. grep

  57. Other advantages
    Alarms exist on dashboards
    Less copy and paste
    Global refactoring
    Customize alarms without learning about all the nuances
  58. Even more features
    Contains a query parser and rewriter (plug for pyPEG2)
    Parse query, transform queries into alternate forms
    Generate “deploy” (canaries vs. production) dashboards
    Automatic staging environment dashboards
    Best practices lint
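
Transforming a query into an alternate form, for example to derive a staging dashboard from a production one, can be illustrated with a much simpler technique than the PEG parser the slide mentions. The sketch below uses a regular expression instead of pyPEG2, and it assumes a hypothetical naming convention in which metric paths start with an environment segment such as `production.`; neither the convention nor the function name comes from the talk.

```python
import re

# Match "production." only at the start of a metric path, i.e. when it is
# not preceded by another path segment or identifier character.
_ENV_PREFIX = re.compile(r"(?<![\w.])production\.")


def rewrite_environment(query, target_env):
    """Point every production metric path in `query` at `target_env`."""
    return _ENV_PREFIX.sub(lambda match: f"{target_env}.", query)


staging_query = rewrite_environment(
    "sumSeries(production.rides.rpc.inbound.error)", "staging"
)
```

A regular expression cannot understand nesting or quoting, which is presumably why a real parser was used; the sketch only shows the shape of the transformation.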
  60. What Sucks
    Grafana has a UI builder
    We’re making you write dashboards as Jinja2+YAML
    Small workaround tooling: Grafana management script
    python tools/manage_dashboard.py
  61. What Sucks
    Monorepo was great to bootstrap and iterate
    Poor in all the classic monorepo ways
  62. Query languages, alarming, and UX far too difficult
    For non-experts, and we shouldn’t expect our users to become experts in the field
  63. Enrichment

  65. Trust and Self Defense
    Will the dashboard load?
    Can I trust the data?
    Can you take it down?
  67. Provide visibility, out of band, of monitoring health
    Updates every 15s
    Updates … a few hours later?
  68. Add smart limits
    lyft/statsrelay samples/aggregates to limit maximum outgoing statsd event rate
    Limits cardinality of counters/timers/gauges
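
A cardinality limit caps how many distinct metric names a single producer may create, so that one misbehaving service cannot flood the pipeline with new series. The sketch below shows the general idea only; it is not how lyft/statsrelay implements it, and the class and attribute names are hypothetical.

```python
class CardinalityLimiter:
    """Admit at most `max_names` distinct metric names; drop the rest."""

    def __init__(self, max_names):
        self._max_names = max_names
        self._seen = set()
        self.dropped = 0  # count of rejected events, itself worth exporting

    def allow(self, name):
        if name in self._seen:
            return True
        if len(self._seen) < self._max_names:
            self._seen.add(name)
            return True
        self.dropped += 1
        return False


# Usage: with a limit of two names, a third new name is rejected.
limiter = CardinalityLimiter(max_names=2)
decisions = [limiter.allow(name) for name in ["a", "b", "a", "c", "c"]]
```

Exporting the `dropped` count as a metric ties back to the earlier "monitor yourself" slide: the limiter protects the pipeline and also reports when it had to.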
  69. We’re not done yet. (Is anything ever done?)