Consistency in Monitoring and Observability of Microservices
Fighting Entropy and Delaying the Heat Death of the Universe
Yann Ramin
@theatrus - [email protected]
Monitorama 2017 PDX / @theatrus
Hi!
I’m Yann
@theatrus
Engineering Manager
Observability Team @ Lyft
Logging, TSDBs, performance profiling, low-level infrastructure
Previously from the world of embedded systems
“C is the best programming language”
Approaches and techniques to
avoid production incidents with
hundreds of micro services and
diverse teams
What are we trying to
solve?
When developers are on-call with microservices
or
scaling operational mindfulness
“I didn’t realize it was logging errors all
weekend.”
“I clicked production deploy and Jenkins
went green! [closes laptop, goes home]”
This is an opportunity
for excellence
“No one set up a PagerDuty rotation
before going to production!”
“We need alarms on thingX! Let’s copy
and paste them from my last service!”
“I have no idea what is broken.”
We routinely approach monitoring as operations
We need to abstract and modularize
Some backstory
Consistency leads to consistency
~ Today
Python, PHP, and Go
Common base libraries for each language
Hundreds of micro services, no monorepo*
Deploys frequently, sometimes “never”
Common “base” deploy, Salt (masterless), AWS
DevOps?
Teams on-call for their services
No operations
No SRE
Infrastructure team enables, not operates
System Metrics and
Pipeline
Get the basics right
Make it easy to produce metrics
(everywhere)
(safely)
Big Secret!
We use the statsd protocol
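The statsd protocol is just short text datagrams over UDP, which is why it is easy to produce metrics from anywhere. A minimal sketch of a sender (metric names and the default port 8125 are illustrative):

```python
import socket

def send_stat(name, value, stat_type, host="127.0.0.1", port=8125):
    """Emit one statsd datagram, e.g. b'requests:1|c' for a counter."""
    payload = f"{name}:{value}|{stat_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

# Counters, timers (ms), and gauges all share the same wire format:
# send_stat("rpc.success", 1, "c")
# send_stat("rpc.latency", 23, "ms")
# send_stat("queue.depth", 42, "g")
```

Because it is fire-and-forget UDP, a misbehaving producer can't block the application, which matters when every host is emitting.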
Data you get
Service level aggregates centrally (correct histograms)
Per host data processed locally
Default 60 second period(!), option for 1 second
Monitor yourself
Send to central aggregation
Local stats
So many events
Billions per second, with aggregation/sampling
This is only ~200k metrics per second, thanks to rollups
Per-instance cardinality limits
Opt-in mechanisms for per-host and per-second data
System metrics
CollectD
Custom scripts
Even some bash functions
All to local statsrelay
What do we get?
Comprehensive system metrics
All metric producers are sandboxed, rate limited, and
monitored
No UDP spam over the network
LIFO queueing
Instrument the core libraries
Problem
Developers add metrics after it already broke once
Adding metrics on core functions is not DRY
Not all developers think in terms of metrics
class StatsdHandler(logging.Handler)
every log call now produces metrics scoped by severity!
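A minimal sketch of what such a handler can look like. Here an in-memory dict stands in for the real statsd client (an assumption; the actual emit target is not shown in the deck):

```python
import logging

class StatsdHandler(logging.Handler):
    """Count every log record by severity. The in-memory `stats` dict
    is a stand-in for a real statsd counter call (assumption)."""

    def __init__(self, stats=None):
        super().__init__()
        self.stats = stats if stats is not None else {}

    def emit(self, record):
        # One counter per severity, e.g. "log.error", "log.warning"
        key = f"log.{record.levelname.lower()}"
        self.stats[key] = self.stats.get(key, 0) + 1
```

Attached once to the root logger in the common base library, every existing `log.error(...)` call in every service becomes an alertable metric with no per-service work.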
Instrument the RPC
Record RPC inbound errors, successes, timings in core servers
Gunicorn + https://pypi.python.org/pypi/blinker + statsd
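The deck wires this up through gunicorn's hooks and blinker signals; as a generic, framework-free sketch of the same idea, a WSGI middleware can record status and latency for every request (the metric names and `record` callable are illustrative, not the actual Lyft hookup):

```python
import time

class RPCStatsMiddleware:
    """WSGI middleware recording per-request status codes and latency.
    `record(metric_name, value)` stands in for a statsd client call."""

    def __init__(self, app, record):
        self.app = app
        self.record = record

    def __call__(self, environ, start_response):
        start = time.monotonic()
        status_holder = {}

        def capture(status, headers, exc_info=None):
            # Keep the numeric status code, e.g. "200 OK" -> "200"
            status_holder["code"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capture)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            code = status_holder.get("code", "500")
            self.record(f"rpc.status.{code}", 1)
            self.record("rpc.latency_ms", elapsed_ms)
```

Because this lives in the core server setup, every service gets inbound error, success, and timing metrics for free.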
Don’t be afraid of monkey
patching
gunicorn.conf
It’s also the worst.
(not gunicorn)
What do we get?
Consistent Measurement
Consistent Visibility
Point to Point Debugging
Unified Tracing
Deploy Time Standards
orca
Salt module for “Orchestration”
Provisions all remote resources a service needs during deploy
Interacts with PagerDuty, makes sure a PD service is created
Makes sure an on-call schedule is associated
Otherwise blocks production deploy
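The gate itself is simple: look up the PagerDuty service and its on-call schedule, and refuse to continue if either is missing. A minimal sketch, where the lookup callables are assumptions standing in for real PagerDuty API calls (the actual orca module is Salt-based and not shown here):

```python
class DeployBlocked(Exception):
    """Raised when production-deploy prerequisites are missing."""

def check_paging_setup(service, get_pd_service, get_oncall_schedule):
    """Block a production deploy unless PagerDuty is fully wired up.
    `get_pd_service` and `get_oncall_schedule` are injected stand-ins
    for PagerDuty API lookups (assumption)."""
    pd = get_pd_service(service)
    if pd is None:
        raise DeployBlocked(f"no PagerDuty service exists for {service}")
    if get_oncall_schedule(pd) is None:
        raise DeployBlocked(f"no on-call schedule is attached for {service}")
    return True
```

Making the deploy tooling enforce this removes the "no one set up a PagerDuty rotation" failure mode entirely: you simply cannot reach production without one.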
Go
https://github.com/lyft/gostats
We also don’t discuss PHP.
Gripes
statsd calls are not cheap (syscalls)
Various workarounds (.pipeline())
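The pipeline workaround amortizes the syscall cost: buffer several metrics and emit them as one newline-separated datagram, so N stats cost one `sendto` instead of N. A minimal sketch of the idea (not the actual client library):

```python
import socket

class StatsPipeline:
    """Batch several stats into one UDP datagram. Mirrors the
    .pipeline() idea: N metrics, one syscall."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.buf = []

    def incr(self, name, value=1):
        self.buf.append(f"{name}:{value}|c")
        return self  # allow chaining

    def send(self):
        if self.buf:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            # statsd servers accept newline-separated metrics per datagram
            sock.sendto("\n".join(self.buf).encode(), self.addr)
            sock.close()
            self.buf = []
```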
Further Improvements
libtelem - native library to handle consistent
instrumentation, shared memory views
Unified tracing, logging and metrics
Unified Data
Lots of cron scripts and Grafana
Multiple System Syndrome
CloudWatch is the only source for some data
CloudWatch
“Hold on let me log in”
“My MFA token doesn’t work, can someone else log in?”
Using different systems is distracting, delays debugging
Fewer tools
The fewer tools needed for triage, the better
Context switching is expensive
Put everything up front, or one click away
Either federate (Grafana plugins)
Or just copy it
dashboards
dot git
dot cat?
Central Monitoring and
Alarm Hub
Git Monorepo
Ties in with our Salt infrastructure
Dashboards defined as Salt states, deploys like a service
Iteration to staging Grafana
Manages Grafana, Wavefront, other services
Every service gets a default
dashboard with base alarms
Add
Services define extra resources they use in a Salt pillar
Can also define custom rows and targets
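Conceptually, the generator merges a fixed set of base rows with whatever the service declares. A minimal sketch (the real system renders Jinja2+YAML Salt states; the row names and target patterns here are illustrative assumptions):

```python
import json

def build_dashboard(service, extra_rows=()):
    """Render a default dashboard definition for a service, appending
    any custom rows the service declares. Row titles and metric target
    patterns are illustrative, not the actual Lyft conventions."""
    base_rows = [
        {"title": "RPC", "targets": [f"stats.{service}.rpc.*"]},
        {"title": "Errors", "targets": [f"stats.{service}.log.error"]},
        {"title": "System", "targets": [f"stats.{service}.system.*"]},
    ]
    return json.dumps({"title": service, "rows": base_rows + list(extra_rows)})
```

Because the base rows come from one template, renaming a core metric is a single-place change that refactors every service's dashboard at once.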
Infrastructure or Dependent Teams Own
the Monitoring
Consistent Look and Feel
Same rows, same metrics, approachable
Other advantages
Alarms exist on dashboards
Less copy and paste
Global refactoring
Customize alarms without learning about all the nuances
Even more features
Contains a query parser and rewriter
(plug for pyPEG2)
Parse query, transform queries into alternate forms
Generate “deploy” (canaries vs. production) dashboards
Automatic staging environment dashboards
Best practices lint
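The deploy-dashboard transform boils down to rewriting each production query into its canary or staging twin. The real system parses queries with pyPEG2; as a deliberately minimal sketch of the transformation, a regex substitution on an assumed "production" path segment:

```python
import re

def to_deploy_query(query, env="canary"):
    """Rewrite a production metric query into its canary/staging twin.
    The actual implementation uses a pyPEG2 query parser; this regex
    version, and the 'production' prefix, are simplifying assumptions."""
    return re.sub(r"\bproduction\b", env, query)

# to_deploy_query("stats.production.myservice.rpc.errors")
# -> "stats.canary.myservice.rpc.errors"
```

A real parser additionally lets you validate queries and lint for best practices, which plain string rewriting cannot do safely.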
What Sucks
Grafana has a UI builder
We’re making you write dashboards as Jinja2+YAML
Small workaround tooling: Grafana management script
python tools/manage_dashboard.py
What Sucks
Monorepo was great to bootstrap and iterate
Poor in all the classic monorepo ways
Query languages, alarming, and UX are far too difficult
for non-experts, and we shouldn’t expect our
users to become experts in the field
Enrichment
Trust and Self Defense
Will the dashboard load?
Can I trust the data?
Can you take it down?
Provide visibility, out of band,
of monitoring health
“Updates every 15s” vs. “Updates … a few hours later?”
Add smart limits
lyft/statsrelay samples/aggregates to limit maximum outgoing
statsd event rate
Limits cardinality of counters/timers/gauges
my.metric.0x293fa3a93…
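A cardinality limit protects the pipeline from exactly that failure: a bug that embeds a unique value (a pointer, a request ID) in a metric name. A minimal sketch of a per-instance limiter (not the actual lyft/statsrelay implementation):

```python
class CardinalityLimiter:
    """Admit metric names until a per-instance limit is reached, then
    drop anything new, so runaway unique names can't flood the TSDB."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = set()
        self.dropped = 0

    def allow(self, name):
        if name in self.seen:
            return True           # known name: always passes
        if len(self.seen) < self.limit:
            self.seen.add(name)   # new name under the cap: admit it
            return True
        self.dropped += 1         # over the cap: drop and count it
        return False
```

Exposing `dropped` as its own metric gives teams an out-of-band signal that they are hitting the limit, instead of silently losing data.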