Consistency in Observability of Microservices at Lyft (Monitorama 2017 PDX)

Consistency in Monitoring Observability of Microservices @ Lyft
Fighting Entropy and Delaying the Heat Death of the Universe
Yann Ramin @theatrus - [email protected]

Monitorama 2017 PDX / @theatrus
Hi! I’m Yann @theatrus
Engineering Manager, Observability Team @ Lyft
Logging, TSDBs, performance profiling, low-level infrastructure
Previously from the world of embedded systems
“C is the best programming language”

Approaches and techniques to avoid production incidents with hundreds of microservices and diverse teams

What are we trying to solve?
When developers are on-call with microservices, or scaling operational mindfulness

“I didn’t realize it was logging errors all weekend.”

“I clicked production deploy and Jenkins went green! [closes laptop, goes home]”
This is an opportunity for excellence

“No one set up a PagerDuty rotation before going to production!”

“We need alarms on thingX! Let’s copy and paste them from my last service!”

“I have no idea what is broken.”

We routinely approach monitoring as operations

We need to abstract and modularize

Some backstory
Consistency leads to consistency

~ Today
Python, PHP, and Go
Common base libraries for each language
Hundreds of microservices, no monorepo*
Deploys frequently, sometimes “never”
Common “base” deploy, Salt (masterless), AWS

DevOps?
Teams on-call for their services
No operations
No SRE
Infrastructure team enables, not operates

System Metrics and Pipeline
Get the basics right

Make it easy to produce metrics (everywhere) (safely)

Big Secret! We use the statsd protocol
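
For readers who have not seen it, the statsd wire format is a plain-text line of the form `name:value|type`, optionally followed by `|@rate` for sampled metrics, and it is usually sent over UDP. A minimal sketch follows. The function names are illustrative rather than Lyft's library, and the address 127.0.0.1:8125 is an assumption (8125 is the conventional statsd port).

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Render one metric in the statsd line format: name:value|type[|@rate]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None and sample_rate < 1:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric line over UDP (assumed local relay)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    print(format_metric("api.requests", 1, "c"))    # counter
    print(format_metric("api.latency", 12.5, "ms")) # timer
    print(format_metric("api.requests", 1, "c", sample_rate=0.1))
```

Because UDP sends do not wait for a reply, a producer written this way cannot be blocked by a slow or missing collector.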

Comfortable Compartmentalization

Lyft’s Pipeline Setup
Cascaded github.com/lyft/statsrelay + github.com/lyft/statsite ➡ TSDBs

Data you get
Service level aggregates centrally (correct histograms)
Per host data processed locally
Default 60 second period(!), option for 1 second

Monitor yourself
Send to central aggregation
Local stats

So many events
Billions per second, with aggregation/sampling
This is only ~200k metrics per second, thanks to rollups
Per-instance cardinality limits
Opt-in mechanisms for per-host and per-second data

System metrics
CollectD
Custom scripts
Even some bash functions
All to local statsrelay

What do we get?
Comprehensive system metrics
All metric producers are sandboxed, rate limited, and monitored
No UDP spam over the network
LIFO queueing

Instrument the core libraries

Problem
Developers add metrics after it already broke once
Adding metrics on core functions is not DRY
Not all developers think in terms of metrics

class StatsdHandler(logging.Handler)
every log call now produces metrics scoped by severity!
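
The slide names only the class, so the following is a sketch of what such a handler might look like, not Lyft's implementation. The `emit_counter` callable and the `logging.<logger>.<level>` metric naming are assumptions made for illustration.

```python
import logging


class StatsdHandler(logging.Handler):
    """Logging handler that counts log records by logger name and severity."""

    def __init__(self, emit_counter, prefix="logging"):
        super().__init__()
        # emit_counter is a hypothetical callable(name, value) that forwards
        # a counter increment to whatever statsd client is in use.
        self._emit_counter = emit_counter
        self._prefix = prefix

    def emit(self, record):
        try:
            name = f"{self._prefix}.{record.name}.{record.levelname.lower()}"
            self._emit_counter(name, 1)
        except Exception:
            # Metrics must never break the application's logging path.
            self.handleError(record)


if __name__ == "__main__":
    seen = []
    logger = logging.getLogger("payments")
    logger.setLevel(logging.INFO)
    logger.addHandler(StatsdHandler(lambda name, value: seen.append((name, value))))
    logger.error("charge failed")
    logger.info("charge retried")
    print(seen)
```

Attaching the handler once in a shared base library means every service gets error-rate metrics without any per-service code.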

Instrument the RPC
Record RPC inbound errors, successes, timings in core servers
Gunicorn + https://pypi.python.org/pypi/blinker + statsd

Don’t be afraid of monkey patching

gunicorn.conf

It’s also the worst. (not gunicorn)

What do we get?
Consistent Measurement
Consistent Visibility
Point to Point Debugging
Unified Tracing

Deploy Time Standards

orca
Salt module for “Orchestration”
Provisions all remote resources a service needs during deploy
Interacts with PagerDuty, makes sure a PD service is created
Makes sure an on-call schedule is associated
Otherwise blocks production deploy
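
The gating logic described above can be sketched as a simple pre-deploy check. This is an illustration only: orca is a Salt module whose code is not shown in the deck, and the `pagerduty_services` mapping below is a hypothetical stand-in for data that would come from PagerDuty.

```python
def check_deploy_allowed(service_name, pagerduty_services):
    """Return (allowed, reason) for a production deploy of service_name.

    pagerduty_services is a hypothetical mapping of service name to a dict
    with an optional "schedule" key, standing in for data that would be
    fetched from PagerDuty.
    """
    entry = pagerduty_services.get(service_name)
    if entry is None:
        return False, "no PagerDuty service exists"
    if not entry.get("schedule"):
        return False, "no on-call schedule is associated"
    return True, "ok"


if __name__ == "__main__":
    known = {"rides": {"schedule": "rides-primary"}, "maps": {}}
    for name in ("rides", "maps", "newsvc"):
        print(name, check_deploy_allowed(name, known))
```

Putting the check in the deploy path, rather than in a checklist, is what makes the standard enforceable.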

Application Metrics
dtrt, easily

Python
    from lyft_stats.stats import get_stats

    stat_handler = get_stats('my_prefix').scope('foo')
    stat_handler.incr('foo.counter')
    stat_handler.incr('counter.example', 10, tags={'bar': 'baz'})
    stat_handler.timer('sample.timer', 10.0, per_host=True)

Go
https://github.com/lyft/gostats

We also don’t discuss PHP.

Gripes
statsd calls are not cheap (syscalls)
Various workarounds (.pipeline())
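
The cost comes from one `sendto` syscall per metric. statsd accepts several newline-separated metrics in one datagram, so a pipeline can buffer lines and flush them together. The sketch below shows the idea under that assumption. The `Pipeline` class, its 1400-byte default, and `udp_sender` are illustrative and not the API of any particular client.

```python
import socket


class Pipeline:
    """Buffer statsd lines and flush them in as few UDP datagrams as possible."""

    def __init__(self, send, max_bytes=1400):
        self._send = send            # callable taking one bytes payload
        self._max_bytes = max_bytes  # keep each datagram under a typical MTU
        self._lines = []
        self._size = 0

    def incr(self, name, value=1):
        self._add(f"{name}:{value}|c")

    def _add(self, line):
        encoded = line.encode("ascii")
        # +1 accounts for the newline that joins lines in a datagram.
        if self._lines and self._size + 1 + len(encoded) > self._max_bytes:
            self.flush()
        self._lines.append(encoded)
        self._size += len(encoded) + (1 if len(self._lines) > 1 else 0)

    def flush(self):
        if self._lines:
            self._send(b"\n".join(self._lines))
            self._lines = []
            self._size = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()
        return False


def udp_sender(host="127.0.0.1", port=8125):
    """Return a send callable bound to one reusable UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    return lambda payload: sock.sendto(payload, (host, port))
```

Used as `with Pipeline(udp_sender()) as p: p.incr("a"); p.incr("b")`, both counters leave the process in a single syscall.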

Further Improvements
libtelem - native library to handle consistent instrumentation, shared memory views
Unified tracing, logging and metrics

Unified Data
Lots of cron scripts and Grafana

Multiple System Syndrome
CloudWatch is the only source for some data
CloudWatch “Hold on let me log in”
“My MFA token doesn’t work, can someone else log in?”
Using different systems is distracting, delays debugging

Fewer tools
The fewer tools for triage, the better
Context switching is expensive
Put everything up front, or one click away

Either federate (Grafana plugins)
Or just copy it

dashboards dot git dot cat?

Central Monitoring and Alarm Hub
Git Monorepo
Ties in with our Salt infrastructure
Dashboards defined as Salt states, deploys like a service
Iteration to staging Grafana
Manages Grafana, Wavefront, other services

Every service gets a default dashboard with base alarms

Add
Services define extra resources they use in a Salt pillar
Can also define custom rows and targets
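
The "defaults plus declared extras" idea can be sketched as a small function. Lyft's real version is expressed as Salt states with Jinja2+YAML, so this Python is only an analogy. The row names in `DEFAULT_ROWS` and the pillar keys `resources` and `custom_rows` are hypothetical.

```python
DEFAULT_ROWS = ["requests", "errors", "latency", "host"]  # illustrative names


def build_dashboard(service, pillar=None):
    """Build a dashboard description: default rows plus pillar-defined extras."""
    pillar = pillar or {}
    # Every service gets the same default rows, in the same order.
    rows = [{"title": f"{service} {name}", "kind": name} for name in DEFAULT_ROWS]
    # Extra resources the service declares (for example a queue or a database).
    for resource in pillar.get("resources", []):
        rows.append({"title": f"{service} {resource}", "kind": resource})
    # Fully custom rows are appended as given.
    rows.extend(pillar.get("custom_rows", []))
    return {"title": service, "rows": rows}


if __name__ == "__main__":
    dash = build_dashboard("rides", {"resources": ["dynamodb"]})
    for row in dash["rows"]:
        print(row["title"])
```

Because the defaults live in one place, changing a base row or alarm updates every service's dashboard on the next deploy.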

Infrastructure or Dependent Teams Own the Monitoring

Consistent Look and Feel
Same rows, same metrics, approachable

Code Dashboard/Alarm Review

grep

Other advantages
Alarms exist on dashboards
Less copy and paste
Global refactoring
Customize alarms without learning about all the nuances

Even more features
Contains a query parser and rewriter (plug for pyPEG2)
Parse query, transform queries into alternate forms
Generate “deploy” (canaries vs. production) dashboards
Automatic staging environment dashboards
Best practices lint

What Sucks
Grafana has a UI builder
We’re making you write dashboards as Jinja2+YAML
Small workaround tooling: Grafana management script
python tools/manage_dashboard.py

What Sucks
Monorepo was great to bootstrap and iterate
Poor in all the classic monorepo ways

Query languages, alarming, and UX are far too difficult for non-experts, and we shouldn’t expect our users to become experts in the field

Enrichment

Trust and Self Defense
Will the dashboard load? Can I trust the data? Can you take it down?

Provide visibility, out of band, of monitoring health
Updates every 15s
Updates … a few hours later?

Add smart limits
lyft/statsrelay samples/aggregates to limit maximum outgoing statsd event rate
Limits cardinality of counters/timers/gauges
my.metric.0x293fa3a93…
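
A cardinality limit protects the pipeline from producers that embed unbounded values, such as the hex ID above, in metric names. The sketch below shows one simple policy: admit the first N distinct names and drop the rest. It is an illustration of the concept and not statsrelay's actual algorithm. The class name and counters are hypothetical.

```python
class CardinalityLimiter:
    """Admit at most max_names distinct metric names; drop the rest."""

    def __init__(self, max_names):
        self._max_names = max_names
        self._seen = set()
        self.dropped = 0  # exposed so the limiter itself can be monitored

    def admit(self, name):
        """Return True if a metric with this name may be forwarded."""
        if name in self._seen:
            return True
        if len(self._seen) >= self._max_names:
            self.dropped += 1
            return False
        self._seen.add(name)
        return True


if __name__ == "__main__":
    limiter = CardinalityLimiter(max_names=2)
    for name in ["my.metric.a", "my.metric.b", "my.metric.0x293f", "my.metric.a"]:
        print(name, limiter.admit(name))
    print("dropped:", limiter.dropped)
```

Exporting the `dropped` count as its own metric gives the out-of-band visibility into monitoring health that the previous slide asks for.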

We’re not done yet. (Is anything ever done?)
