Consistency in Monitoring and Observability of Microservices
Fighting Entropy and Delaying the Heat Death of the Universe
Yann Ramin
@theatrus - [email protected]
Monitorama 2017 PDX / @theatrus
Hi!
I’m Yann
@theatrus
Engineering Manager
Observability Team @ Lyft
Logging, TSDBs, performance profiling, low-level infrastructure
Previously from the world of embedded systems
“C is the best programming language”
Approaches and techniques to
avoid production incidents with
hundreds of micro services and
diverse teams
What are we trying to
solve?
When developers are on-call with microservices
or
scaling operational mindfulness
“I didn’t realize it was logging errors all
weekend.”
“I clicked production deploy and Jenkins
went green! [closes laptop, goes home]”
This is an opportunity
for excellence
“No one set up a PagerDuty rotation
before going to production!”
“We need alarms on thingX! Let’s copy
and paste them from my last service!”
“I have no idea what is broken.”
We routinely approach monitoring as operations
We need to abstract and modularize
Some backstory
Consistency leads to consistency
~ Today
Python, PHP, and Go
Common base libraries for each language
Hundreds of micro services, no monorepo*
Deploys frequently, sometimes “never”
Common “base” deploy, Salt (masterless), AWS
DevOps?
Teams on-call for their services
No operations
No SRE
Infrastructure team enables, not operates
System Metrics and
Pipeline
Get the basics right
Make it easy to produce metrics
(everywhere)
(safely)
Big Secret!
We use the statsd protocol
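The statsd protocol is just short text datagrams over UDP, which is why it is easy to produce metrics from anywhere. A minimal sketch of a sender (metric names and the default port 8125 are illustrative):

```python
import socket

def send_stat(name, value, stat_type, host="127.0.0.1", port=8125):
    """Emit one statsd datagram, e.g. b'requests:1|c' for a counter."""
    payload = f"{name}:{value}|{stat_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()

# Counters, timers (ms), and gauges all share the same wire format:
# send_stat("rpc.success", 1, "c")
# send_stat("rpc.latency", 23, "ms")
# send_stat("queue.depth", 42, "g")
```

Because it is fire-and-forget UDP, a misbehaving producer can't block the application, which matters when every host is emitting.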
Data you get
Service level aggregates centrally (correct histograms)
Per host data processed locally
Default 60 second period(!), option for 1 second
Monitor yourself
Send to central aggregation
Local stats
So many events
Billions per second, with aggregation/sampling
This is only ~200k metrics per second, thanks to rollups
Per-instance cardinality limits
Opt-in mechanisms for per-host and per-second data
System metrics
CollectD
Custom scripts
Even some bash functions
All to local statsrelay
What do we get?
Comprehensive system metrics
All metric producers are sandboxed, rate limited, and
monitored
No UDP spam over the network
LIFO queueing
Instrument the core libraries
Problem
Developers add metrics after it already broke once
Adding metrics on core functions is not DRY
Not all developers think in terms of metrics
class StatsdHandler(logging.Handler)
every log call now produces metrics scoped by severity!
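A minimal sketch of what such a handler can look like. Here an in-memory dict stands in for the real statsd client (an assumption; the actual emit target is not shown in the deck):

```python
import logging

class StatsdHandler(logging.Handler):
    """Count every log record by severity. The in-memory `stats` dict
    is a stand-in for a real statsd counter call (assumption)."""

    def __init__(self, stats=None):
        super().__init__()
        self.stats = stats if stats is not None else {}

    def emit(self, record):
        # One counter per severity, e.g. "log.error", "log.warning"
        key = f"log.{record.levelname.lower()}"
        self.stats[key] = self.stats.get(key, 0) + 1
```

Attached once to the root logger in the common base library, every existing `log.error(...)` call in every service becomes an alertable metric with no per-service work.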
Instrument the RPC
Record RPC inbound errors, successes, timings in core servers
Gunicorn + https://pypi.python.org/pypi/blinker + statsd
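The deck wires this up through gunicorn's hooks and blinker signals; as a generic, framework-free sketch of the same idea, a WSGI middleware can record status and latency for every request (the metric names and `record` callable are illustrative, not the actual Lyft hookup):

```python
import time

class RPCStatsMiddleware:
    """WSGI middleware recording per-request status codes and latency.
    `record(metric_name, value)` stands in for a statsd client call."""

    def __init__(self, app, record):
        self.app = app
        self.record = record

    def __call__(self, environ, start_response):
        start = time.monotonic()
        status_holder = {}

        def capture(status, headers, exc_info=None):
            # Keep the numeric status code, e.g. "200 OK" -> "200"
            status_holder["code"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capture)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            code = status_holder.get("code", "500")
            self.record(f"rpc.status.{code}", 1)
            self.record("rpc.latency_ms", elapsed_ms)
```

Because this lives in the core server setup, every service gets inbound error, success, and timing metrics for free.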
Don’t be afraid of monkey
patching
gunicorn.conf
It’s also the worst.
(not gunicorn)
What do we get?
Consistent Measurement
Consistent Visibility
Point to Point Debugging
Unified Tracing
Deploy Time Standards
orca
Salt module for “Orchestration”
Provisions all remote resources a service needs during deploy
Interacts with PagerDuty, makes sure a PD service is created
Makes sure an on-call schedule is associated
Otherwise blocks production deploy
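The gate itself is simple: look up the PagerDuty service and its on-call schedule, and refuse to continue if either is missing. A minimal sketch, where the lookup callables are assumptions standing in for real PagerDuty API calls (the actual orca module is Salt-based and not shown here):

```python
class DeployBlocked(Exception):
    """Raised when production-deploy prerequisites are missing."""

def check_paging_setup(service, get_pd_service, get_oncall_schedule):
    """Block a production deploy unless PagerDuty is fully wired up.
    `get_pd_service` and `get_oncall_schedule` are injected stand-ins
    for PagerDuty API lookups (assumption)."""
    pd = get_pd_service(service)
    if pd is None:
        raise DeployBlocked(f"no PagerDuty service exists for {service}")
    if get_oncall_schedule(pd) is None:
        raise DeployBlocked(f"no on-call schedule is attached for {service}")
    return True
```

Making the deploy tooling enforce this removes the "no one set up a PagerDuty rotation" failure mode entirely: you simply cannot reach production without one.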
Go
https://github.com/lyft/gostats
We also don’t discuss PHP.
Gripes
statsd calls are not cheap (syscalls)
Various workarounds (.pipeline())
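The pipeline workaround amortizes the syscall cost: buffer several metrics and emit them as one newline-separated datagram, so N stats cost one `sendto` instead of N. A minimal sketch of the idea (not the actual client library):

```python
import socket

class StatsPipeline:
    """Batch several stats into one UDP datagram. Mirrors the
    .pipeline() idea: N metrics, one syscall."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.buf = []

    def incr(self, name, value=1):
        self.buf.append(f"{name}:{value}|c")
        return self  # allow chaining

    def send(self):
        if self.buf:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            # statsd servers accept newline-separated metrics per datagram
            sock.sendto("\n".join(self.buf).encode(), self.addr)
            sock.close()
            self.buf = []
```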
Further Improvements
libtelem - native library to handle consistent
instrumentation, shared memory views
Unified tracing, logging and metrics
Unified Data
Lots of cron scripts and Grafana
Multiple System Syndrome
CloudWatch is the only source for some data
CloudWatch
“Hold on let me log in”
“My MFA token doesn’t work, can someone else log in?”
Using different systems is distracting, delays debugging
Fewer tools
The fewer tools needed for triage, the better
Context switching is expensive
Put everything up front, or one click away
Either federate (Grafana plugins)
Or just copy it
dashboards
dot git
dot cat?
Central Monitoring and
Alarm Hub
Git Monorepo
Ties in with our Salt infrastructure
Dashboards defined as Salt states, deploys like a service
Iteration to staging Grafana
Manages Grafana, Wavefront, other services
Every service gets a default
dashboard with base alarms
Add
Services define extra resources they use in a Salt pillar
Can also define custom rows and targets
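Conceptually, the generator merges a fixed set of base rows with whatever the service declares. A minimal sketch (the real system renders Jinja2+YAML Salt states; the row names and target patterns here are illustrative assumptions):

```python
import json

def build_dashboard(service, extra_rows=()):
    """Render a default dashboard definition for a service, appending
    any custom rows the service declares. Row titles and metric target
    patterns are illustrative, not the actual Lyft conventions."""
    base_rows = [
        {"title": "RPC", "targets": [f"stats.{service}.rpc.*"]},
        {"title": "Errors", "targets": [f"stats.{service}.log.error"]},
        {"title": "System", "targets": [f"stats.{service}.system.*"]},
    ]
    return json.dumps({"title": service, "rows": base_rows + list(extra_rows)})
```

Because the base rows come from one template, renaming a core metric is a single-place change that refactors every service's dashboard at once.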
Infrastructure or Dependent Teams Own
the Monitoring
Consistent Look and Feel
Same rows, same metrics, approachable
Other advantages
Alarms exist on dashboards
Less copy and paste
Global refactoring
Customize alarms without learning about all the nuances
Even more features
Contains a query parser and rewriter
(plug for pyPEG2)
Parse query, transform queries into alternate forms
Generate “deploy” (canaries vs. production) dashboards
Automatic staging environment dashboards
Best practices lint
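The deploy-dashboard transform boils down to rewriting each production query into its canary or staging twin. The real system parses queries with pyPEG2; as a deliberately minimal sketch of the transformation, a regex substitution on an assumed "production" path segment:

```python
import re

def to_deploy_query(query, env="canary"):
    """Rewrite a production metric query into its canary/staging twin.
    The actual implementation uses a pyPEG2 query parser; this regex
    version, and the 'production' prefix, are simplifying assumptions."""
    return re.sub(r"\bproduction\b", env, query)

# to_deploy_query("stats.production.myservice.rpc.errors")
# -> "stats.canary.myservice.rpc.errors"
```

A real parser additionally lets you validate queries and lint for best practices, which plain string rewriting cannot do safely.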
What Sucks
Grafana has a UI builder
We’re making you write dashboards as Jinja2+YAML
Small workaround tooling: Grafana management script
python tools/manage_dashboard.py
What Sucks
Monorepo was great to bootstrap and iterate
Poor in all the classic monorepo ways
Query languages, alarming, and UX are far too difficult
for non-experts, and we shouldn’t expect our
users to become experts in the field
Enrichment
Trust and Self Defense
Will the dashboard load?
Can I trust the data?
Can you take it down?
Provide visibility, out of band,
of monitoring health
“Updates every 15s” vs. “Updates … a few hours later?”
Add smart limits
lyft/statsrelay samples/aggregates to limit maximum outgoing
statsd event rate
Limits cardinality of counters/timers/gauges
my.metric.0x293fa3a93…
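A cardinality limit protects the pipeline from exactly that failure: a bug that embeds a unique value (a pointer, a request ID) in a metric name. A minimal sketch of a per-instance limiter (not the actual lyft/statsrelay implementation):

```python
class CardinalityLimiter:
    """Admit metric names until a per-instance limit is reached, then
    drop anything new, so runaway unique names can't flood the TSDB."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = set()
        self.dropped = 0

    def allow(self, name):
        if name in self.seen:
            return True           # known name: always passes
        if len(self.seen) < self.limit:
            self.seen.add(name)   # new name under the cap: admit it
            return True
        self.dropped += 1         # over the cap: drop and count it
        return False
```

Exposing `dropped` as its own metric gives teams an out-of-band signal that they are hitting the limit, instead of silently losing data.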