Measuring infrastructure resiliency
Alex Yarmula (@twalex)
Monitoring and infrastructure
• Monitoring is Tier 0: it can’t be less available than the
systems it monitors
• Therefore, you can’t rely on the systems you’re monitoring
to be part of your monitoring infrastructure
• Understand which things have to be in your control
Getting the metrics out
• Instrument the code with metrics directly
• higher chance of capturing important knowledge than with after-the-fact metrics
• when the logic changes, it’s reflected in the metrics right away
• Don’t make people explicitly add/register metrics
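The auto-registration idea above can be sketched as a tiny counter registry where a metric springs into existence on first use, so nobody has to declare it ahead of time (a minimal illustration; the `Metrics` class and metric names are hypothetical, not any particular library):

```python
# Minimal sketch: counters auto-register on first use -- there is
# no explicit add/register step for the person instrumenting code.
import threading
from collections import defaultdict

class Metrics:
    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)   # unknown names start at 0

    def incr(self, name, delta=1):
        with self._lock:
            self._counters[name] += delta

    def snapshot(self):
        with self._lock:
            return dict(self._counters)

metrics = Metrics()

def handle_request(ok):
    # Instrument directly where the logic lives: when this code path
    # changes, the metrics change with it.
    metrics.incr("requests.total")
    if not ok:
        metrics.incr("requests.failed")

handle_request(True)
handle_request(False)
```

Because the registry is keyed by name, adding a new metric is just a new `incr` call at the relevant code path.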
Getting the metrics in
• What’s the metric -> monitoring path?
• Don’t make teams onboard services and hosts onto your
monitoring stack manually - make it automatic
• Choose between pushing and polling metrics
• Maximize control over sending while
minimizing failure scenarios
• Centralized collectors vs decentralized agents
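One way to read “maximize control over sending while minimizing failure scenarios” is a push-side agent with a bounded local buffer: the agent pushes on its own schedule, and a dead collector costs you the oldest samples rather than the host’s memory (a sketch; `PushAgent` and its interface are hypothetical):

```python
# Sketch of a decentralized push agent with a bounded buffer.
from collections import deque

class PushAgent:
    def __init__(self, max_buffer=1000):
        # Bounded buffer: if the collector is unreachable for a long
        # time, the oldest samples are dropped instead of exhausting
        # memory on the monitored host.
        self.buffer = deque(maxlen=max_buffer)

    def record(self, name, value, ts):
        self.buffer.append((name, value, ts))

    def flush(self, send):
        """Push everything buffered; on failure, keep the samples."""
        pending = list(self.buffer)
        try:
            send(pending)
        except OSError:
            return 0          # collector down: retry on next flush
        self.buffer.clear()
        return len(pending)
```

The same buffering logic applies whether the sink is a centralized collector or a local aggregation agent.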
Discovering the data
• Assign a name and an address to each piece of data
gathered:
“for all front-end servers, sum the 15-min load avg”
• Ability to query the data without knowing where it
resides
• Ability to perform maintenance and move data
sources
Discovering the data (cont.)
Decouple monitoring from the knowledge about
instances:
“look at all the backend Riak metrics”
vs
“see metrics coming from my-riak-01.db.startup.ca,
my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
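The decoupling can be sketched as a tiny resolver that turns a service-level target into concrete per-host addresses, so queries never name instances directly (illustrative only; the `SERVICES` map and the `service:metric` target syntax are made up, and the hosts are the example names from the slide):

```python
# Sketch: queries address a service, not a host list. A resolver
# maps the service to whatever hosts currently serve it, so hosts
# can be added, removed, or moved without touching any query.
SERVICES = {
    "riak.backend": ["my-riak-01.db.startup.ca",
                     "my-riak-02.db.startup.ca",
                     "my-riak-03.db.startup.ca"],
}

def resolve(target):
    """'service:metric' -> per-host metric addresses."""
    service, metric = target.split(":", 1)
    return [f"{host}/{metric}" for host in SERVICES[service]]
```

“Look at all the backend Riak metrics” then becomes `resolve("riak.backend:...")`, and a host move is a one-line change to the service map.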
Discovering events
• Metrics are only a projection of reality onto your
measuring devices
• Keep track of higher-level events to provide context
for the metrics
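A minimal sketch of event tracking, assuming a simple annotation log that can be intersected with the time window of a suspicious metric (the event kinds and helper names are illustrative):

```python
# Sketch: record higher-level events (deploys, failovers, config
# pushes) next to the metrics so a spike can be correlated with
# what actually happened at that time.
events = []

def annotate(ts, kind, detail):
    events.append({"ts": ts, "kind": kind, "detail": detail})

def events_in(start, end):
    """Events overlapping the metric window under investigation."""
    return [e for e in events if start <= e["ts"] <= end]

annotate(100, "deploy", "web v2.31")
annotate(250, "failover", "db primary -> replica")
```

When a graph looks wrong between two timestamps, `events_in` answers “what changed?” without grepping deploy logs by hand.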
Going faster
• Invest in high-frequency metrics (1s, 10s)
• Doesn’t have to go through the main monitoring path
• can stream through WebSockets
• can have high-frequency collection
• High frequency != low latency
• consider store -> batch -> forward
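The store -> batch -> forward idea can be sketched as: collect 1s samples locally, then ship them in batches so high collection frequency does not force low-latency delivery (hypothetical `BatchForwarder`; a real pipeline would also flush on a timer, not only on batch size):

```python
# Sketch of store -> batch -> forward for high-frequency samples.
class BatchForwarder:
    def __init__(self, batch_size, forward):
        self.batch_size = batch_size
        self.forward = forward     # callable that ships one batch
        self.store = []

    def collect(self, sample):
        self.store.append(sample)            # store
        if len(self.store) >= self.batch_size:
            batch, self.store = self.store, []
            self.forward(batch)              # batch -> forward
```

Sampling every second while forwarding every few seconds keeps the per-sample overhead of the main monitoring path out of the hot loop.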
Visualizations
“Go to these pages and look at the graphs”
vs
Mathematica-style “enter your functions”
• “Live” graphs: auto-update
periodically as the data arrives
• Allow overlaying time-series
• Consider log scale to compare results
of different magnitudes
Problems at Twitter scale
Quality of service
• Your workload is write-many read-few
• Protect against heavy reads: evaluate per-query
costs, kill expensive queries, track hot requests
• Protect against heavy writes: impose quotas,
prioritize or drop
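Per-query cost evaluation might look like the following sketch, where cost is approximated by the number of datapoints a query would touch and over-budget queries are rejected up front (the cost model and budget are purely illustrative, not how any particular system prices queries):

```python
# Sketch: estimate a read query's cost before running it, and
# refuse queries that would scan too much data.
def query_cost(n_series, time_range_s, resolution_s):
    # Rough proxy: how many datapoints the query would touch.
    return n_series * (time_range_s // resolution_s)

def admit(n_series, time_range_s, resolution_s, budget=1_000_000):
    """Admission control for heavy reads."""
    return query_cost(n_series, time_range_s, resolution_s) <= budget
```

The same estimate can drive the other protections on the slide: killing queries whose running cost exceeds the estimate, and tracking which requests are repeatedly hot.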
Quality of service (cont.)
• Full disclosure: this is where we’re heading now
• Some metrics are more equal than others. If you
know which ones matter more, you can
protect them better
• At a certain size, you can’t put all of your metrics in
one place. You have to isolate for reliability, then
federate
Quality of service (cont.)
• Do all read queries have the same SLA? Can some
be answered in minutes and not seconds?
• Can some writes be aggregated offline? Can they
be approximated and then improved (e.g. the Lambda
Architecture model)?
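The approximate-then-improve pattern can be sketched as a cheap sampled aggregate served first, later replaced by the exact offline result (illustrative only; real Lambda-style systems use streaming and batch jobs, not these toy functions):

```python
# Sketch: speed layer answers cheaply and approximately from a
# sample; batch layer later computes the exact value offline.
def speed_layer(samples, every=10):
    # Approximate sum: read every 10th sample and scale up.
    subset = samples[::every]
    return sum(subset) * every

def batch_layer(samples):
    # Exact sum, computed offline when resources allow.
    return sum(samples)
```

The dashboard shows the speed-layer number immediately and silently upgrades to the batch-layer number once it lands.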
Consistency
• Do alerts and dashboards have to be separate?
• In a metrics-driven organization, how does one
discover metrics that matter the most?
• In a service-oriented environment, who monitors
your dependencies? Who gets paged?
Configuration
• In a service-oriented architecture, it’s tempting to
configure each system separately
• Filtering and aggregating metrics at multiple
levels adds complexity
• Attempt to consolidate the most important pieces of
configuration early on
Line charts and scale
Summary
• Invest in the reliability of your monitoring stack to
increase the reliability of your company’s services
• Visualize to reduce cognitive cost
• Optimize for the shortest path to get to the data
• Make an effort to get to high-frequency data
Thanks!
• My team is hiring, come talk to us later
twitter.com/jobs
• There are a few Observability team members here
tonight
• Share your monitoring experiences with us!