Monitoring and Observability @ Twitter

Measuring the infrastructure resiliency @twalex Alex Yarmula

Monitoring and infrastructure
• Monitoring is Tier 0: can’t be less available than the systems under monitoring
• Therefore, can’t rely on systems you’re monitoring to be part of your infrastructure
• Understand which things have to be in your control

Getting the metrics out
• Instrument the code with metrics directly
  • higher chance of capturing the important knowledge than after-the-fact metrics
  • when the logic changes, it’s reflected in metrics right away
• Don’t make people explicitly add/register metrics
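A minimal sketch of the "no explicit registration" idea, assuming a hypothetical in-process `Metrics` registry (the class, the `store_item` function, and the metric names are illustrative, not Twitter's library). A counter comes into existence the first time the code increments it, so instrumentation sits next to the logic it describes:

```python
import threading
from collections import defaultdict


class Metrics:
    """Hypothetical registry: a metric exists as soon as it is first used."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)

    def incr(self, name, delta=1):
        # No prior registration step: unknown names start at zero.
        with self._lock:
            self._counters[name] += delta

    def snapshot(self):
        # Copy under the lock so an exporter sees a consistent view.
        with self._lock:
            return dict(self._counters)


metrics = Metrics()


def store_item(item):
    """Illustrative business logic, instrumented inline."""
    metrics.incr("api_store/requests")
    if not item:
        metrics.incr("api_store/rejected_empty")
        return False
    metrics.incr("api_store/stored")
    return True


store_item("a")
store_item("")
```

If a new branch is added to `store_item`, its counter appears in `snapshot()` on first use, with no separate registration to forget.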

Getting the metrics in
• What’s the metric -> monitoring path?
• Don’t onboard services and hosts onto your monitoring stack - make it automatic
• Choose between pushing metrics vs polling metrics
• Maximizing the control over sending while minimizing failure scenarios
• Centralized collectors vs decentralized agents

Discovering the data
• Assign a name and address for each piece of data gathered, e.g. “for all front end servers, sum 15-min load avg”
• Ability to query the data without knowing where it resides
• Ability to perform maintenance, move data sources

Discovering the data (cont.)
Decouple monitoring from the knowledge about instances: “look at all the backend Riak metrics” vs “see metrics coming from my-riak-01.db.startup.ca, my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
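A small sketch of querying by logical name rather than by instance. The `MEMBERS` directory, the `SAMPLES` store, the host names, and the `query` function are all hypothetical stand-ins; a real system would back them with service discovery and a time-series store:

```python
# Hypothetical membership directory: logical group -> current instances.
MEMBERS = {
    "frontend": ["fe-01", "fe-02", "fe-03"],
}

# Hypothetical latest samples, keyed by (instance, metric name).
SAMPLES = {
    ("fe-01", "load_avg_15m"): 0.5,
    ("fe-02", "load_avg_15m"): 1.25,
    ("fe-03", "load_avg_15m"): 0.75,
}


def query(group, metric, aggregate=sum):
    """Aggregate a metric over whatever instances the group has right now."""
    values = [
        SAMPLES[(host, metric)]
        for host in MEMBERS[group]
        if (host, metric) in SAMPLES  # skip members with no data yet
    ]
    return aggregate(values)


# "For all front end servers, sum 15-min load avg"
total = query("frontend", "load_avg_15m")
```

The caller never names a host, so adding, removing, or moving instances changes only the membership directory, not the dashboards or alerts built on the query.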

Discovering events
• Metrics are only the projection of reality into your measuring devices
• Keep track of higher-level events to provide context for the metrics

Going faster
• Invest in high-frequency metrics (1s, 10s)
• Doesn’t have to go through the main monitoring path
  • can stream through WebSockets
  • can have high-frequency collection
• High frequency != low latency
  • consider store -> batch -> forward
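One way to picture "store -> batch -> forward", as a sketch only: samples are recorded at full resolution but shipped in batches. The `BatchingForwarder` class and its `forward` callback are assumptions for illustration, not part of any named system:

```python
class BatchingForwarder:
    """Store samples locally, forward them downstream in fixed-size batches."""

    def __init__(self, forward, batch_size=10):
        self._forward = forward        # callable that receives a list of samples
        self._batch_size = batch_size
        self._pending = []

    def store(self, timestamp, value):
        # Every sample is kept, so resolution stays at the collection rate.
        self._pending.append((timestamp, value))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        # Hand off the whole pending batch in one call.
        if self._pending:
            batch, self._pending = self._pending, []
            self._forward(batch)


received = []
forwarder = BatchingForwarder(received.append, batch_size=10)

# Collect one sample per second for 25 seconds.
for second in range(25):
    forwarder.store(second, second * 2)
```

After the loop, two batches of ten 1-second samples have been forwarded and five samples are still pending: the data is high-frequency, but a sample can wait up to a full batch before it is visible downstream.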

Going faster

Visualizations
“Go to these pages and look at the graphs” vs Mathematica-style “enter your functions”

Example queries
expmovingavg(10, 0.9, ts(sum, myservice, myservice.hostgroupA, api_store))
groupby(metric, sum(_), ts(sum, myservice, members(myservice.hostgroupA), api_store))
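The slides do not define the query functions, so the following is only a generic sketch of the two kinds of operation the queries suggest: smoothing a time series with an exponential moving average, and summing a metric across the members of a host group. The function names, the `alpha` parameter, and the sample data are assumptions and are not meant to mirror the arguments of `expmovingavg` or `groupby`:

```python
def exp_moving_avg(values, alpha):
    """Standard EMA: s[0] = x[0], s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def sum_across_members(series_by_host):
    """Pointwise sum of equal-length series, one series per group member."""
    return [sum(points) for points in zip(*series_by_host.values())]


# Hypothetical per-host request counts for one metric.
api_store = {
    "host-a": [10, 20, 30],
    "host-b": [0, 0, 0],
}

group_total = sum_across_members(api_store)
smoothed = exp_moving_avg(group_total, alpha=0.5)
```

Composing small functions like this is the "enter your functions" style: the user builds the view from primitives instead of picking a pre-made graph.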

Visualizations
Side-by-side data representations:
• “Live” graphs: auto-update periodically as the data arrives
• Allow overlaying time-series
• Consider log scale to compare results of different magnitudes

Visualizations

Problems at Twitter scale

Quality of service
• Your workload is write-many read-few
• Protect against heavy reads: evaluate per-query costs, kill expensive queries, track hot requests
• Protect against heavy writes: impose quotas, prioritize or drop
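A minimal sketch of "impose quotas, prioritize or drop" on the write path. The `WriteQuota` class, the per-window budget, and the rule that critical writes bypass the quota are illustrative policy assumptions, not a description of Twitter's system:

```python
class WriteQuota:
    """Per-service write budget for one accounting window (hypothetical policy)."""

    def __init__(self, limits):
        self._limits = dict(limits)  # service -> max points per window
        self._used = {}              # service -> points admitted this window

    def admit(self, service, points, critical=False):
        """Return True if the write is accepted, False if it should be dropped."""
        used = self._used.get(service, 0)
        limit = self._limits.get(service, 0)  # unknown services get no budget
        if critical or used + points <= limit:
            self._used[service] = used + points
            return True
        return False

    def reset_window(self):
        # Called by a timer at the start of each accounting window.
        self._used.clear()


quota = WriteQuota({"api": 100})
accepted_first = quota.admit("api", 80)                     # within budget
accepted_second = quota.admit("api", 30)                    # would exceed 100: dropped
accepted_critical = quota.admit("api", 30, critical=True)   # prioritized
```

Dropping at admission time keeps one noisy service from degrading writes for everyone else, at the cost of losing some of that service's low-priority points.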

Quality of service (cont.)
• Full disclosure: this is where we’re heading now
• Some metrics are more equal than others. If you know which ones are more important, you can protect them better
• At a certain size, you can’t put all of your metrics in one place. Have to isolate for reliability, then federate

Quality of service (cont.)
• Do all read queries have the same SLA? Can some be answered in minutes and not seconds?
• Can some writes be aggregated offline? Can they be approximated and then improved (e.g. Lambda Architecture model)?

Consistency
• Do alerts and dashboards have to be separate?
• In a metrics-driven organization, how does one discover metrics that matter the most?
• In a service-oriented environment, who monitors your dependencies? Who gets paged?

Configuration
• In a service-oriented architecture, it’s tempting to configure each system separately
• Filtering metrics and aggregating metrics on multiple levels add to complexity
• Attempt to consolidate the most important pieces of configuration early on

Line charts and scale vs

Summary
• Invest in reliability of your monitoring stack to increase reliability of your company’s services
• Visualize to reduce cognitive cost
• Optimize for the shortest path to get to the data
• Make an effort to get to high-frequency data

Thanks!
• My team is hiring, come talk to us later: twitter.com/jobs
• There are a few Observability team members here tonight
• Share your monitoring experiences with us!
