Slide 1


Measuring the infrastructure resiliency
Alex Yarmula (@twalex)

Slide 2


Monitoring and infrastructure
• Monitoring is Tier 0: it can’t be less available than the systems it monitors
• Therefore, you can’t rely on the systems you’re monitoring to be part of your monitoring infrastructure
• Understand which things have to be in your control

Slide 3


Getting the metrics out
• Instrument the code with metrics directly
  • higher chance of capturing the important knowledge than with after-the-fact metrics
  • when the logic changes, it’s reflected in the metrics right away
• Don’t make people explicitly add/register metrics
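One way to avoid explicit registration is to let metrics create themselves on first use. A minimal sketch, with a hypothetical API rather than any particular library:

```python
from collections import defaultdict
from threading import Lock

class MetricRegistry:
    """Counters register themselves on first use; there is no explicit
    add/register step for the instrumenting code."""
    def __init__(self):
        self._counters = defaultdict(int)
        self._lock = Lock()

    def incr(self, name, delta=1):
        # The first increment implicitly creates the metric.
        with self._lock:
            self._counters[name] += delta

    def snapshot(self):
        with self._lock:
            return dict(self._counters)

registry = MetricRegistry()

def handle_request():
    registry.incr("requests_total")  # instrumented directly in the code path

handle_request()
handle_request()
print(registry.snapshot())  # {'requests_total': 2}
```

Because the counter lives in the same code path as the logic it measures, a logic change cannot silently drift away from its metric.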

Slide 4


Getting the metrics in
• What’s the metric -> monitoring path?
• Don’t onboard services and hosts onto your monitoring stack by hand - make it automatic
• Choose between pushing metrics and polling metrics
• Maximize control over sending while minimizing failure scenarios
• Centralized collectors vs decentralized agents

Slide 5


Discovering the data
• Assign a name and address to each piece of data gathered: “for all front-end servers, sum the 15-min load average”
• Ability to query the data without knowing where it resides
• Ability to perform maintenance and move data sources

Slide 6


Discovering the data (cont.)
Decouple monitoring from knowledge about specific instances: “look at all the backend Riak metrics” vs “see metrics coming from my-riak-01.db.startup.ca, my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
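As an illustration of this decoupling, a discovery layer can resolve a logical group name to whatever instances currently back it, so dashboards never hardcode hostnames. The lookup API below is a hypothetical sketch:

```python
# Hypothetical discovery data: a logical group maps to its current members.
GROUPS = {
    "backend-riak": [
        "my-riak-01.db.startup.ca",
        "my-riak-02.db.startup.ca",
        "my-riak-03.db.startup.ca",
    ],
}

def members(group):
    """Resolve a logical group to its current members at query time."""
    return GROUPS.get(group, [])

def fetch(host, metric):
    """Stand-in for a real per-host metric fetch."""
    return 1.0

def query_metrics(group, metric):
    # Callers reference the group; instances can be added, removed,
    # or replaced without touching any dashboard or alert.
    return {host: fetch(host, metric) for host in members(group)}

print(len(query_metrics("backend-riak", "riak_get_latency")))  # 3
```

Moving a data source then means updating the discovery layer once, not editing every consumer.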

Slide 7


Discovering events
• Metrics are only a projection of reality onto your measuring devices
• Keep track of higher-level events to provide context for the metrics

Slide 8


Going faster
• Invest in high-frequency metrics (1 s, 10 s)
• High-frequency data doesn’t have to go through the main monitoring path
  • it can stream through WebSockets
  • it can have its own high-frequency collection
• High frequency != low latency
  • consider store -> batch -> forward
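The store -> batch -> forward idea can be sketched as a local buffer that ships samples in batches, so 1 s collection does not imply one network call per second. Class and parameter names below are illustrative assumptions:

```python
import time

class BatchForwarder:
    """Store high-frequency samples locally, then forward them in batches.
    High frequency != low latency: samples taken every second can be
    shipped every N samples (or seconds) in one call."""
    def __init__(self, batch_size=30):
        self.buffer = []
        self.batch_size = batch_size
        self.sent_batches = []  # stand-in for the downstream sink

    def record(self, name, value, ts=None):
        # Store: cheap local append on the hot path.
        self.buffer.append((name, value, ts or time.time()))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Forward: one "network call" per batch, not per sample.
        if self.buffer:
            self.sent_batches.append(list(self.buffer))
            self.buffer.clear()

fwd = BatchForwarder(batch_size=3)
for v in [0.1, 0.2, 0.3, 0.4]:
    fwd.record("load_1s", v)
print(len(fwd.sent_batches))  # 1 batch shipped; one sample still buffered
```

A real agent would also flush on a timer and on shutdown so the tail of the buffer is never lost.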

Slide 9


Going faster

Slide 10


Visualizations
“Go to these pages and look at the graphs” vs Mathematica-style “enter your functions”

Slide 11


Example queries
expmovingavg(10, 0.9, ts(sum, myservice, myservice.hostgroupA, api_store))
groupby(metric, sum(_), ts(sum, myservice, members(myservice.hostgroupA), api_store))
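The expmovingavg call above presumably applies exponential smoothing to the summed series; the exact argument semantics of this query language aren’t given here, so the following is only an illustration of the underlying math, with the decay weight as the sole parameter:

```python
def expmovingavg(decay, series):
    """Exponentially weighted moving average: the running average keeps
    `decay` of its old value and takes (1 - decay) from each new point."""
    avg, smoothed = None, []
    for x in series:
        avg = x if avg is None else decay * avg + (1 - decay) * x
        smoothed.append(avg)
    return smoothed

# A step from 10 to 20 is absorbed gradually rather than instantly.
print(expmovingavg(0.9, [10, 10, 20, 20]))
```

With decay near 1 the average reacts slowly and suppresses noise; lower values track the raw series more closely.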

Slide 12


Visualizations
Side-by-side data representations:

Slide 13


Visualizations
• “Live” graphs: auto-update periodically as the data arrives
• Allow overlaying time-series
• Consider a log scale to compare results of different magnitudes

Slide 14


Problems at Twitter scale

Slide 15


Quality of service
• Your workload is write-many, read-few
• Protect against heavy reads: evaluate per-query costs, kill expensive queries, track hot requests
• Protect against heavy writes: impose quotas, prioritize or drop
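Evaluating per-query costs before execution might look like the following sketch, where cost is approximated as datapoints scanned. The formula and limit are assumptions for illustration, not Twitter’s actual policy:

```python
class QueryBudget:
    """Reject queries whose estimated cost exceeds a fixed budget,
    protecting the write path from a few heavy reads."""
    def __init__(self, max_points=1_000_000):
        self.max_points = max_points

    def estimate_cost(self, n_series, duration_s, step_s):
        # Rough per-query cost: number of datapoints that must be scanned.
        return n_series * (duration_s // step_s)

    def check(self, n_series, duration_s, step_s):
        cost = self.estimate_cost(n_series, duration_s, step_s)
        if cost > self.max_points:
            raise RuntimeError(f"query too expensive: {cost} points")
        return cost

budget = QueryBudget(max_points=100_000)
# 500 series over 1 hour at 60 s resolution: 500 * 60 = 30,000 points.
print(budget.check(n_series=500, duration_s=3600, step_s=60))  # 30000
```

The same estimate can drive softer responses too, such as queueing an expensive query onto a slower tier instead of rejecting it outright.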

Slide 16


Quality of service (cont.)
• Full disclosure: this is where we’re heading now
• Some metrics are more equal than others. If you know which ones are more important, you can protect them better
• At a certain size, you can’t put all of your metrics in one place. You have to isolate for reliability, then federate

Slide 17


Quality of service (cont.)
• Do all read queries have the same SLA? Can some be answered in minutes rather than seconds?
• Can some writes be aggregated offline? Can they be approximated and then refined (e.g. the Lambda Architecture model)?

Slide 18


Consistency
• Do alerts and dashboards have to be separate?
• In a metrics-driven organization, how does one discover the metrics that matter most?
• In a service-oriented environment, who monitors your dependencies? Who gets paged?

Slide 19


Configuration
• In a service-oriented architecture, it’s tempting to configure each system separately
• Filtering metrics and aggregating metrics on multiple levels add to complexity
• Attempt to consolidate the most important pieces of configuration early on

Slide 20


Line charts and scale

Slide 21


Summary
• Invest in the reliability of your monitoring stack to increase the reliability of your company’s services
• Visualize to reduce cognitive cost
• Optimize for the shortest path to the data
• Make an effort to get to high-frequency data

Slide 22


Thanks!
• My team is hiring - come talk to us later: twitter.com/jobs
• There are a few Observability team members here tonight
• Share your monitoring experiences with us!