Monitoring and Observability @ Twitter

Talk about monitoring, August 2014.

Alex Yarmula

August 14, 2014


  1. Monitoring and infrastructure
     • Monitoring is Tier 0: it can’t be less available than the systems under monitoring
     • Therefore, you can’t rely on systems you’re monitoring to be part of your monitoring infrastructure
     • Understand which things have to be in your control
  2. Getting the metrics out
     • Instrument the code with metrics directly
     • Higher chance of capturing the important knowledge than with after-the-fact metrics
     • When the logic changes, it’s reflected in the metrics right away
     • Don’t make people explicitly add/register metrics
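The "don't make people register metrics" point can be sketched as a registry that creates a counter on first use. This is a minimal, hypothetical illustration (the `Metrics` class and its method names are invented here, not Twitter's actual library):

```python
# Hypothetical sketch: a registry that creates and registers metrics on
# first use, so code never has to "add" a metric explicitly.
from collections import defaultdict

class Metrics:
    def __init__(self):
        self._counters = defaultdict(int)

    def incr(self, name, delta=1):
        # First use implicitly registers the counter at zero.
        self._counters[name] += delta

    def sample(self):
        # Snapshot for the exporter; registration happened as a side effect.
        return dict(self._counters)

metrics = Metrics()

def handle_request(ok):
    metrics.incr("requests")        # instrumented directly in the code path
    if not ok:
        metrics.incr("failures")    # new logic => a new metric, automatically

handle_request(True)
handle_request(False)
print(metrics.sample())  # {'requests': 2, 'failures': 1}
```

Because the counter springs into existence at the call site, a logic change that adds a new branch also adds its metric in the same commit.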
  3. Getting the metrics in
     • What’s the metric -> monitoring path?
     • Don’t onboard services and hosts onto your monitoring stack manually - make it automatic
     • Choose between pushing metrics vs polling metrics
     • Maximize control over sending while minimizing failure scenarios
     • Centralized collectors vs decentralized agents
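The push-vs-poll choice can be made concrete with a toy model (a sketch with invented class names, not any real collector's API): in the poll model the collector controls timing; in the push model the service does.

```python
# Hypothetical sketch contrasting polling (collector-driven) with
# pushing (service-driven) metric delivery.
class Service:
    def __init__(self, name):
        self.name = name
        self.counters = {"requests": 0}

    def sample(self):                      # poll: the collector calls this
        return (self.name, dict(self.counters))

    def push(self, collector):             # push: the service calls the collector
        collector.receive(self.name, dict(self.counters))

class Collector:
    def __init__(self):
        self.store = {}

    def receive(self, name, counters):
        self.store[name] = counters

    def poll_all(self, services):          # centralized collector polls everyone
        for svc in services:
            name, counters = svc.sample()
            self.store[name] = counters

svc = Service("web-01")
svc.counters["requests"] = 7

pull = Collector()
pull.poll_all([svc])          # collector decides when data moves

push = Collector()
svc.push(push)                # service decides when data moves

print(pull.store == push.store)  # True: same data, different control flow
```

The data ends up identical either way; what differs is who controls the schedule and which side fails when the other is down.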
  4. Discovering the data
     • Assign a name and address to each piece of data gathered: “for all front end servers, sum 15-min load avg”
     • Ability to query the data without knowing where it resides
     • Ability to perform maintenance and move data sources
  5. Discovering the data (cont.)
     • Decouple monitoring from the knowledge about instances: “look at all the backend Riak metrics” vs “see metrics coming from my-riak-01.db.startup.ca, my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
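The decoupling above amounts to resolving a logical role through a discovery layer instead of hardcoding hostnames. A minimal sketch, assuming an invented `DISCOVERY` mapping and per-host metric store (hostnames taken from the slide):

```python
# Hypothetical sketch: queries name a role ("riak"), and a discovery layer
# resolves the role to whatever hosts currently fill it.
DISCOVERY = {  # in reality this would come from a service-discovery system
    "riak": ["my-riak-01.db.startup.ca",
             "my-riak-02.db.startup.ca",
             "my-riak-03.db.startup.ca"],
}

METRICS = {    # latest sample, keyed by (host, metric)
    ("my-riak-01.db.startup.ca", "get_latency_ms"): 4,
    ("my-riak-02.db.startup.ca", "get_latency_ms"): 6,
    ("my-riak-03.db.startup.ca", "get_latency_ms"): 5,
}

def query(role, metric, agg=max):
    """Aggregate a metric over whatever hosts currently fill the role."""
    hosts = DISCOVERY[role]  # maintenance and moves only change this mapping
    return agg(METRICS[(h, metric)] for h in hosts)

print(query("riak", "get_latency_ms"))       # worst-case latency: 6
print(query("riak", "get_latency_ms", sum))  # total: 15
```

Replacing `my-riak-02` with a new host changes only the `DISCOVERY` entry; every dashboard and alert written against the role keeps working.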
  6. Discovering events
     • Metrics are only the projection of reality into your measuring devices
     • Keep track of higher-level events to provide context for the metrics
  7. Going faster
     • Invest in high-frequency metrics (1s, 10s)
     • Doesn’t have to go through the main monitoring path
       • Can stream through WebSockets
       • Can have high-frequency collection
     • High frequency != low latency
       • Consider store -> batch -> forward
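The store -> batch -> forward idea can be sketched in a few lines (a hypothetical `BatchForwarder`, not a real client): high-frequency samples are buffered locally and shipped in batches, trading delivery latency for throughput.

```python
# Hypothetical sketch of store -> batch -> forward: 1s/10s samples are
# stored in a local buffer and forwarded in batches.
class BatchForwarder:
    def __init__(self, sink, batch_size=3):
        self.sink = sink            # stands in for the main monitoring path
        self.batch_size = batch_size
        self.buffer = []            # "store"

    def record(self, sample):
        self.buffer.append(sample)  # high-frequency writes land here cheaply
        if len(self.buffer) >= self.batch_size:
            self.flush()            # "batch" boundary reached

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # "forward" one batch
            self.buffer.clear()

sent = []
fwd = BatchForwarder(sent, batch_size=3)
for v in [1, 2, 3, 4]:
    fwd.record(v)
fwd.flush()
print(sent)  # [[1, 2, 3], [4]]
```

Nothing about collecting every second forces delivering every second; the batch boundary is where high frequency and low latency part ways.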
  8. Visualizations
     • “Go to these pages and look at the graphs” vs Mathematica-style “enter your functions”
  9. Visualizations (cont.)
     • “Live” graphs: auto-update periodically as the data arrives
     • Allow overlaying time-series
     • Consider log scale to compare results of different magnitudes
  10. Quality of service
     • Your workload is write-many, read-few
     • Protect against heavy reads: evaluate per-query costs, kill expensive queries, track hot requests
     • Protect against heavy writes: impose quotas, prioritize or drop
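One way to evaluate per-query costs and kill expensive queries is to estimate cost up front and reject anything over budget. A minimal sketch; the cost model (points scanned = series × range / step) and all names here are assumptions, not a description of Twitter's system:

```python
# Hypothetical sketch of read-side protection: estimate a query's cost
# before running it and reject anything over a fixed budget.
class TooExpensive(Exception):
    pass

def estimated_cost(n_series, time_range_s, step_s):
    # Assumed cost model: points scanned = series * (range / step).
    return n_series * (time_range_s // step_s)

def run_query(n_series, time_range_s, step_s, budget=100_000):
    cost = estimated_cost(n_series, time_range_s, step_s)
    if cost > budget:
        raise TooExpensive(f"query would scan {cost} points")
    return cost  # placeholder for actually executing the query

print(run_query(100, 3600, 60))       # 6000 points: allowed
try:
    run_query(10_000, 86400, 10)      # 86,400,000 points: killed up front
except TooExpensive as e:
    print("rejected:", e)
```

The same gate generalizes to the write side: replace the cost estimate with a per-writer quota and drop or deprioritize over-quota traffic instead of raising.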
  11. Quality of service (cont.)
     • Full disclosure: this is where we’re heading now
     • Some metrics are more equal than others. If you know which ones are more important, you can protect them better
     • At a certain size, you can’t put all of your metrics in one place. You have to isolate for reliability, then federate
  12. Quality of service (cont.)
     • Do all read queries have the same SLA? Can some be answered in minutes and not seconds?
     • Can some writes be aggregated offline? Can they be approximated and then improved (e.g. the Lambda Architecture model)?
  13. Consistency
     • Do alerts and dashboards have to be separate?
     • In a metrics-driven organization, how does one discover the metrics that matter the most?
     • In a service-oriented environment, who monitors your dependencies? Who gets paged?
  14. Configuration
     • In a service-oriented architecture, it’s tempting to configure each system separately
     • Filtering metrics and aggregating metrics on multiple levels adds to complexity
     • Attempt to consolidate the most important pieces of configuration early on
  15. Summary
     • Invest in the reliability of your monitoring stack to increase the reliability of your company’s services
     • Visualize to reduce cognitive cost
     • Optimize for the shortest path to get to the data
     • Make an effort to get to high-frequency data
  16. Thanks!
     • My team is hiring, come talk to us later: twitter.com/jobs
     • There are a few Observability team members here tonight
     • Share your monitoring experiences with us!