Monitoring and infrastructure • Monitoring is Tier 0: can’t be less available than the systems under monitoring • Therefore, can’t rely on systems you’re monitoring to be part of your infrastructure • Understand which things have to be in your control
Getting the metrics out • Instrument the code with metrics directly • higher chance of capturing the important knowledge than after-the-fact metrics • when the logic changes, it’s reflected in metrics right away • Don’t make people explicitly add/register metrics
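A minimal sketch of direct instrumentation without explicit registration. The `Metrics` class, the `timed` decorator, and the metric names are all hypothetical, not from any particular library; the point is only that a metric comes into existence the first time the code touches it.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """Hypothetical in-process registry: metrics are created on first use."""

    def __init__(self):
        self.counters = defaultdict(int)   # name -> running count
        self.timings = defaultdict(list)   # name -> list of durations (seconds)

    def incr(self, name, value=1):
        # No registration step: the counter appears when first incremented.
        self.counters[name] += value

    def timed(self, name):
        """Decorator that records call count and duration for a function."""
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    self.timings[name].append(time.perf_counter() - start)
                    self.incr(name + ".calls")
            return wrapper
        return decorator


metrics = Metrics()


@metrics.timed("checkout.handle")
def handle_checkout(cart):
    # The metric sits next to the logic it describes, so a change to the
    # branch below is reflected in the metric right away.
    if not cart:
        metrics.incr("checkout.empty_cart")
        return 0
    return sum(cart)
```

Because the counter lives inside the branch it measures, removing or changing that branch forces whoever edits the code to deal with the metric too.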
Getting the metrics in • What’s the metric -> monitoring path? • Don’t onboard services and hosts onto your monitoring stack - make it automatic • Choose between pushing metrics vs polling metrics • Maximizing the control over sending while minimizing failure scenarios • Centralized collectors vs decentralized agents
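The push vs poll choice can be sketched with one toy collector that accepts both. Everything here (the `Collector` class, the host names, the idea of a target being a plain callable) is an illustrative assumption; a real system would talk over the network.

```python
class Collector:
    """Hypothetical central collector supporting both delivery styles."""

    def __init__(self):
        self.samples = []       # (host, metric name, value)
        self.unreachable = []   # hosts whose poll failed

    def receive(self, host, metrics):
        # Push path: the monitored process decides when to send.
        for name, value in metrics.items():
            self.samples.append((host, name, value))

    def poll(self, targets):
        # Poll path: the collector decides when to read. `targets` maps a
        # host name to a zero-argument callable returning its metrics.
        for host, read in targets.items():
            try:
                self.receive(host, read())
            except OSError:
                # A failed poll is itself a signal that the host may be down.
                self.unreachable.append(host)
```

One trade-off this makes visible: when polling, the collector learns immediately that a target did not answer, whereas with pushing, a silent host has to be detected by noticing the absence of data.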
Discovering the data • Assign a name and address to each piece of data gathered, so that a query can read “for all front end servers, sum 15-min load avg” • Ability to query the data without knowing where it resides • Ability to perform maintenance and move data sources
Discovering the data (cont.) Decouple monitoring from the knowledge about instances: “look at all the backend Riak metrics” vs “see metrics coming from my-riak-01.db.startup.ca, my-riak-02.db.startup.ca, my-riak-03.db.startup.ca”
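One way to picture this decoupling is to attach tags to every series and query by tag rather than by host name. The sketch below is an assumption about how such a lookup could work; the `Series` type, the tag keys (`role`, `tier`), the metric name, and the sample values are made up for illustration. Only the host names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    name: str     # metric name, e.g. "load.15min" (hypothetical)
    tags: tuple   # (key, value) pairs describing where the data came from
    value: float  # latest sample, made-up numbers


def select(series, name, **tags):
    """Return every series with this name whose tags include all given pairs."""
    wanted = set(tags.items())
    return [s for s in series if s.name == name and wanted <= set(s.tags)]


def riak(host):
    return (("host", host), ("role", "riak"), ("tier", "backend"))


store = [
    Series("load.15min", riak("my-riak-01.db.startup.ca"), 0.5),
    Series("load.15min", riak("my-riak-02.db.startup.ca"), 0.25),
    Series("load.15min", riak("my-riak-03.db.startup.ca"), 1.0),
    Series("load.15min", (("host", "web-01"), ("tier", "frontend")), 2.0),
]

# "Look at all the backend Riak metrics", with no host names in the query.
backend_riak = select(store, "load.15min", role="riak", tier="backend")
total_load = sum(s.value for s in backend_riak)
```

Adding a fourth Riak node, or renaming an existing one, changes the data in `store` but not the query.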
Discovering events • Metrics are only the projection of reality into your measuring devices • Keep track of higher-level events to provide context for the metrics
Going faster • Invest in high-frequency metrics (1s, 10s) • Doesn’t have to go through the main monitoring path • can stream through WebSockets • can have high-frequency collection • High frequency != low latency • consider store -> batch -> forward
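A minimal sketch of store -> batch -> forward, assuming a hypothetical `BatchForwarder` that buffers samples locally and hands them to a caller-supplied `forward` function. The thresholds and the injectable `clock` are illustrative choices, not part of the talk.

```python
import time


class BatchForwarder:
    """Buffer samples locally, then forward them in batches."""

    def __init__(self, forward, max_items=100, max_age_s=10.0,
                 clock=time.monotonic):
        self.forward = forward      # callable that receives a list of samples
        self.max_items = max_items  # flush when this many samples are buffered
        self.max_age_s = max_age_s  # ...or when the oldest sample is this old
        self.clock = clock
        self.buffer = []
        self.oldest = None

    def record(self, name, value):
        # Store: every sample is kept with its own timestamp, so resolution
        # stays high even though delivery is delayed.
        now = self.clock()
        if not self.buffer:
            self.oldest = now
        self.buffer.append((now, name, value))
        if (len(self.buffer) >= self.max_items
                or now - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self):
        # Forward: send the whole batch at once and start a new buffer.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.forward(batch)
```

With one sample per second and `max_age_s=10.0`, the data has 1s resolution but may reach the backend up to roughly 10s late, which is the sense in which high frequency is not the same as low latency.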
Visualizations • “Live” graphs: auto-update periodically as the data arrives • Allow overlaying time-series • Consider log scale to compare results of different magnitudes
Quality of service • Your workload is write-many read-few • Protect against heavy reads: evaluate per-query costs, kill expensive queries, track hot requests • Protect against heavy writes: impose quotas, prioritize or drop
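Both protections can be sketched in a few lines. The names (`WriteQuota`, `check_query`), the cost formula, and the default budget are assumptions chosen for illustration; a real system would measure cost in whatever unit its storage layer actually charges.

```python
class WriteQuota:
    """Hypothetical per-owner write quota over a fixed window."""

    def __init__(self, limit_per_window):
        self.limit = limit_per_window
        self.used = {}      # owner -> points accepted in the current window
        self.dropped = {}   # owner -> points dropped so far

    def admit(self, owner, n_points=1):
        used = self.used.get(owner, 0)
        if used + n_points > self.limit:
            # Over quota: drop, but keep count so the drop is visible.
            self.dropped[owner] = self.dropped.get(owner, 0) + n_points
            return False
        self.used[owner] = used + n_points
        return True

    def reset_window(self):
        self.used.clear()


class QueryTooExpensive(Exception):
    pass


def estimate_cost(n_series, range_s, step_s):
    # Assumed cost model: number of data points the query has to read.
    return n_series * (range_s // step_s)


def check_query(n_series, range_s, step_s, budget=1_000_000):
    """Reject a read before running it if its estimated cost exceeds budget."""
    cost = estimate_cost(n_series, range_s, step_s)
    if cost > budget:
        raise QueryTooExpensive(f"estimated {cost} points, budget {budget}")
    return cost
```

Estimating cost before execution is the cheaper of the two read protections; killing a query that is already running needs cooperation from the storage layer and is not shown here.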
Quality of service (cont.) • Full disclosure: this is where we’re heading now • Some metrics are more equal than others. If you know which ones are more important, you can protect them better • At a certain size, you can’t put all of your metrics in one place. You have to isolate for reliability, then federate
Quality of service (cont.) • Do all read queries have the same SLA? Can some be answered in minutes and not seconds? • Can some writes be aggregated offline? Can they be approximated and then improved (e.g. Lambda Architecture model)?
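A toy illustration of "approximate, then improve", loosely in the spirit of the Lambda Architecture: a fast path keeps a cheap sampled estimate, and an offline pass over the raw log later replaces it with the exact figure. The `LambdaCounter` class, its every-Nth-event sampling, and the window labels are assumptions made for this sketch.

```python
class LambdaCounter:
    """Count events per time window: approximate now, exact later."""

    def __init__(self, sample_every=10):
        self.sample_every = sample_every
        self.seen = 0
        self.log = []     # raw events, kept for offline recomputation
        self.speed = {}   # window -> approximate count (fast path)
        self.batch = {}   # window -> exact count (offline path)

    def write(self, window):
        self.log.append(window)
        self.seen += 1
        # Fast path: only every Nth event updates the live view, and it is
        # scaled up to stand in for the events that were skipped.
        if self.seen % self.sample_every == 0:
            self.speed[window] = self.speed.get(window, 0) + self.sample_every

    def run_batch(self):
        # Offline path: recompute exact counts from the full log.
        exact = {}
        for window in self.log:
            exact[window] = exact.get(window, 0) + 1
        self.batch = exact

    def read(self, window):
        # Serve the exact value once the batch pass has covered the window.
        if window in self.batch:
            return self.batch[window], "exact"
        return self.speed.get(window, 0), "approximate"
```

A reader with a loose SLA can wait for the `"exact"` answer, while a dashboard that needs something immediately can use the `"approximate"` one.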
Consistency • Do alerts and dashboards have to be separate? • In a metrics-driven organization, how does one discover metrics that matter the most? • In a service-oriented environment, who monitors your dependencies? Who gets paged?
Configuration • In a service-oriented architecture, it’s tempting to configure each system separately • Filtering and aggregating metrics on multiple levels adds complexity • Attempt to consolidate the most important pieces of configuration early on
Summary • Invest in reliability of your monitoring stack to increase reliability of your company’s services • Visualize to reduce cognitive cost • Optimize for the shortest path to get to the data • Make an effort to get to high-frequency data
Thanks! • My team is hiring; come talk to us later or see twitter.com/jobs • There are a few Observability team members here tonight • Share your monitoring experiences with us!