

Building and running a world-class observability function

Observability teams as a centralized function within SRE or IT Operations are a relatively recent phenomenon. These teams are responsible for managing the monitoring and observability toolset and for empowering developer and engineering teams to push the right data into the systems so they get back the information they need. Central observability teams must walk a fine line between controlling the cardinality and cost that come with data growth and providing a complete enough dataset for quick troubleshooting and diagnostics.

This session will explore the people and process side of observability with lessons learned from the community, including:

Internal KPIs and metrics: How do you measure the success of your observability practice?

Tagging best practices: How do you get tags and labels under control and working to your advantage?

Taming cardinality: What processes can help keep cardinality under control?

Roles and responsibilities: Who is responsible for running centralized observability functions? How do you know you need a dedicated team and how many people should be on it?

Centralized vs. distributed teams: How do you balance individual service teams' unique requirements with the need for a centralized, consistent view for SREs?


Rob Skillington

May 17, 2021

Transcript

1. About me
   • Rob Skillington, Co-Founder and CTO @ Chronosphere
   • M3 Open Source Creator
   • OpenMetrics Contributor
   • Twitter: @roskilli
2. How we think about observability: Know, Triage, Understand, Remediate
   • Know: How quickly do I get notified when something is wrong? Is it BEFORE a user/customer has a bad experience?
   • Triage: How easily and quickly can I triage it to know what the impact is?
   • Understand and Remediate: How do I find the underlying cause so I can fix the problem?
3. Responsibilities of an observability function
   • Define: Define monitoring standards and practices. This includes documentation, guides, and best practices.
   • Deliver: Provide monitoring data to eng teams. Must be in a format they are familiar with (e.g., Grafana).
   • Measure: Ensure reliability and stability of monitoring solutions. Establish and maintain trust that systems will be available when needed.
   • Manage: Manage tooling and storage of metrics data. Make it simple: if it takes a ninja, people won't use it.
4. People/roles: building the observability function
   • Who? Can someone internal take this on, or do we need to go outside the org? A mix of people with internal knowledge and context, and others with experience at established SRE practices.
   • Where? Should they report into SRE, Operations, Platform, Eng? It depends! If there is an SRE practice, that's the logical place. Otherwise, some centralized function that can act as a governance overlay.
   • When? At what point do we need to make this a full-time role, vs. someone's part-time job? If you are missing SLIs/SLOs. If customers are finding out about problems before you. If you are ramping up scale in cloud-native.
5. Distributed SRE/observability
   SRE and observability sit with the engineering teams, supported by a central operations/DevOps org.
   Challenges with this approach:
   • Teams using many different tools and processes
   • No global view across everything
   • Teams reinventing the wheel
   [Diagram: Teams A-D each with their own SRE, under engineering leadership, supported by cloud ops/tech ops]
6. Centralized observability team
   ➔ Define monitoring standards and practices
   ➔ Provide monitoring data to engineering teams
   ➔ Manage tooling and storage of metrics data
   ➔ Ensure reliability and stability of monitoring solutions
   [Diagram: one central observability team serving Teams A-D, each with different needs: has high-cardinality data, needs a 10 sec scrape interval, uses StatsD for metrics, terrible at tagging]
7. Staying focused with a small team
   • What you emit and how you consume the telemetry data is critical
     ◦ More data is not necessarily better
   • Teach the DevOps teams how to fish
     ◦ According to a GitLab survey, in 2021 only 26% of developers instrumented their code for production monitoring (up from 18% in 2020)
     ◦ Delegate to get an outsized impact with a small team
   • Don't build if you don't have to!
     ◦ COTS tools can help
     ◦ Look for the most user-friendly vendor or open source tooling
     ◦ Minimal effort, and maximal future proofing
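To make "teach the teams to fish" concrete, here is a minimal Go sketch of the kind of instrumentation a service team might add using the Prometheus client library. The service name, route, and label set are illustrative assumptions, not from the talk; the point is that label values stay bounded so the central pipeline isn't flooded with cardinality.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical checkout-service metrics; names and label sets are illustrative.
var (
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "checkout_http_requests_total",
		Help: "HTTP requests handled by the checkout service.",
		// Labels are kept to a small, bounded set of values (no user IDs, no raw URLs).
	}, []string{"method", "route", "status_class"})

	requestLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "checkout_http_request_duration_seconds",
		Help:    "Request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"route"})
)

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(requestLatency.WithLabelValues("/checkout"))
	defer timer.ObserveDuration()

	// ... business logic would go here ...

	requestsTotal.WithLabelValues(r.Method, "/checkout", "2xx").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", checkoutHandler)
	// The /metrics endpoint is what the central observability pipeline scrapes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

A pattern like this is something a small central team can document once and delegate, rather than instrumenting every service itself.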
8. Internal KPIs and metrics: meta-metrics
   How many FTEs on the team?
   • Core function of SRE and DevOps
   • Initially 2 in 100, then 5 in 500, and eventually grew to 50 in 2,500
   • Not a lot of good benchmarks out there
   How much should we be investing?
   • At Uber it grew to 8% of infrastructure cost at its peak, then was hyper-optimized to 3%
   How do you measure success?
   • Are there reasonable SLOs/SLIs in place, and are they being met?
   • Internal and external NPS
   • Error rate and speed of mitigation
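One way to make the "are SLOs being met" question measurable is plain error-budget arithmetic: the budget is simply (1 - SLO target) of the window. A small sketch; the targets and 30-day window are illustrative, not from the talk.

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns the downtime (or bad-request time) an SLO target
// tolerates over a window: (1 - target) * window.
func errorBudget(target float64, window time.Duration) time.Duration {
	return time.Duration((1 - target) * float64(window))
}

func main() {
	window := 30 * 24 * time.Hour // a 30-day SLO window
	for _, target := range []float64{0.99, 0.999, 0.9999} {
		fmt.Printf("SLO %.2f%% over 30d -> error budget %v\n",
			target*100, errorBudget(target, window).Round(time.Minute))
	}
	// A 99.90% target over 30 days allows roughly 43 minutes of budget, for example.
}
```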
9. Growth in monitoring data at Uber
   • 1 product in 3 cities: 1 monolith, 10s of hosts
   • 10s of products in 100 cities: 200 services, 1000s of VMs
   • 100s of products in 600 cities: 4,000 microservices, 1,000,000s of containers
   Metrics & Monitoring Team (founded 2015): data growth @ 1.5B datapoints/s, 10X cost efficiency, 99.99% reliability
10. Cloud native impact
   [Diagram: Cloud (IaaS, VM-based), 2008 - 2018: a monolith mapped 1:1 to virtual machines, with legacy monitoring built to handle that level of complexity. Cloud Native (microservices and containers), 2018 - ?: microservices mapped M:M to containers across product/service, use cases, experiments, clients, geography, and business dimensions, needing cloud-native monitoring built for that complexity.]
   • More to monitor = larger volume of monitoring data
   • Ephemeral infrastructure = non-uniform storage and usage patterns for monitoring data
   • Greater interdependencies = higher cardinality of monitoring data + greater need to correlate and connect infrastructure to applications to business metrics
   • The same scale of environment results in a much higher monitoring bill
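To see why "higher cardinality" follows from the cloud-native shift: the number of time series for a single metric is bounded by the product of its label cardinalities, so adding ephemeral, fine-grained dimensions grows it multiplicatively. A back-of-the-envelope sketch with made-up numbers (not Uber's actual figures):

```go
package main

import "fmt"

// seriesUpperBound is the worst-case number of time series for one metric:
// the product of the distinct values of each label. Real counts are lower
// because not every label combination occurs, but growth is still multiplicative.
func seriesUpperBound(labelCardinalities map[string]int) int {
	total := 1
	for _, n := range labelCardinalities {
		total *= n
	}
	return total
}

func main() {
	// Illustrative numbers only.
	vmEra := map[string]int{"service": 1, "host": 50, "endpoint": 30}
	cloudNative := map[string]int{"service": 200, "pod": 5000, "endpoint": 30, "region": 10}

	fmt.Println("monolith on VMs:            ", seriesUpperBound(vmEra))       // 1,500
	fmt.Println("microservices in containers:", seriesUpperBound(cloudNative)) // 300,000,000
}
```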
11. Ok, now what? Tips for taming data growth
   • Scenario 1: Tensions between too much and not enough information
   • Scenario 2: Cardinality of metrics is too much to manage at the micro level
   • Scenario 3: Ownership needed beyond the observability team: it's a team effort!
12. Upgrades for building a reliable, efficient function
   • Increase automation where possible
     - Automated onboarding of new use cases for monitoring and observability
     - Create a consistent experience for everyone and a global view
   • Monitor the monitor
     - Alerting on your metrics system's uptime/availability
   • Control data flow
     - Rate and query limiters
     - Retain data at the granularity necessary for business needs, not just the firehose
   • Safe way to experiment and quickly iterate
     - Sandboxing and safe operating procedures for changes to the pipeline
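As a hedged sketch of the "rate limiters" idea on the write path, a per-team ingest budget can be expressed with Go's golang.org/x/time/rate package. The tenant names, limits, and reject-rather-than-sample policy are all illustrative assumptions, not the talk's implementation.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// ingestLimiter gives each tenant (team) a datapoints-per-second budget.
// Batches over budget are rejected here; a real pipeline might sample or queue instead.
type ingestLimiter struct {
	perTenant map[string]*rate.Limiter
}

func newIngestLimiter(limits map[string]rate.Limit) *ingestLimiter {
	l := &ingestLimiter{perTenant: make(map[string]*rate.Limiter)}
	for tenant, limit := range limits {
		// Allow a burst of roughly one second of traffic.
		l.perTenant[tenant] = rate.NewLimiter(limit, int(limit))
	}
	return l
}

// allow reports whether a batch of n datapoints from tenant fits its budget.
func (l *ingestLimiter) allow(tenant string, n int) bool {
	lim, ok := l.perTenant[tenant]
	if !ok {
		return false // unknown tenants must be onboarded first
	}
	return lim.AllowN(time.Now(), n)
}

func main() {
	limiter := newIngestLimiter(map[string]rate.Limit{
		"team-a": 10000, // datapoints/sec
		"team-b": 500,
	})
	fmt.Println(limiter.allow("team-a", 2000)) // true: within budget
	fmt.Println(limiter.allow("team-b", 2000)) // false: over budget
}
```

The same shape of limiter can sit in front of the query path as well, which is where expensive high-cardinality reads tend to hurt shared infrastructure.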
13. Key takeaways
   • Your observability practice should focus on: How do I get notified when something is wrong? How easily and quickly can I triage it to know what the impact is? How do I find the underlying cause so I can fix the problem?
   • Understand the tradeoffs of centralized versus decentralized observability functions
   • Plan for taming data growth and cardinality at the start
   • Uplevel your function with automation and safe experimentation