
Datacenter Observability with OSS

Allee
April 06, 2020

Transcript

  1. Monitoring Anti-Patterns: Tool Obsession leads to a frustrating attempt to justify the favored tool's usage in an out-of-context environment. Letting go of our identity as one who favors that tool gives us the perspective to properly process the emotional side of what is otherwise a purely logical decision. Monitoring-as-a-job: are you able to generate service reporting for a service you don't build or operate? Source: Practical Monitoring, Mike Julian, 2017
  2. Monitoring Anti-Patterns: Using Monitoring as a Crutch. "Oh, of course something will break during the release! We'll add monitoring after the fact." Manual Configuration: I'm sure we all can agree that automation is awesome. That's why it's surprising to me how often monitoring configuration is manual. The question I never want to hear is "Can you add this to monitoring?" Your monitoring should be 100% automated. Services should self-register instead of someone having to add them. Source: Practical Monitoring, Mike Julian, 2017
  3. Monitoring Design Patterns: Composable Monitoring. The principle is simple: use multiple specialized tools and couple them loosely together, forming a monitoring "platform" for data collection, data storage, visualization, analytics and reporting, and alerting. Monitor from the User Perspective: the best place to add monitoring first is at the point(s) users interact with your app. A user doesn't care about the implementation details of your app, such as how many Apache nodes you're running or how many workers are available for jobs. Your users care about whether the application works. Source: Practical Monitoring, Mike Julian, 2017
  4. Monitoring Design Patterns: Buy, Not Build. I've noticed a natural progression of monitoring tooling and culture as it matures within a company. Companies often start off running purely SaaS services. This allows them to quickly get monitoring up and running, and gives them the ability to focus their efforts on building a great product. Source: Practical Monitoring, Mike Julian, 2017
  5. What were our goals for the refactor?
     ▪ Lean towards high-cardinality solutions; we need to arbitrarily analyze subsets of data for compliance
     ▪ Make sure cost grows proportionally with business volume
     ▪ Increase investment in service discovery
     ▪ Reduce gaps between our legacy and first-class monitoring systems
     ▪ Lean towards open source projects, but for now stay experimental
  6. Host tags or labels

     module "ec2_cluster" {
       source         = "terraform-aws-modules/ec2-instance/aws"
       name           = "coredns"
       instance_count = 5
       instance_type  = "t2.micro"
       monitoring     = true

       tags = {
         Terraform   = "true"
         Environment = "prod"
         Image_ID    = "coredns:latest"
         Roles       = "coredns,internal"   # AWS tag values must be strings
         Monitoring  = "on"
         Maintenance = "enabled"
       }
     }

  7. What's the difference?

     module "ec2_cluster" {
       source = "terraform-aws-modules/ec2-instance/aws"

       tags = {
         Terraform   = "true"
         Environment = "dev"
         Image_ID    = "prometheus:latest"
         Roles       = "prometheus,internal"   # AWS tag values must be strings
         Monitoring  = "on"
         Maintenance = "enabled"
       }
     }

  8. How to evaluate the tools in your infrastructure
     ▪ Push, Pull, Both, or Neither
     ▪ Measurement resolution
     ▪ Data Storage
     ▪ Analysis Capabilities
     ▪ Notification Capabilities
     ▪ Integration Capabilities
     ▪ Scaling Model
  9. Agents: responsible for collecting the performance data from applications, processes, hosts, etc. Source: https://openapm.io/landscape/agents
  10. Libraries: instrumentation frameworks and libraries can be used to collect different types of performance data from applications. Source: https://openapm.io/landscape/libraries
  11. Transport: serves as the pipelines for data. This includes messaging systems, proprietary protocols, and exchange formats. Source: https://openapm.io/landscape/transport
  12. Storage: persisting the collected performance data to disk (or memory) is the responsibility of storage components. Source: https://openapm.io/landscape/storage
  13. Landscape
      Theory: Compose the monitoring platform from as many specialized tools as needed.
      Problem: Too many tools make it hard to get a clear picture of how they differ.
      Practice: Deprecate old duct tape in favor of maintained, community-supported approaches where reasonable. Build our own if a tool is too opinionated. Purchase products that fit within our cost plans.
  14. Choosing the right service discovery
      Theory: Auto-register your services on the host.
      Problem: Easily discover targets, with their associated tags, across your datacenter.
      Practice: Setting up agents is easy. Choosing one with DNS support helps, but making registration and deregistration work across your ecosystem may be hard.
  15. Prometheus scrape configuration using Consul service discovery:

      - job_name: 'consul_$DATACENTER'
        consul_sd_configs:
          - server: '127.0.0.1:8500'
            services: ['{{ services_to_scrape | join('\',\'') }}']
            datacenter: '$DATACENTER'
        relabel_configs:
          # Use the Consul service name as the job label
          - source_labels: ['__meta_consul_service']
            regex: '(.*)'
            target_label: 'job'
            replacement: '$1'
          # Use the Consul node name as the instance label
          - source_labels: ['__meta_consul_node']
            regex: '(.*)'
            target_label: 'instance'
            replacement: '$1'

  16. Daemon: designed for data collection on a single host. Vector runs in the background, in its own process, collecting all data for that host. Typically data is collected from a process manager such as journald, via Vector's journald source, but it can be collected through any of Vector's sources.
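      As a minimal sketch of what a daemon-role configuration could look like (the component names and the aggregator address are illustrative assumptions, not taken from the talk):

      # vector.toml on each host (daemon role), illustrative
      [sources.host_journal]
      type = "journald"                                   # collect logs from the local journald

      [sinks.to_aggregator]
      type    = "vector"                                  # forward everything to an upstream Vector instance
      inputs  = ["host_journal"]
      address = "vector-aggregator.service.consul:9000"   # hypothetical aggregator address
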
  17. Service: treats Vector as a separate service, designed to receive data from an upstream source and fan out to one or more destinations. Typically, upstream sources are other Vector instances sending data via the vector sink, but data can be received through any of Vector's sources.
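      A sketch of the service role under the same assumptions: the aggregator below accepts events from downstream Vector daemons and fans them out to two sinks. The component names are illustrative, and the Splunk settings simply mirror the sink shown later in the deck.

      # vector.toml on the aggregator (service role), illustrative
      [sources.vector_agents]
      type    = "vector"                 # receive events sent by the daemons' vector sink
      address = "0.0.0.0:9000"

      # Fan-out: every sink listing the source in `inputs` gets a copy of the stream.
      [sinks.console_debug]
      type           = "console"
      inputs         = ["vector_agents"]
      encoding.codec = "json"

      [sinks.splunk]
      type           = "splunk_hec"
      inputs         = ["vector_agents"]
      host           = "http://prod.splunk.service.consul"
      token          = "${SPLUNK_HEC_TOKEN}"
      encoding.codec = "json"
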
  18. Guarantees: Vector attempts to make it clear which guarantees you can expect from it. All components are categorized by their targeted delivery guarantee and also by their general stability. This helps you make the appropriate tradeoffs for your use case.
  19. Guarantee types: how Vector labels the delivery and stability of its components. Best-Effort means that Vector will make a best effort to deliver each event, but cannot guarantee delivery. At-Least-Once ensures that an event received by Vector will be delivered at least once to the configured destination(s). Beta means that a feature has not met the criteria outlined in the Prod-Ready section and therefore should be used with caution in production environments.
  20. Splunk HEC sink configuration:

      [sinks.my_sink_id]
      # General
      type        = "splunk_hec"                          # required
      inputs      = ["/path/to/query.log"]                # required
      host        = "http://prod.splunk.service.consul"   # required
      token       = "${SPLUNK_HEC_TOKEN}"                 # required
      healthcheck = true                                  # optional, default
      host_key    = "prod-coredns-b"                      # optional, no default

      # Encoding
      encoding.codec = "json"                             # required
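      To pair this sink with the At-Least-Once guarantee from the previous slide, one option is to give the sink a disk buffer so events survive restarts and back-pressure. A minimal sketch, with illustrative sizing rather than values from the talk:

      # Illustrative disk buffer on the sink above: trades some throughput for durability.
      [sinks.my_sink_id.buffer]
      type      = "disk"          # spill events to disk instead of holding them only in memory
      max_size  = 104900000       # bytes of disk the buffer may use (example value)
      when_full = "block"         # apply back-pressure upstream rather than dropping events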