
Datacenter Observability with OSS

Allee
April 06, 2020

Transcript

  1. Monitoring Anti-Patterns: Tool Obsession leads to a frustrating attempt to justify the favored tool's usage in an out-of-context environment. Letting go of our identity as one who favors that tool gives us the perspective to properly process the emotional side of what is otherwise a purely logical decision. Monitoring-as-a-job: are you able to generate service reporting for a service you don't build or operate? Source: Practical Monitoring, Mike Julian, 2017
  2. Monitoring Anti-Patterns: Using Monitoring as a Crutch. "Oh, of course something will break during the release! We'll add monitoring after the fact." Manual Configuration: I'm sure we all can agree that automation is awesome. That's why it's surprising to me how often monitoring configuration is manual. The question I never want to hear is "Can you add this to monitoring?" Your monitoring should be 100% automated. Services should self-register instead of someone having to add them. Source: Practical Monitoring, Mike Julian, 2017
  3. Monitoring Design Patterns: Composable Monitoring. The principle is simple: use multiple specialized tools and couple them loosely together, forming a monitoring "platform" for data collection, data storage, visualization, analytics and reporting, and alerting. Monitor from the User Perspective: the best place to add monitoring first is at the point(s) users interact with your app. A user doesn't care about the implementation details of your app, such as how many Apache nodes you're running or how many workers are available for jobs. Your users care about whether the application works. Source: Practical Monitoring, Mike Julian, 2017
  4. Monitoring Design Patterns: Buy, Not Build. I've noticed a natural progression of monitoring tooling and culture as it matures within a company. Companies often start off running purely SaaS services. This allows them to quickly get monitoring up and running, and gives them the ability to focus their efforts on building a great product. Source: Practical Monitoring, Mike Julian, 2017
  5. What were our goals for the refactor?
     ▪ Lean towards high-cardinality solutions; we need to arbitrarily analyze subsets of data for compliance
     ▪ Make sure cost grows proportionally with business volume
     ▪ Increase investment in service discovery
     ▪ Reduce gaps between our legacy and first-class monitoring systems
     ▪ Lean towards open source projects, but for now stay experimental
  6. Host tags or labels

     module "ec2_cluster" {
       source         = "terraform-aws-modules/ec2-instance/aws"
       name           = "coredns"
       instance_count = 5
       instance_type  = "t2.micro"
       monitoring     = true

       tags = {
         Terraform   = "true"
         Environment = "prod"
         Image_ID    = "coredns:latest"
         Roles       = "coredns,internal"   # AWS tag values must be strings
         Monitoring  = "on"
         Maintenance = "enabled"
       }
     }

  7. What's the difference?

     module "ec2_cluster" {
       source = "terraform-aws-modules/ec2-instance/aws"

       tags = {
         Terraform   = "true"
         Environment = "dev"
         Image_ID    = "prometheus:latest"
         Roles       = "prometheus,internal"   # AWS tag values must be strings
         Monitoring  = "on"
         Maintenance = "enabled"
       }
     }

  8. How to evaluate the tools in your infrastructure
     ▪ Push, Pull, Both, or Neither
     ▪ Measurement resolution
     ▪ Data Storage
     ▪ Analysis Capabilities
     ▪ Notification Capabilities
     ▪ Integration Capabilities
     ▪ Scaling Model
  9. Agents: responsible for collecting the performance data from applications, processes, hosts, etc. Source: https://openapm.io/landscape/agents
  10. Libraries: instrumentation frameworks and libraries can be used to collect different types of performance data from applications. Source: https://openapm.io/landscape/libraries
  11. Transport: serves as the pipelines for data. This includes messaging systems, proprietary protocols, and exchange formats. Source: https://openapm.io/landscape/transport
  12. Storage: persisting the collected performance data to disk (or memory) is the responsibility of storage components. Source: https://openapm.io/landscape/storage
  13. Landscape
      Theory: Compose the monitoring platform from as many specialized tools as needed.
      Problem: Too many tools make it hard to get a clear picture of how they differ.
      Practice: Deprecate old duct tape in favor of maintained, community-supported approaches where reasonable. Build our own if a tool is too opinionated. Purchase products that fit within our cost plans.
  14. Choosing the right service discovery
      Theory: Auto-register your services on the host.
      Problem: Easily discover targets, with their associated tags, across your datacenter.
      Practice: Setting up agents is easy. Choosing one with DNS support helps, but making registration and deregistration work across your ecosystem may be hard.
  15. Prometheus scrape configuration using Consul service discovery:

      - job_name: 'consul_$DATACENTER'
        consul_sd_configs:
          - server: '127.0.0.1:8500'
            services: ['{{ services_to_scrape | join('\',\'') }}']
            datacenter: '$DATACENTER'
        relabel_configs:
          # Use the Consul service name as the job label
          - source_labels: ['__meta_consul_service']
            regex: '(.*)'
            target_label: 'job'
            replacement: '$1'
          # Use the Consul node name as the instance label
          - source_labels: ['__meta_consul_node']
            regex: '(.*)'
            target_label: 'instance'
            replacement: '$1'

  16. Daemon: designed for data collection on a single host. Vector runs in the background, in its own process, collecting all data for that host. Typically data is collected from a process manager such as journald, via Vector's journald source, but it can be collected through any of Vector's sources.
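      As a minimal sketch of what a daemon-role configuration could look like (the component names and the aggregator address are illustrative assumptions, not taken from the talk):

      # vector.toml on each host (daemon role), illustrative
      [sources.host_journal]
      type = "journald"                                   # collect logs from the local journald

      [sinks.to_aggregator]
      type    = "vector"                                  # forward everything to an upstream Vector instance
      inputs  = ["host_journal"]
      address = "vector-aggregator.service.consul:9000"   # hypothetical aggregator address
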
  17. Service: treats Vector as a separate service, designed to receive data from an upstream source and fan out to one or more destinations. Typically, upstream sources are other Vector instances sending data via the vector sink, but data can be received through any of Vector's sources.
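      A sketch of the service role under the same assumptions: the aggregator below accepts events from downstream Vector daemons and fans them out to two sinks. The component names are illustrative, and the Splunk settings simply mirror the sink shown later in the deck.

      # vector.toml on the aggregator (service role), illustrative
      [sources.vector_agents]
      type    = "vector"                 # receive events sent by the daemons' vector sink
      address = "0.0.0.0:9000"

      # Fan-out: every sink listing the source in `inputs` gets a copy of the stream.
      [sinks.console_debug]
      type           = "console"
      inputs         = ["vector_agents"]
      encoding.codec = "json"

      [sinks.splunk]
      type           = "splunk_hec"
      inputs         = ["vector_agents"]
      host           = "http://prod.splunk.service.consul"
      token          = "${SPLUNK_HEC_TOKEN}"
      encoding.codec = "json"
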
  18. Guarantees: Vector attempts to make it clear which guarantees you can expect from it. All components are categorized by their targeted delivery guarantee and also by their general stability. This helps you make the appropriate tradeoffs for your use case.
  19. Guarantee types: how Vector labels the delivery and stability of its components. Best-Effort means that Vector will make a best effort to deliver each event, but cannot guarantee delivery. At-Least-Once ensures that an event received by Vector will be delivered at least once to the configured destination(s). Beta means that a feature has not met the criteria outlined in the Prod-Ready section and therefore should be used with caution in production environments.
  20. Splunk HEC sink configuration:

      [sinks.my_sink_id]
      # General
      type        = "splunk_hec"                          # required
      inputs      = ["/path/to/query.log"]                # required
      host        = "http://prod.splunk.service.consul"   # required
      token       = "${SPLUNK_HEC_TOKEN}"                 # required
      healthcheck = true                                  # optional, default
      host_key    = "prod-coredns-b"                      # optional, no default

      # Encoding
      encoding.codec = "json"                             # required
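      To pair this sink with the At-Least-Once guarantee from the previous slide, one option is to give the sink a disk buffer so events survive restarts and back-pressure. A minimal sketch, with illustrative sizing rather than values from the talk:

      # Illustrative disk buffer on the sink above: trades some throughput for durability.
      [sinks.my_sink_id.buffer]
      type      = "disk"          # spill events to disk instead of holding them only in memory
      max_size  = 104900000       # bytes of disk the buffer may use (example value)
      when_full = "block"         # apply back-pressure upstream rather than dropping events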