Monitoring in Motion - ContainerCon 2016

Monitoring in Motion
Challenges in monitoring Kubernetes, containers, and dynamic infrastructure.
ContainerCon Toronto, Aug 24, 2016
Ilan Rabinovitch, Director, Technical Community, Datadog

$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events

Datadog Overview
• SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting
• We’re hiring! (www.datadoghq.com/careers/)

Monitor Everything
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

$ cat ~/.plan
1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads

Our Focus Area
Culture, Automation, Metrics, Sharing
Damon Edwards and John Willis, DevOps Day LA

Culture
“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations” - Melvin E. Conway

Follow @honest_update on Twitter

Collecting data is cheap; not having it when you need it can be expensive

Instrument all the things!

Sharing: Looping Back on Culture
• Describe the problem as your “enemy”, not each other
• Learn Together

Sharing
Using and sharing the same metrics and measurements across teams is key to avoiding misunderstandings.

Source: http://bit.ly/1SvvbuP

Source: http://bit.ly/1RQRsXW

Operational Complexity Increases with...
• Number of things to measure
• Velocity of change

https://www.datadoghq.com/docker-adoption/

How much do we measure?
1 instance
• 10 metrics from cloud providers
1 operating system (e.g., Linux)
• 100 metrics
~50 metrics per application
N containers
• 150*N metrics

Operational Complexity
100 instances
500 containers

Operational Complexity: Scale
160 metrics per host
800 metrics per host (assuming 5 containers per host)

Operational Complexity: Scale
100 instances
80,000 metrics (assuming 5 containers per host)
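
The scale figures on these slides can be reproduced with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and the 800 metrics-per-host figure is taken from the slide as given, since the deck does not spell out how it is derived.

```python
# Per-host metric counts quoted on the slides.
CLOUD_PROVIDER_METRICS = 10   # per instance
OS_METRICS = 100              # per operating system
APP_METRICS = 50              # approximate, per application

# Quoted on the slide for a host running 5 containers (taken as given).
SLIDE_METRICS_PER_CONTAINERIZED_HOST = 800


def metrics_per_host() -> int:
    """Metrics for one instance running one OS and one application."""
    return CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS


def fleet_metrics(per_host: int, instances: int) -> int:
    """Total metrics across a fleet, given a per-host figure."""
    return per_host * instances


baseline = metrics_per_host()
total = fleet_metrics(SLIDE_METRICS_PER_CONTAINERIZED_HOST, instances=100)
print(baseline, total)
```

The baseline works out to 160 metrics per host and the fleet total to 80,000, matching the slides.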

Metrics Overload!

Source: Datadog

Source: http://bit.ly/1qFylWK

Open Questions
• Where is my container running?
• What is the capacity of my cluster?
• What port is my app running on?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?

Source: http://bit.ly/1YxJ7Jy

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101

Finding Signal - Categorizing Your Metrics

Examples: NGINX - Metrics
Work Metrics:
• Requests Per Second
• Request Time
• Error Rates (4xx or 5xx)
• Success (2xx)
Resource Metrics:
• Disk I/O
• Memory
• CPU
• Queue Length

Examples: NGINX - Events
• Configuration Change
• Code Deployment
• Service Started / Stopped

Examples: Events

When to let a sleeping engineer lie?

When to alert?

Recurse until you find root cause

What to demand from our monitoring tooling?

Cryptic Alerts W H A T ?

EVERY ALERT MUST BE ACTIONABLE

Host Centric

Service Centric

Static vs Dynamic
Static configurations tracking dynamic infrastructure are not a recipe for success.

Query Based Monitoring
“What’s the average throughput of application:nginx per version?”
“Alert me when one of my pods from replication controller:foo is not behaving like the others.”
“Show me the rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2….”
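
To make the first query concrete, here is a minimal sketch of tag-based filtering and grouping over in-memory samples. The sample data, metric names, and the `average_by` helper are all hypothetical; this is not Datadog's query language or API, only an illustration of querying by tag rather than by host.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (metric name, value, tags).
SAMPLES = [
    ("nginx.requests_per_s", 120.0,
     {"application": "nginx", "version": "1", "datacenter": "us-east"}),
    ("nginx.requests_per_s", 80.0,
     {"application": "nginx", "version": "1", "datacenter": "eu-west"}),
    ("nginx.requests_per_s", 200.0,
     {"application": "nginx", "version": "2", "datacenter": "us-east"}),
    ("redis.ops_per_s", 900.0,
     {"application": "redis", "version": "3", "datacenter": "us-east"}),
]


def average_by(samples, metric, filters, group_by):
    """Average `metric` over samples whose tags match `filters`,
    grouped by the value of the `group_by` tag."""
    groups = defaultdict(list)
    for name, value, tags in samples:
        if name != metric:
            continue
        if any(tags.get(key) != wanted for key, wanted in filters.items()):
            continue
        groups[tags.get(group_by)].append(value)
    return {key: mean(values) for key, values in groups.items()}


# "What's the average throughput of application:nginx per version?"
result = average_by(
    SAMPLES, "nginx.requests_per_s", {"application": "nginx"}, "version"
)
print(result)
```

Because the query names tags instead of hosts, it keeps working as containers come and go; narrowing to one data center is just one more entry in `filters`.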

Getting at the metrics…

Resource Metrics
Utilization:
• CPU (user + system)
• memory
• i/o
• network traffic
Saturation:
• throttling
• swap
Error:
• Network Errors (receive vs transmit)

Container Events
• Starting / Stopping Containers
• Scaling Events for Underlying Instances
• Deploying a new container build

How do we get at the upper layers?

Getting at the Metrics
                CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files    Yes           Yes              Some          Yes, in 1.6.1+
stats command   Basic         Basic            No            Basic
API             Yes           Yes              Some          Yes

Pseudo-files
• Provide visibility into container metrics via the file system.
• Generally under:
  /cgroup/<resource>/docker/$CONTAINER_ID/
  or
  /sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/

Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # time spent running processes since boot
> system 966 # time spent executing system calls since boot

Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)
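
A collector could read these pseudo-files roughly as follows. This is a sketch under a few assumptions: the cgroup v1 path layout shown on the slides (other layouts, including cgroup v2, differ), `throttled_time` reported in nanoseconds as the 12.12 seconds annotation implies, and helper names that are invented for illustration.

```python
from pathlib import Path


def parse_stat(text: str) -> dict:
    """Parse 'key value' lines as found in cpuacct.stat and cpu.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(cpu_stat: dict) -> dict:
    """Derive the share of throttled periods and throttled time in seconds."""
    periods = cpu_stat.get("nr_periods", 0)
    throttled = cpu_stat.get("nr_throttled", 0)
    return {
        "throttled_ratio": throttled / periods if periods else 0.0,
        # Assumption: throttled_time is in nanoseconds.
        "throttled_seconds": cpu_stat.get("throttled_time", 0) / 1e9,
    }


def read_cpu_stat(container_id: str, root: str = "/sys/fs/cgroup") -> dict:
    """Read cpu.stat for one container (path layout assumed from the slide)."""
    path = Path(root) / "cpu" / "docker" / container_id / "cpu.stat"
    return parse_stat(path.read_text())


# Values copied from the slide, so the sketch runs without Docker.
SAMPLE = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"
summary = throttle_summary(parse_stat(SAMPLE))
print(summary)
```

With the slide's numbers, the container was throttled in 559 of 565 enforcement periods, which is the kind of saturation signal worth graphing alongside raw CPU usage.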

Docker API
• Detailed streaming metrics as JSON over the HTTP socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
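
The same request can be made from code using only the Python standard library. This is a sketch, not an official client: `UnixHTTPConnection`, `stats_path`, and `fetch_stats` are names invented here, and it assumes a Docker daemon that accepts the unversioned path from the curl example and the `stream=false` query parameter (which asks for a single snapshot instead of a continuous stream).

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def stats_path(container_id: str, stream: bool = False) -> str:
    """Build the request path for a container's stats endpoint."""
    flag = "true" if stream else "false"
    return f"/containers/{container_id}/stats?stream={flag}"


def fetch_stats(container_id: str, socket_path: str = "/var/run/docker.sock") -> dict:
    """Fetch one stats snapshot for a container as a parsed JSON dict."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", stats_path(container_id))
        response = conn.getresponse()
        if response.status != 200:
            raise RuntimeError(f"Docker API returned HTTP {response.status}")
        return json.loads(response.read())
    finally:
        conn.close()
```

Calling `fetch_stats("28d7a95f468e")` on a host with that container running would return the JSON document that the curl command streams; reading the socket normally requires root or membership in the docker group.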

STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

Side Car Containers

Aren’t we still missing a layer?

Open Questions
• What is the capacity of my cluster?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?
• Where is my container running? What port?

Service Discovery
[Diagram: a Monitoring Agent running alongside the containers, with these labeled inputs: Docker API (containers list & metadata), Orchestrator (additional metadata: tags, etc.), Config Backend (integration configurations), and host level metrics.]

Custom Metrics
• Instrument custom applications
• You know your key transactions best.
• Use async protocols like Etsy’s StatsD or DogStatsD
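
As an illustration of why these protocols are considered async, the sketch below formats a StatsD datagram and sends it over UDP without waiting for a reply. The metric name, tags, and helper functions are hypothetical; the `|#tag,tag` suffix is the DogStatsD tagging extension, and port 8125 is assumed to be where a local StatsD-compatible agent listens.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram; tags use the DogStatsD '|#' extension."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send: the application never waits on the collector."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# Hypothetical counter for a key transaction, tagged by app and version.
packet = format_metric("checkout.completed", 1, "c", ["app:web", "version:2"])
print(packet)
```

Passing `packet` to `send_metric` emits the datagram; if no agent is listening it is simply dropped, which is the trade-off that keeps instrumentation from slowing down the application.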

Source: http://bit.ly/1NoW6aj

Resources
Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/
