Monitoring in motion

Monitoring In Motion: Challenges in monitoring Kubernetes, containers, and dynamic
infrastructure. CNCF Paris Meetup, 16 Nov 2017. Haïssam Kaj

About me: Haïssam Kaj • @ha_kaj • Team lead, container
ecosystem @ Datadog

Datadog Overview • SaaS-based infrastructure and app monitoring • Open source
Agent • Time series data (metrics and events) and traces • Processing trillions of data points per day • Intelligent alerting • We’re hiring! (www.datadoghq.com/careers/)

Monitor Everything: operating systems, cloud providers, containers, orchestrators, datastores, caches, queues, and
more...

Outline 1. Intro: The Importance of Monitoring 2. The Challenge:
Monitoring Dynamic Infrastructure 3. Finding the Signal: How do we know what to monitor? 4. Wrapping up: Applying this to a Go app on Kubernetes

Collecting data is cheap; not having it when you need
it can be expensive

Instrument all the things!

Source: http://bit.ly/1SvvbuP

Source: http://bit.ly/1RQRsXW

Operational Complexity Increases with: • Number of things to measure
• Velocity of change

https://www.datadoghq.com/docker-adoption/

How much do we measure? • 1 instance: 10 metrics from
cloud providers • 1 operating system (e.g., Linux): 100 metrics • ~50 metrics per application

N containers • 150*N metrics

Operational Complexity 100 instances 700 containers

Operational Complexity: Scale • 160 metrics per host • 1120 metrics per
host, assuming 7 containers per host

Operational Complexity: Scale • 100 instances • 112,000 metrics, assuming 7
containers per host
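
The slides do not spell out how these totals are derived. One reading that reproduces the figures shown (an assumption, not the speaker's stated method) is that each container is counted like a full unit of 10 + 100 + 50 = 160 metrics. A minimal sketch of that arithmetic, with hypothetical function names:

```python
# Per-unit figures taken from the slides.
CLOUD_PROVIDER_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50


def metrics_per_unit() -> int:
    """Metrics for one instance running one application."""
    return CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS


def metrics_per_host(containers: int) -> int:
    """Assumption: every container is counted like a full unit."""
    return metrics_per_unit() * containers


def fleet_metrics(instances: int, containers_per_host: int) -> int:
    """Total metrics across all instances."""
    return instances * metrics_per_host(containers_per_host)


print(metrics_per_unit())      # 160
print(metrics_per_host(7))     # 1120
print(fleet_metrics(100, 7))   # 112000
```

The point of the exercise is the multiplication: container density and fleet size both scale the metric count linearly, so their product grows quickly.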

Metrics Overload!

Source: Datadog

Enter the Orchestrators

Open Questions • Where is my container running? • What
is the capacity of my cluster? • What port is my app running on? • What’s the total throughput of my app? • What’s its response time per tag? (app, version, region) • What’s the distribution of 5xx errors per container?

Monitoring 101

Demo setup - github.com/hkaj/demo-app

Demo setup - github.com/hkaj/demo-app [Diagram of the demo cluster: Deployments, a StatefulSet, a DaemonSet,
pods, and Services]

Finding Signal - Categorizing Your Metrics

Recurse until you find root cause

What to demand from our monitoring tooling?

Host Centric

Service Centric

Query Based Monitoring • “What’s the average throughput of application:nginx per
version?” • “Alert me when one of my pods from deployment:foo is not behaving like the others” • “Show me the rate of HTTP 500 responses from nginx” • “… across all data centers” • “… running my app version 2…”
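
To make the idea of a tag-based query concrete, here is a minimal sketch of answering "average throughput of application:nginx per version" over tagged data points. The metric name, tag keys, and sample values are hypothetical, and this is not how any particular monitoring backend implements queries.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: a metric name, a value, and free-form tags.
POINTS = [
    {"metric": "requests_per_s", "value": 120.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "eu"}},
    {"metric": "requests_per_s", "value": 80.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "us"}},
    {"metric": "requests_per_s", "value": 200.0,
     "tags": {"application": "nginx", "version": "2", "datacenter": "eu"}},
    {"metric": "requests_per_s", "value": 999.0,
     "tags": {"application": "redis", "version": "4", "datacenter": "eu"}},
]


def average_by(points, metric, group_by, **filters):
    """Average `metric` per value of the `group_by` tag, keeping only
    points whose tags match every key/value pair in `filters`."""
    groups = defaultdict(list)
    for point in points:
        tags = point["tags"]
        if point["metric"] != metric:
            continue
        if any(tags.get(key) != value for key, value in filters.items()):
            continue
        if group_by in tags:
            groups[tags[group_by]].append(point["value"])
    return {key: mean(values) for key, values in groups.items()}


result = average_by(POINTS, "requests_per_s", "version", application="nginx")
print(result)  # {'1': 100.0, '2': 200.0}
```

Nothing in the query names a host or a container ID, which is what lets it survive containers being rescheduled.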

Side Car agent

Getting at the metrics…

How do we get at the upper layers?

Resource Metrics • Utilization: CPU (user + system), memory,
I/O, network traffic • Saturation: throttling, swap • Errors: network errors (receive vs transmit)

Docker & Kubernetes Events • Starting / Stopping Containers •
Scaling Events for Underlying Instances • Deploying a new container build

Containers

STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER     CPU %  MEM USAGE/LIMIT    MEM %   NET I/O            BLOCK I/O
ecb37227ac84  0.12%  71.53 MiB/490 MiB  14.60%  900.2 MB/275.5 MB  266.8 MB/872.7 MB

Docker API • Detailed streaming metrics as JSON, served over HTTP on the Docker Unix socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
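
Each JSON document streamed by the stats endpoint carries cumulative counters, so a CPU percentage has to be derived from the delta between the current sample (`cpu_stats`) and the previous one (`precpu_stats`). The sketch below follows the calculation described in the Docker Engine API documentation; the sample payload is hypothetical and trimmed to the fields used.

```python
import json

# Hypothetical, heavily trimmed stats payload (counters in nanoseconds).
SAMPLE_JSON = """
{
  "cpu_stats": {
    "cpu_usage": {"total_usage": 200000000},
    "system_cpu_usage": 10000000000,
    "online_cpus": 2
  },
  "precpu_stats": {
    "cpu_usage": {"total_usage": 100000000},
    "system_cpu_usage": 9000000000
  }
}
"""


def cpu_percent(stats: dict) -> float:
    """CPU usage of the container between two samples, in percent."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0  # first sample or counter reset: no meaningful delta
    return cpu_delta * cpu.get("online_cpus", 1) * 100.0 / system_delta


stats = json.loads(SAMPLE_JSON)
print(cpu_percent(stats))  # 20.0
```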

Pseudo-files • Provide visibility into container metrics via the file
system. • Generally under:   /sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/ 

Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # time spent running processes since boot
> system 966 # time spent executing system calls since boot

Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)

Kubernetes

metrics - cluster • kube-state-metrics: generates metrics about the state of
Kubernetes objects (nodes, pods, services, jobs…) • etcd: network and disk stats, leader status, work metrics (consensus proposals, WAL sync…) • apiserver(s): status check, resource metrics

metrics - node • kubelet • /healthz: health check • /pods: metadata • /api/v1.3/machine/
(cAdvisor): node resources

events
LAST SEEN  NAME                             KIND        REASON                 SOURCE                     MESSAGE
22m        dd-agent-2pml8.14f45a4ece3aeca4  Pod         Killing                kubelet, gke-haissam-dl13  Killing container with id dd-agent:Need to kill Pod
21m        dd-agent-482vl.14f45a5618aea4c0  Pod         SuccessfulMountVolume  kubelet, gke-haissam-wnvn  MountVolume.SetUp succeeded for volume "cgroups"
21m        dd-agent-482vl.14f45a5632a1e86d  Pod         Pulling                kubelet, gke-haissam-wnvn  pulling image "datadog/docker-dd-agent:latest"
21m        dd-agent-482vl.14f45a5649590c91  Pod         Created                kubelet, gke-haissam-wnvn  Created container
21m        dd-agent-482vl.14f45a5650fb2dfd  Pod         Started                kubelet, gke-haissam-wnvn  Started container
22m        dd-agent.14f45a4ea0acb0c0        DaemonSet   SuccessfulDelete       daemon-set                 Deleted pod: dd-agent-2pml8
19m        nginx-deployment                 Deployment  ScalingReplicaSet      deployment-controller      Scaled down replica set nginx-569477d6d8 to 0

Aren’t we still missing a layer?

Auto Discovery [Diagram labels: Monitoring Agent • Containers • Docker API: containers
list & metadata • Kubernetes: additional metadata (tags, etc.) • Config backends: integration configurations • Host-level metrics]

Custom Metrics • Instrument custom applications • Push: StatsD,
DogStatsD • Pull: Go expvar, Prometheus, JMX, …
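
As an illustration of the push model, the sketch below builds a DogStatsD-style datagram (`name:value|type|#tag:value,...`) and sends it over UDP. The metric name and tags are hypothetical, and the address assumes an agent listening on the conventional StatsD port 8125 on localhost; in practice a client library would normally handle this.

```python
import socket


def format_datagram(name, value, metric_type="c", tags=None):
    """Build a DogStatsD-style datagram: name:value|type|#key:value,..."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(f"{key}:{val}" for key, val in tags.items())
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes an agent listens on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical counter incremented once per handled request.
    datagram = format_datagram(
        "demo_app.requests", 1, "c", {"app": "demo-app", "version": "2"}
    )
    print(datagram)  # demo_app.requests:1|c|#app:demo-app,version:2
    send_metric(datagram)
```

Because UDP is connectionless, the application does not block or fail if no agent is listening, which is one reason the push model is popular for instrumenting request paths.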

Monitoring Questions • Where is a given container running? •
What is the overall capacity of my cluster? • What port(s) are my applications running on? • What’s the total throughput of my application? • What’s its response time per tag? (app, version, data center) • What’s the distribution of 5xx errors per container? What about by data center?

Resources
• Monitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/
• Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/
• Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/
• The Power of Tagged Metrics https://www.datadoghq.com/blog/the-docker-monitoring-problem/
• 8 surprising facts about Docker Adoption https://www.datadoghq.com/docker-adoption/
• Control groups, part 4: On accounting https://lwn.net/Articles/606004/

Q&A • More questions? @ha_kaj
