Monitoring In Motion
Challenges in monitoring Kubernetes, containers,
and dynamic infrastructure.
ContainerCon Toronto
Aug 24, 2016
Ilan Rabinovitch
Director, Technical Community
Datadog
Slide 2
Slide 2 text
$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events
Slide 3
Slide 3 text
Datadog Overview
• SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting
• We’re hiring! (www.datadoghq.com/careers/)
Slide 4
Slide 4 text
Monitor Everything
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches,
Queues and more...
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
$ cat ~/.plan
1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads
Slide 7
Slide 7 text
Our Focus Area
Culture
Automation
Metrics
Sharing
Damon Edwards and John Willis
DevOps Day LA
Slide 8
Slide 8 text
Culture
“organizations which design systems ...
are constrained to produce designs
which are copies of the communication
structures of these organizations”
- Melvin E. Conway
Slide 9
Slide 9 text
No content
Slide 10
Slide 10 text
No content
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
No content
Slide 14
Slide 14 text
Follow @honest_update on Twitter
Slide 15
Slide 15 text
Collecting data is cheap;
not having it when you
need it can be expensive
Slide 16
Slide 16 text
Instrument all the things!
Slide 17
Slide 17 text
Sharing
Looping Back on Culture
Describe the problem as your
“enemy” not each other
Learn Together
Slide 18
Slide 18 text
Sharing
Using and Sharing the same
metrics and measurements
across teams is key to avoiding
misunderstandings.
Slide 19
Slide 19 text
Source: http://bit.ly/1SvvbuP
Slide 20
Slide 20 text
Source: http://bit.ly/1RQRsXW
Slide 21
Slide 21 text
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
Slide 22
Slide 22 text
https://www.datadoghq.com/docker-adoption/
Slide 23
Slide 23 text
How much do we measure?
1 instance
• 10 metrics from cloud providers
1 operating system (e.g., Linux)
• 100 metrics
~50 metrics per application
Slide 24
Slide 24 text
No content
Slide 25
Slide 25 text
How much do we measure?
1 instance
• 10 metrics from cloud providers
1 operating system (e.g., Linux)
• 100 metrics
~50 metrics per application
N containers
• 150*N metrics
Metrics Overload!
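The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate only: it assumes the per-layer counts from the slide simply add up, and the function name `metrics_per_host` is made up for illustration.

```python
# Rough metric-count estimate using the figures from the slide.
# Assumption: the per-layer counts are additive.
CLOUD_PROVIDER_METRICS = 10   # per instance
OS_METRICS = 100              # per operating system
APP_METRICS = 50              # approximate, per application


def metrics_per_host(containers: int) -> int:
    """Estimate the metric count for one instance running `containers` containers."""
    host_level = CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS
    per_container = OS_METRICS + APP_METRICS  # the slide's 150 per container
    return host_level + per_container * containers


if __name__ == "__main__":
    for n in (0, 1, 10, 50):
        print(f"{n} containers: about {metrics_per_host(n)} metrics")
```

Under these assumptions the count grows linearly with N, so a host that packs in more containers multiplies the number of series to track.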
Slide 30
Slide 30 text
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
Slide 31
Slide 31 text
Source: Datadog
Slide 32
Slide 32 text
Source: http://bit.ly/1qFylWK
Slide 33
Slide 33 text
No content
Slide 34
Slide 34 text
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
Slide 35
Slide 35 text
No content
Slide 36
Slide 36 text
No content
Slide 37
Slide 37 text
Open Questions
• Where is my container running?
• What is the capacity of my cluster?
• What port is my app running on?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?
Slide 38
Slide 38 text
Source: http://bit.ly/1YxJ7Jy
Slide 39
Slide 39 text
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
Slide 40
Slide 40 text
Monitoring 101
Slide 41
Slide 41 text
Finding Signal - Categorizing Your Metrics
Slide 42
Slide 42 text
No content
Slide 43
Slide 43 text
No content
Slide 44
Slide 44 text
No content
Slide 45
Slide 45 text
No content
Slide 46
Slide 46 text
Examples: NGINX - Metrics
Work Metrics:
• Requests Per Second
• Request Time
• Error Rates (4xx or 5xx)
• Success (2xx)
Resource Metrics:
• Disk I/O
• Memory
• CPU
• Queue Length
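As an illustration of the work metrics listed above, the following sketch derives requests per second, error rate, and success rate from a batch of HTTP status codes. How the status codes are collected (access log, status endpoint) is left out; the function name `work_metrics` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable


def work_metrics(status_codes: Iterable[int], window_seconds: float) -> dict:
    """Summarize HTTP status codes observed over a time window into work metrics."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Bucket codes by class: 2 for 2xx, 4 for 4xx, 5 for 5xx, and so on.
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    errors = classes[4] + classes[5]
    return {
        "requests_per_second": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "success_rate": classes[2] / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = [200, 200, 200, 404, 500, 200, 301, 200, 200, 502]
    print(work_metrics(sample, window_seconds=5))
```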
Slide 47
Slide 47 text
Examples: NGINX - Events
• Configuration Change
• Code Deployment
• Service Started / Stopped
Slide 48
Slide 48 text
Examples: Events
Slide 49
Slide 49 text
When to let a sleeping
engineer lie?
Slide 50
Slide 50 text
When to alert?
Slide 51
Slide 51 text
Recurse until you find root cause
Slide 52
Slide 52 text
What to demand from our
monitoring tooling?
Slide 53
Slide 53 text
Cryptic Alerts
WHAT?
Slide 54
Slide 54 text
EVERY ALERT MUST BE ACTIONABLE
Slide 55
Slide 55 text
Host Centric
Slide 56
Slide 56 text
Service Centric
Slide 57
Slide 57 text
Static vs Dynamic
Static configurations tracking dynamic infrastructure are not a
recipe for success.
Slide 58
Slide 58 text
No content
Slide 59
Slide 59 text
No content
Slide 60
Slide 60 text
Query Based Monitoring
“What’s the average throughput of
application:nginx per version ?”
“Alert me when one of my pods from replication
controller:foo is not behaving like the others.”
“Show me rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2….”
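A query like the first one above amounts to a filter on tags followed by a group-by. The sketch below shows the idea on an in-memory list; the data points, tag names, and the `average_by` helper are all hypothetical and do not represent any particular product's query language.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: (tags, requests per second).
POINTS = [
    ({"application": "nginx", "version": "1", "datacenter": "us-east"}, 120.0),
    ({"application": "nginx", "version": "1", "datacenter": "eu-west"}, 100.0),
    ({"application": "nginx", "version": "2", "datacenter": "us-east"}, 90.0),
    ({"application": "redis", "version": "2", "datacenter": "us-east"}, 400.0),
]


def average_by(points, group_tag, **filters):
    """Average the values of points matching `filters`, grouped by `group_tag`."""
    groups = defaultdict(list)
    for tags, value in points:
        # Keep only points whose tags match every requested filter.
        if all(tags.get(key) == wanted for key, wanted in filters.items()):
            groups[tags.get(group_tag)].append(value)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # "What's the average throughput of application:nginx per version?"
    print(average_by(POINTS, "version", application="nginx"))
```

Because the query names tags rather than hosts, it keeps working as containers come and go, as long as new containers carry the same tags.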
Container Events
• Starting / Stopping Containers
• Scaling Events for Underlying Instances
• Deploying a new container build
Slide 64
Slide 64 text
No content
Slide 65
Slide 65 text
How do we get at the upper layers?
Slide 66
Slide 66 text
Getting at the Metrics
ACCESS METHOD   CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files    Yes           Yes              Some          Yes, in 1.6.1+
stats command   Basic         Basic            No            Basic
API             Yes           Yes              Some          Yes
Slide 67
Slide 67 text
Pseudo-files
• Provide visibility into container metrics via the file system.
• Generally under:
/cgroup//docker/$CONTAINER_ID/
or
/sys/fs/cgroup//docker/$CONTAINER_ID/
Slide 68
Slide 68 text
Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # time spent running processes since boot
> system 966 # time spent executing system calls since boot
Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)
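The pseudo-file output above is plain "key value" text, so it is easy to parse. Here is a minimal sketch, assuming the cpu.stat layout shown on the slide and that throttled_time is reported in nanoseconds (consistent with the slide's 12.12 seconds). The helper names are made up for illustration.

```python
from pathlib import Path


def parse_stat(text: str) -> dict:
    """Parse 'key value' lines, as found in cpuacct.stat and cpu.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttling_summary(cpu_stat: dict) -> dict:
    """Turn raw cpu.stat counters into a throttled fraction and seconds throttled."""
    periods = cpu_stat.get("nr_periods", 0)
    throttled = cpu_stat.get("nr_throttled", 0)
    return {
        "throttled_fraction": throttled / periods if periods else 0.0,
        "throttled_seconds": cpu_stat.get("throttled_time", 0) / 1e9,  # ns to s
    }


def read_cpu_stat(container_id: str, root: str = "/sys/fs/cgroup") -> dict:
    """Read cpu.stat for a container; the path layout follows the slide and may differ per host."""
    return parse_stat(Path(root, "cpu", "docker", container_id, "cpu.stat").read_text())


# Sample values copied from the slide.
SAMPLE = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"

if __name__ == "__main__":
    print(throttling_summary(parse_stat(SAMPLE)))
```

With the slide's numbers, the group was throttled in 559 of 565 enforcement periods, which suggests the container was hitting its CPU limit almost continuously.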
Slide 69
Slide 69 text
Docker API
• Detailed streaming metrics as JSON over an HTTP socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
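Each JSON document streamed by the stats endpoint can be reduced to a couple of percentages. The sketch below assumes the field names used by the Docker Engine API stats response (`cpu_stats`, `precpu_stats`, `memory_stats`); field availability varies by Docker version, so treat it as a starting point. The sample payload is invented.

```python
def cpu_percent(stats: dict) -> float:
    """Estimate CPU usage (%) from one stats sample and its preceding sample."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    # Older API versions lack online_cpus; fall back to the per-CPU list, then to 1.
    cpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    return cpu_delta / system_delta * cpus * 100.0


def memory_percent(stats: dict) -> float:
    """Memory usage as a percentage of the container's limit."""
    mem = stats["memory_stats"]
    limit = mem.get("limit", 0)
    return mem.get("usage", 0) / limit * 100.0 if limit else 0.0


# Invented sample with the shape assumed above.
SAMPLE_STATS = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 200_000_000},
        "system_cpu_usage": 2_000_000_000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 100_000_000},
        "system_cpu_usage": 1_000_000_000,
    },
    "memory_stats": {"usage": 75_000_000, "limit": 500_000_000},
}

if __name__ == "__main__":
    print(cpu_percent(SAMPLE_STATS), memory_percent(SAMPLE_STATS))
```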
Slide 70
Slide 70 text
STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O
ecb37227ac84 0.12% 71.53 MiB/490 MiB 14.60% 900.2 MB/275.5 MB 266.8 MB/872.7 MB
Slide 71
Slide 71 text
Side Car Containers
Slide 72
Slide 72 text
Aren’t we still missing a layer?
Slide 73
Slide 73 text
Open Questions
• What is the capacity of my cluster?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?
• Where is my container running? What port?
Slide 74
Slide 74 text
Service Discovery
[Diagram: the Monitoring Agent alongside the containers, with inputs from the Docker API (containers list and metadata), the Orchestrator (additional metadata: tags, etc.), and a Config Backend (integration configurations), plus host-level metrics.]
Slide 75
Slide 75 text
No content
Slide 76
Slide 76 text
Custom Metrics
• Instrument custom applications
• You know your key transactions best.
• Use async protocols like Etsy’s StatsD or
DogStatsD
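To make the "async protocol" point concrete, here is a minimal sketch of emitting a StatsD-style datagram over UDP. The wire format is `name:value|type`, with the DogStatsD tag extension `|#tag,tag`. The metric names, tags, and helper names are hypothetical, and the address assumes an agent listening on localhost port 8125 (the usual StatsD default).

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram, adding DogStatsD-style tags when provided."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: no reply is awaited, so the app is not blocked."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Count a key transaction and record its latency in milliseconds.
    send_metric(format_metric("checkout.completed", 1))
    send_metric(format_metric("checkout.latency", 320, "ms", tags=["version:2", "region:us-east"]))
```

Because UDP is connectionless, a missing or slow agent does not stall the instrumented application; the trade-off is that dropped datagrams go unnoticed.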
Slide 77
Slide 77 text
Source: http://bit.ly/1NoW6aj
Slide 78
Slide 78 text
Resources
Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/