$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Open Source
* Large scale web operations
* Monitoring and Metrics
* Planning FL/OSS and DevOps Events
(SCALE, TXLF, DevOpsDays, and more…)
Slide 3
Slide 3 text
Datadog Overview
• SaaS-based infrastructure monitoring
• Focus on modern infrastructure: cloud, containers, microservices
• Processing nearly a trillion data points per day
• Intelligent alerting
Slide 4
Slide 4 text
Monitor Everything
Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...
Slide 6
Slide 6 text
$ cat ~/.plan
1. Introduction: Why Containerize?
2. How: Collecting Docker and ECS Metrics
3. Finding the Signal: How do we know what to monitor?
4. Practice: Fitting it all together on ECS
Slide 7
Slide 7 text
Why Containerization?
Slide 8
Slide 8 text
More info at: www.datadoghq.com/docker-adoption/
Slide 10
Slide 10 text
Why Containers?
• Avoid Dependency Hell
• Single Artifact Deployments
ECS - Elastic Container Service
• Automatically manages and schedules your containers as ‘tasks’
• Ensures tasks are always running based on your parameters
• Integration with load balancing and routing via ELB
Slide 17
Slide 17 text
Monitoring in Motion
How do you define and monitor for normal when everything is changing around you?
Between ECS and containers you now have:
• Containers moving between hosts
• Changing ports
• Other changes underneath your feet
Slide 23
Slide 23 text
Adding up the numbers…
OS Metrics: ~100 per instance
Docker Status API: 223+ Metrics per container
ECS CloudWatch Metrics: 4 per cluster + 2 per service
App Metrics: ~50
Metrics Overload!
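To make the arithmetic concrete, here is a rough sketch that multiplies out the per-unit figures from the slide. The cluster size is hypothetical, and treating the ~50 app metrics as "per application" is an assumption, since the slide does not say what they are counted against.

```shell
#!/bin/sh
# Rough metric-count estimate from the per-unit figures on the slide.
# Assumption: the ~50 app metrics are counted once per application.
total_metrics() {
  instances=$1 containers=$2 clusters=$3 services=$4 apps=$5
  echo $(( instances * 100 + containers * 223 + clusters * 4 + services * 2 + apps * 50 ))
}

# Hypothetical cluster: 10 instances, 40 containers, 1 cluster, 5 services, 5 apps
total_metrics 10 40 1 5 5   # prints 10184
```

Even this small made-up cluster lands above ten thousand time series, and the container term dominates the total.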
Slide 25
Slide 25 text
Host Centric
Slide 26
Slide 26 text
Service Centric
Slide 27
Slide 27 text
Avoiding Gaps
Slide 29
Slide 29 text
Tags All the Way Down
Slide 30
Slide 30 text
Moving from statements to tag based queries
“Monitor all containers running image web in region us-west-2 across all availability zones that use more than 1.5x the average memory on c3.xlarge”
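One way to read that query is as a filter on tags followed by a comparison against the group average. The sketch below is only an illustration of that idea with awk over a made-up inventory; the field layout, container IDs, and memory values are all hypothetical, and a real monitoring backend would evaluate such a query for you.

```shell
#!/bin/sh
# Each line: container_id image region az instance_type memory_mb (all hypothetical)
inventory='c1 web us-west-2 us-west-2a c3.xlarge 100
c2 web us-west-2 us-west-2b c3.xlarge 100
c3 web us-west-2 us-west-2c c3.xlarge 400
c4 web us-east-1 us-east-1a c3.xlarge 900
c5 db us-west-2 us-west-2a c3.xlarge 800
c6 web us-west-2 us-west-2a m4.large 700'

# Select by tags (image, region, instance type), ignore the AZ,
# then print containers using more than 1.5x the group's average memory.
heavy_web_containers() {
  awk '
    $2 == "web" && $3 == "us-west-2" && $5 == "c3.xlarge" {
      id[n] = $1; mem[n] = $6 + 0; sum += $6; n++
    }
    END {
      if (n == 0) exit
      avg = sum / n
      for (i = 0; i < n; i++)
        if (mem[i] > 1.5 * avg) print id[i]
    }'
}

printf '%s\n' "$inventory" | heavy_web_containers   # prints c3
```

The query never names a host or a container ID, so it keeps working as containers are rescheduled.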
Slide 31
Slide 31 text
Monitoring 101
Slide 34
Slide 34 text
Collecting data is cheap; not having it when you need it can be expensive
Slide 35
Slide 35 text
Instrument all the things!
Slide 37
Slide 37 text
Monitoring 101: tl;dr Edition
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
Slide 38
Slide 38 text
tl;dr - Data Types
Slide 39
Slide 39 text
Examples: NGINX - Metrics
Work Metrics:
• Requests Per Second
• Dropped Connections
• Request Time
• Error Rates
Resource Metrics:
• Disk I/O
• Memory
• CPU
• Queue Length
Slide 40
Slide 40 text
Examples: NGINX - Events
• Configuration Change
• Code Deployment
• Service Started / Stopped
• etc
Slide 41
Slide 41 text
When to let a sleeping engineer lie?
Slide 42
Slide 42 text
When to alert?
Slide 43
Slide 43 text
Recurse until you find root cause
Slide 44
Slide 44 text
Getting at the Metrics
• ECS vs Docker
• Work Metrics vs Resource Metrics
CloudWatch and ECS
Resources:
• CPUReservation
• MemoryReservation
• CPUUtilization
• MemoryUtilization
Slide 49
Slide 49 text
How do we get at the upper layers?
Slide 50
Slide 50 text
Getting at the Metrics
Access method   CPU metrics   Memory metrics   I/O metrics   Network metrics
pseudo-files    Yes           Yes              Some          Yes, in 1.6.1+
stats command   Basic         Basic            No            Basic
API             Yes           Yes              Some          Yes
Slide 51
Slide 51 text
Pseudo-files
• Provide visibility into container metrics via the file system.
• Generally under:
/cgroup//docker/$CONTAINER_ID/
or
/sys/fs/cgroup//docker/$CONTAINER_ID/
Slide 52
Slide 52 text
Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # time spent running processes since boot
> system 966 # time spent executing system calls since boot
Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled, in nanoseconds (12.12 seconds)
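The raw counters are easier to reason about as a ratio. The following is a minimal sketch, assuming a cgroup v1 cpu.stat file with the three fields shown above; the function name throttle_summary is made up for illustration, and the sample file simply replays the numbers from the slide.

```shell
#!/bin/sh
# Summarize a cgroup v1 cpu.stat file: share of enforcement periods
# in which the group was throttled, plus total throttled time in seconds.
throttle_summary() {
  awk '
    $1 == "nr_periods"     { periods = $2 }
    $1 == "nr_throttled"   { throttled = $2 }
    $1 == "throttled_time" { ns = $2 }
    END {
      if (periods > 0)
        printf "throttled %.1f%% of periods, %.2f s total\n", 100 * throttled / periods, ns / 1e9
    }' "$1"
}

# Replay the values from the slide through a temporary file.
sample=$(mktemp)
printf 'nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n' > "$sample"
throttle_summary "$sample"   # prints: throttled 98.9% of periods, 12.12 s total
rm -f "$sample"
```

On a real host you would pass the container's own path, for example /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat.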
Slide 53
Slide 53 text
Docker API
• Detailed streaming metrics as JSON over an HTTP (Unix) socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
Slide 54
Slide 54 text
STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O BLOCK I/O
ecb37227ac84 0.12% 71.53 MiB/490 MiB 14.60% 900.2 MB/275.5 MB 266.8 MB/872.7 MB
Slide 55
Slide 55 text
Side Car Containers
Slide 56
Slide 56 text
Agents and Daemons
• Ideally we’d want to schedule an agent or daemon on each node via ECS Tasks.
• Current workarounds:
1. Bake it into your image.
2. Install on each host at provision time.
3. Automate with User Scripts and Launch Configs.
Slide 57
Slide 57 text
Grant Privileges via IAM
$ aws iam create-role \
--role-name ecs-monitoring \
--assume-role-policy-document file://trust.policy
$ aws iam put-role-policy \
--role-name ecs-monitoring \
--policy-name ecs-monitoring-policy \
--policy-document file://ecs.policy
$ aws iam create-instance-profile \
--instance-profile-name ECSNode
$ aws iam add-role-to-instance-profile \
--instance-profile-name ECSNode \
--role-name ecs-monitoring
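The slides do not show the contents of trust.policy. As a hedged sketch, the snippet below writes the standard trust relationship that lets EC2 instances assume a role; whether the talk used exactly this document is an assumption, and the helper name write_trust_policy is made up. The permissions in ecs.policy are not shown in the source either and are not guessed here.

```shell
#!/bin/sh
# Write an EC2 trust policy to the given path (default: trust.policy).
# Assumption: the role is meant to be assumed by EC2 instances.
write_trust_policy() {
  cat > "${1:-trust.policy}" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

write_trust_policy "${TMPDIR:-/tmp}/trust.policy"
```

The resulting file can be passed to create-role as file:// followed by its path.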
Open Questions
• Where is my container running?
• What is the capacity of my cluster?
• What port is my app running on?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?
Slide 62
Slide 62 text
Service Discovery
[Diagram. Labels: Docker API (containers list & metadata); ECS & CloudWatch (additional metadata: tags, etc.); Monitoring Agent Container; Config Backend (integration configurations); Host Level Metrics]
Slide 63
Slide 63 text
Custom Metrics
• Instrument custom applications
• You know your key transactions best.
• Use async protocols like Etsy’s StatsD
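StatsD metrics are plain text datagrams of the form name:value|type, usually sent over UDP so the application never waits on the collector. The sketch below only builds the datagram; the metric names are hypothetical, the optional tag suffix is the DogStatsD extension rather than part of the original StatsD format, and the send helper relies on bash's /dev/udp redirection and the conventional port 8125, so treat it as an assumption about your setup.

```shell
#!/bin/sh
# Build a StatsD datagram.
# $1 metric name, $2 value, $3 type (c, g, ms), $4 optional comma-separated tags
statsd_line() {
  if [ -n "${4:-}" ]; then
    printf '%s:%s|%s|#%s' "$1" "$2" "$3" "$4"   # tag suffix: DogStatsD extension
  else
    printf '%s:%s|%s' "$1" "$2" "$3"
  fi
}

# Fire-and-forget send over UDP (bash only; assumes an agent on 127.0.0.1:8125).
statsd_send() {
  statsd_line "$@" > /dev/udp/127.0.0.1/8125
}

statsd_line checkout.requests 1 c; echo
statsd_line checkout.latency 320 ms app:web,version:1.2; echo
```

Calling statsd_send checkout.requests 1 c from a bash script would emit the first datagram to a local agent.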