Monitoring ECS and Dynamic Infrastructure

Slide 1

Slide 1 text

Slide 2

Slide 2 text

$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Open Source
* Large scale web operations
* Monitoring and Metrics
* Planning FL/OSS and DevOps Events (SCALE, TXLF, DevOpsDays, and more…)

Slide 3

Slide 3 text

Datadog Overview
• SaaS based infrastructure monitoring
• Focus on modern infrastructure
• Cloud, Containers, Micro Services
• Processing nearly a trillion data points per day
• Intelligent Alerting

Slide 4

Slide 4 text

Monitor Everything
Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

$ cat ~/.plan
1. Introduction: Why Containerize?
2. How: Collecting Docker and ECS Metrics
3. Finding the Signal: How do we know what to monitor?
4. Practice: Fitting it all together on ECS

Slide 7

Slide 7 text

Why Containerization?

Slide 8

Slide 8 text

More info at: www.datadoghq.com/docker-adoption/

Slide 9

Slide 9 text

Why Containers?
• Avoid Dependency Hell

Slide 10

Slide 10 text

Why Containers?
• Avoid Dependency Hell
• Single Artifact Deployments

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Why Containers?
Source: http://bit.ly/1SvvbuP

Slide 13

Slide 13 text

Source: http://bit.ly/1RQRsXW

Slide 14

Slide 14 text

Why Containers?
• Avoid Dependency Hell
• Single Artifact Deployments
• Quick, Low Cost Provisioning
Source: http://bit.ly/1qFylWK

Slide 15

Slide 15 text

Source: Datadog

Slide 16

Slide 16 text

ECS - Elastic Container Service
• Automatically manages and schedules your containers as ‘tasks’
• Ensures tasks are always running based on your parameters
• Integrates with load balancing and routing via ELB

Slide 17

Slide 17 text

Monitoring in Motion
How do you define and monitor for normal when everything is changing around you?
Between ECS and Containers you now have:
• Containers moving between hosts
• Changing ports
• Other changes underneath your feet

Slide 18

Slide 18 text

Adding up the numbers…
Docker Status API: 220+ Metrics per container

Slide 19

Slide 19 text

Adding up the numbers…
Docker Status API: 223+ Metrics per container
ECS CloudWatch Metrics: 4 per cluster + 2 per service

Slide 20

Slide 20 text

Adding up the numbers…
Docker Status API: 223+ Metrics per container
ECS CloudWatch Metrics: 4 per cluster + 2 per service
OS Metrics: ~100 per instance

Slide 21

Slide 21 text

Adding up the numbers…
Docker Status API: 223+ Metrics per container
ECS CloudWatch Metrics: 4 per cluster + 2 per service
OS Metrics: ~100 per instance
App Metrics: ~50

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Adding up the numbers…
OS Metrics: ~100 per instance
Docker Status API: 223+ Metrics per container
ECS CloudWatch Metrics: 4 per cluster + 2 per service
App Metrics: ~50
Metrics Overload!
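As a rough worked example of how quickly these counts add up, the sketch below multiplies the per-unit figures from the slide by a hypothetical fleet. The fleet sizes are made up for illustration, and reading "App Metrics: ~50" as 50 per application is an assumption, since the slide does not say per what.

```python
# Approximate per-unit metric counts taken from the slide.
DOCKER_METRICS_PER_CONTAINER = 223
OS_METRICS_PER_INSTANCE = 100
ECS_METRICS_PER_CLUSTER = 4
ECS_METRICS_PER_SERVICE = 2
APP_METRICS_PER_APP = 50  # assumption: the slide does not state the unit


def total_metrics(containers, instances, clusters, services, apps):
    """Return the approximate number of distinct metrics for a fleet."""
    return (
        containers * DOCKER_METRICS_PER_CONTAINER
        + instances * OS_METRICS_PER_INSTANCE
        + clusters * ECS_METRICS_PER_CLUSTER
        + services * ECS_METRICS_PER_SERVICE
        + apps * APP_METRICS_PER_APP
    )


if __name__ == "__main__":
    # Hypothetical fleet: 10 instances running 4 containers each.
    print(total_metrics(containers=40, instances=10, clusters=1, services=5, apps=5))
```

Even this small hypothetical cluster lands at roughly ten thousand metrics, which is the "overload" the slide is pointing at.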

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Host Centric

Slide 26

Slide 26 text

Service Centric

Slide 27

Slide 27 text

Avoiding Gaps

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Tags All the Way Down

Slide 30

Slide 30 text

Moving from statements to tag based queries
“Monitor all containers running image web in region us-west-2 across all availability zones that use more than 1.5x the average memory on c3.xlarge”
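The query on this slide can be sketched as a filter over tagged records. This is only an illustration of the idea, not how any particular monitoring product evaluates queries; the record layout, tag keys, and sample values are all hypothetical.

```python
from statistics import mean

# Hypothetical container records: a tag dictionary plus a memory reading in MiB.
CONTAINERS = [
    {"id": "c1", "memory_mib": 100,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xlarge"}},
    {"id": "c2", "memory_mib": 100,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2b", "instance_type": "c3.xlarge"}},
    {"id": "c3", "memory_mib": 100,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2c", "instance_type": "c3.xlarge"}},
    {"id": "c4", "memory_mib": 400,
     "tags": {"image": "web", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xlarge"}},
    {"id": "c5", "memory_mib": 900,
     "tags": {"image": "db", "region": "us-west-2", "az": "us-west-2a", "instance_type": "c3.xlarge"}},
    {"id": "c6", "memory_mib": 800,
     "tags": {"image": "web", "region": "us-east-1", "az": "us-east-1a", "instance_type": "c3.xlarge"}},
]


def matches(tags, wanted):
    """True when every wanted tag is present with the wanted value."""
    return all(tags.get(key) == value for key, value in wanted.items())


def heavy_containers(containers, wanted, factor=1.5):
    """IDs of matching containers using more than factor x the group's mean memory."""
    selected = [c for c in containers if matches(c["tags"], wanted)]
    if not selected:
        return []
    average = mean(c["memory_mib"] for c in selected)
    return [c["id"] for c in selected if c["memory_mib"] > factor * average]


if __name__ == "__main__":
    query = {"image": "web", "region": "us-west-2", "instance_type": "c3.xlarge"}
    print(heavy_containers(CONTAINERS, query))
```

Note that the availability zone tag is deliberately left out of the query, which is what "across all availability zones" amounts to: the filter never names a host or a zone, so it keeps working as containers move.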

Slide 31

Slide 31 text

Monitoring 101

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Collecting data is cheap;  not having it when you need it can be expensive

Slide 35

Slide 35 text

Instrument all the things!

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Monitoring 101: tl;dr Edition
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

Slide 38

Slide 38 text

tl;dr - Data Types

Slide 39

Slide 39 text

Examples: NGINX - Metrics
Work Metrics:
• Requests Per Second
• Dropped Connections
• Request Time
• Error Rates
Resource Metrics:
• Disk I/O
• Memory
• CPU
• Queue Length

Slide 40

Slide 40 text

Examples: NGINX - Events
• Configuration Change
• Code Deployment
• Service Started / Stopped
• etc.

Slide 41

Slide 41 text

When to let a sleeping engineer lie?

Slide 42

Slide 42 text

When to alert?

Slide 43

Slide 43 text

Recurse until you find root cause

Slide 44

Slide 44 text

Getting at the Metrics
• ECS vs Docker
• Work Metrics vs Resource Metrics

Slide 45

Slide 45 text

Resource Metrics
Utilization:
• CPU (user + system)
• Memory
• I/O
• Network traffic
Saturation:
• Throttling
• Swap
Error:
• Network errors (receive vs transmit)

Slide 46

Slide 46 text

Docker and ECS Events
• Starting / Stopping Containers
• Auto-scaled Underlying Instances

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

CloudWatch and ECS Resources
• CPUReservation
• MemoryReservation
• CPUUtilization
• MemoryUtilization
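A minimal sketch of how one of these cluster-level metrics might be requested. It only builds the request parameters and makes no AWS call. The helper name, the cluster name, and the time window are hypothetical; the `AWS/ECS` namespace and `ClusterName` dimension are the ones CloudWatch uses for ECS, but check them against the current AWS documentation before relying on this.

```python
from datetime import datetime, timedelta, timezone

# The four cluster-level metric names listed on the slide.
ECS_CLUSTER_METRICS = (
    "CPUReservation",
    "MemoryReservation",
    "CPUUtilization",
    "MemoryUtilization",
)


def cluster_metric_request(cluster_name, metric_name, minutes=60, period=300, now=None):
    """Build keyword arguments for a CloudWatch GetMetricStatistics request."""
    if metric_name not in ECS_CLUSTER_METRICS:
        raise ValueError(f"unknown ECS cluster metric: {metric_name}")
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ECS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


if __name__ == "__main__":
    # "my-cluster" is a placeholder name.
    print(cluster_metric_request("my-cluster", "CPUReservation"))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").get_metric_statistics(**request)`; that step is left out here so the sketch runs without an AWS account.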

Slide 49

Slide 49 text

How do we get at the upper layers?

Slide 50

Slide 50 text

Getting at the Metrics

                 CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files     Yes           Yes              Some          Yes, in 1.6.1+
stats command    Basic         Basic            No            Basic
API              Yes           Yes              Some          Yes

Slide 51

Slide 51 text

Pseudo-files
• Provide visibility into container metrics via the file system.
• Generally under:
  /cgroup//docker/$CONTAINER_ID/
  or
  /sys/fs/cgroup//docker/$CONTAINER_ID/

Slide 52

Slide 52 text

Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451     # time spent running processes since boot
> system 966    # time spent executing system calls since boot

Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565                # Number of enforcement intervals that have elapsed
> nr_throttled 559              # Number of times the group has been throttled
> throttled_time 12119585961    # Total time that members of the group were throttled (12.12 seconds)
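The throttling counters above are easier to reason about as a ratio. The sketch below parses the `cpu.stat` text shown on the slide and derives the share of enforcement periods that were throttled, plus the throttled time in seconds. The function names are hypothetical, and the file path in `read_cpu_stat` simply mirrors the one on the slide; it may differ on your hosts.

```python
from pathlib import Path

# Sample contents copied from the slide's cpu.stat output.
SAMPLE_CPU_STAT = """nr_periods 565
nr_throttled 559
throttled_time 12119585961
"""


def parse_stat(text):
    """Parse 'key value' lines from a cgroup stat pseudo-file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(cpu_stat):
    """Return (fraction of periods throttled, throttled time in seconds)."""
    periods = cpu_stat["nr_periods"]
    ratio = cpu_stat["nr_throttled"] / periods if periods else 0.0
    seconds = cpu_stat["throttled_time"] / 1e9  # the counter is in nanoseconds
    return ratio, seconds


def read_cpu_stat(container_id, root="/sys/fs/cgroup"):
    """Read cpu.stat for a container, using the path layout from the slide."""
    return parse_stat(Path(root, "cpu", "docker", container_id, "cpu.stat").read_text())


if __name__ == "__main__":
    ratio, seconds = throttle_summary(parse_stat(SAMPLE_CPU_STAT))
    print(f"throttled in {ratio:.1%} of periods, {seconds:.2f}s total")
```

For the sample values, 559 of 565 periods were throttled, so this container was hitting its CPU limit in nearly every enforcement interval.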

Slide 53

Slide 53 text

Docker API
• Detailed streaming metrics as JSON over an HTTP socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
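Each JSON object streamed by the stats endpoint carries raw counters rather than percentages. A minimal sketch of turning one sample into CPU and memory percentages follows. The payload here is a trimmed, made-up example, and the field names follow the Docker Engine API stats response as commonly documented; verify them against the API version you run, since fields such as `online_cpus` are not present in every release.

```python
import json

# Trimmed, illustrative sample of one stats object (values are made up).
SAMPLE = json.loads("""
{
  "cpu_stats":    {"cpu_usage": {"total_usage": 200000000},
                   "system_cpu_usage": 10000000000, "online_cpus": 2},
  "precpu_stats": {"cpu_usage": {"total_usage": 100000000},
                   "system_cpu_usage": 9000000000},
  "memory_stats": {"usage": 75000000, "limit": 500000000}
}
""")


def cpu_percent(stats):
    """CPU usage over the sampling interval, as a percentage of one core."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0 or cpu_delta < 0:
        return 0.0
    # Fall back to the per-CPU list when online_cpus is absent.
    ncpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or [None])
    return cpu_delta / system_delta * ncpus * 100.0


def memory_percent(stats):
    """Memory usage as a percentage of the container's limit."""
    mem = stats["memory_stats"]
    return mem["usage"] / mem["limit"] * 100.0


if __name__ == "__main__":
    print(f"cpu {cpu_percent(SAMPLE):.2f}%  mem {memory_percent(SAMPLE):.2f}%")
```

In practice each line read from the socket would be passed through `json.loads` and then to these helpers; the socket handling is omitted to keep the sketch self-contained.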

Slide 54

Slide 54 text

STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

Slide 55

Slide 55 text

Side Car Containers

Slide 56

Slide 56 text

Agents and Daemons
• Ideally we’d want to schedule an agent or daemon on each node via ECS Tasks.
• Current workarounds:
1. Bake it into your image.
2. Install on each host at provision time.
3. Automate with User Scripts and Launch Configs.

Slide 57

Slide 57 text

Grant Privileges via IAM
$ aws iam create-role \
    --role-name ecs-monitoring \
    --assume-role-policy-document file://trust.policy
$ aws iam put-role-policy \
    --role-name ecs-monitoring \
    --policy-name ecs-monitoring-policy \
    --policy-document file://ecs.policy
$ aws iam create-instance-profile \
    --instance-profile-name ECSNode
$ aws iam add-role-to-instance-profile \
    --instance-profile-name ECSNode \
    --role-name ecs-monitoring
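The slide references `trust.policy` without showing its contents. As an assumption, a role attached to EC2 instances through an instance profile usually needs a trust policy that lets the EC2 service assume it; the sketch below writes such a document. This is a guess at what the file could contain, not the speaker's actual file, and the contents of `ecs.policy` are not reconstructed here.

```python
import json


def ec2_trust_policy():
    """Trust policy allowing EC2 instances to assume the role (assumed content)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_policy(path, policy):
    """Serialize a policy document to disk as indented JSON."""
    with open(path, "w") as handle:
        json.dump(policy, handle, indent=2)


if __name__ == "__main__":
    # File name matches the one referenced by the create-role command.
    write_policy("trust.policy", ec2_trust_policy())
```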

Slide 58

Slide 58 text

Create A User Script

Slide 59

Slide 59 text

Auto-Scale!
$ aws autoscaling create-launch-configuration \
    --launch-configuration-name MyECSCluster \
    --key-name my-key \
    --image-id AMI_ID \
    --instance-type INSTANCE_TYPE \
    --user-data file://launch-script.txt \
    --iam-instance-profile IAM_ROLE

Slide 60

Slide 60 text

Aren’t we still missing a layer?

Slide 61

Slide 61 text

Open Questions
• Where is my container running?
• What is the capacity of my cluster?
• What port is my app running on?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?

Slide 62

Slide 62 text

Service Discovery
[Diagram labels: Docker API; ECS & CloudWatch; Monitoring Agent Container; Containers List & Metadata; Additional Metadata (Tags, etc); Config Backend; Integration Configurations; Host Level Metrics]

Slide 63

Slide 63 text

Custom Metrics
• Instrument custom applications.
• You know your key transactions best.
• Use async protocols like Etsy’s StatsD.
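A minimal sketch of what emitting a custom metric over StatsD can look like. It formats plain StatsD datagrams (`name:value|type`, with an optional `|@rate` sample rate) and sends them over UDP, which is what makes the protocol fire-and-forget from the application's point of view. The metric names are hypothetical, and the host and port assume a StatsD-compatible daemon listening on the conventional `127.0.0.1:8125`.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build a StatsD datagram such as 'checkout.requests:1|c'."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"
    return packet


def send_metric(packet, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; no reply is expected or awaited."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical metric names for a key transaction.
    send_metric(format_metric("checkout.requests", 1, "c"))     # counter
    send_metric(format_metric("checkout.latency", 320, "ms"))   # timer, in ms
```

Because UDP does not wait for an acknowledgement, a slow or missing collector does not block the request path being measured.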