Slide 1

Slide 1 text

Monitoring In Motion
Challenges in monitoring Kubernetes, containers, and dynamic infrastructure.
ContainerCon Toronto, Aug 24, 2016
Ilan Rabinovitch
Director, Technical Community, Datadog

Slide 2

Slide 2 text

$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
 * Monitoring and Metrics
 * Large scale web operations
 * FL/OSS Community Events

Slide 3

Slide 3 text

Datadog Overview
 • SaaS based infrastructure and app monitoring
 • Open Source Agent
 • Time series data (metrics and events)
 • Processing nearly a trillion data points per day
 • Intelligent Alerting
 • We’re hiring! (www.datadoghq.com/careers/)

Slide 4

Slide 4 text

Monitor Everything
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

$ cat ~/.plan
1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads

Slide 7

Slide 7 text

Our Focus Area
Culture, Automation, Metrics, Sharing
Damon Edwards and John Willis, DevOps Day LA

Slide 8

Slide 8 text

Culture “organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations” - Melvin E. Conway

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Follow @honest_update on Twitter

Slide 15

Slide 15 text

Collecting data is cheap;
 not having it when you need it can be expensive

Slide 16

Slide 16 text

Instrument all the things!

Slide 17

Slide 17 text

Sharing: Looping Back on Culture
 • Describe the problem as your “enemy”, not each other
 • Learn Together

Slide 18

Slide 18 text

Sharing
Using and sharing the same metrics and measurements across teams is key to avoiding misunderstandings.

Slide 19

Slide 19 text

Source: http://bit.ly/1SvvbuP

Slide 20

Slide 20 text

Source: http://bit.ly/1RQRsXW

Slide 21

Slide 21 text

Operational Complexity Increases with...
 • Number of things to measure
 • Velocity of change

Slide 22

Slide 22 text

https://www.datadoghq.com/docker-adoption/

Slide 23

Slide 23 text

How much do we measure?
1 instance
 • 10 metrics from cloud providers
1 operating system (e.g., Linux)
 • 100 metrics
~50 metrics per application

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

How much do we measure?
1 instance
 • 10 metrics from cloud providers
1 operating system (e.g., Linux)
 • 100 metrics
~50 metrics per application
N containers
 • 150*N metrics

Slide 26

Slide 26 text

Operational Complexity 100 instances 500 containers

Slide 27

Slide 27 text

Operational Complexity: Scale 160 metrics per host 800 metrics per host Assuming 5 containers per host

Slide 28

Slide 28 text

Operational Complexity: Scale 100 instances 80,000 metrics Assuming 5 containers per host
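
The scale figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the 160-metric baseline is the sum of the per-instance figures from the earlier slide (10 + 100 + 50), and treating the 5-container case as 160 × 5 is an assumption, chosen because it matches the 800 shown on the slide. The function names are hypothetical.

```python
# Per-host figures taken from the earlier "how much we measure" slide.
CLOUD_PROVIDER_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50


def metrics_per_host(containers_per_host: int = 1) -> int:
    """Assumed model: the 160-metric baseline repeats once per container."""
    baseline = CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS
    return baseline * containers_per_host


def fleet_metrics(instances: int, containers_per_host: int) -> int:
    """Total metrics across a fleet of identical hosts."""
    return instances * metrics_per_host(containers_per_host)


print(metrics_per_host())      # baseline for a single host
print(metrics_per_host(5))     # assuming 5 containers per host
print(fleet_metrics(100, 5))   # 100 instances
```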

Slide 29

Slide 29 text

How much do we measure?
1 instance
 • 10 metrics from cloud providers
1 operating system (e.g., Linux)
 • 100 metrics
~50 metrics per application
N containers
 • 150*N metrics
Metrics Overload!

Slide 30

Slide 30 text

Operational Complexity Increases with...
 • Number of things to measure
 • Velocity of change

Slide 31

Slide 31 text

Source: Datadog

Slide 32

Slide 32 text

Source: http://bit.ly/1qFylWK

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Operational Complexity Increases with...
 • Number of things to measure
 • Velocity of change

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Open Questions
 • Where is my container running?
 • What is the capacity of my cluster?
 • What port is my app running on?
 • What’s the total throughput of my app?
 • What’s its response time per tag? (app, version, region)
 • What’s the distribution of 5xx errors per container?

Slide 38

Slide 38 text

Source: http://bit.ly/1YxJ7Jy

Slide 39

Slide 39 text

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

Slide 40

Slide 40 text

Monitoring 101

Slide 41

Slide 41 text

Finding Signal - Categorizing Your Metrics

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Examples: NGINX - Metrics
Work Metrics:
 • Requests Per Second
 • Request Time
 • Error Rates (4xx or 5xx)
 • Success (2xx)
Resource Metrics:
 • Disk I/O
 • Memory
 • CPU
 • Queue Length
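
As an illustration of a work metric, requests per second can be derived from two samples of the counters that NGINX's stub_status module exposes. This is a minimal sketch, not the collection method used in the talk: the sample text follows the stub_status output format, the second sample and the 10-second interval are made up, and the function names are hypothetical. Request time and error rates are not part of stub_status and would have to come from elsewhere, for example access logs.

```python
import re

# Two samples in the stub_status text format; the values in the second
# sample are invented for this example.
SAMPLE_T0 = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

SAMPLE_T1 = """Active connections: 305
server accepts handled requests
 16631048 16631048 31070965
Reading: 4 Writing: 181 Waiting: 120
"""


def parse_stub_status(text: str) -> dict:
    """Extract the gauge and the three cumulative counters."""
    active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
    accepts, handled, requests = map(
        int, re.search(r"\n\s*(\d+)\s+(\d+)\s+(\d+)", text).groups()
    )
    return {
        "active": active,
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
    }


def requests_per_second(earlier: dict, later: dict, interval_seconds: float) -> float:
    """Work metric: rate of change of the cumulative request counter."""
    return (later["requests"] - earlier["requests"]) / interval_seconds


t0 = parse_stub_status(SAMPLE_T0)
t1 = parse_stub_status(SAMPLE_T1)
print(requests_per_second(t0, t1, 10))
# Connections accepted but not handled hint at resource saturation.
print(t1["accepts"] - t1["handled"])
```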

Slide 47

Slide 47 text

Examples: NGINX - Events
 • Configuration Change
 • Code Deployment
 • Service Started / Stopped

Slide 48

Slide 48 text

Examples: Events

Slide 49

Slide 49 text

When to let a sleeping engineer lie?

Slide 50

Slide 50 text

When to alert?

Slide 51

Slide 51 text

Recurse until you find root cause

Slide 52

Slide 52 text

What to demand from our monitoring tooling?

Slide 53

Slide 53 text

Cryptic Alerts W H A T ?

Slide 54

Slide 54 text

EVERY ALERT MUST BE ACTIONABLE

Slide 55

Slide 55 text

Host Centric

Slide 56

Slide 56 text

Service Centric

Slide 57

Slide 57 text

Static vs Dynamic
Static configurations tracking dynamic infrastructure are not a recipe for success.

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Query Based Monitoring
“What’s the average throughput of application:nginx per version?”
“Alert me when one of my pods from replication controller:foo is not behaving like the others.”
“Show me the rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2…”
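
A query such as "average throughput of application:nginx per version" boils down to filtering data points by tag and grouping by another tag. The sketch below shows that idea on an in-memory list. The metric name, tag values, and the `query` helper are all hypothetical and do not represent any particular product's query language.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: (metric name, value, tags).
POINTS = [
    ("requests_per_s", 100.0, {"application": "nginx", "version": "1", "datacenter": "east"}),
    ("requests_per_s", 140.0, {"application": "nginx", "version": "1", "datacenter": "west"}),
    ("requests_per_s", 200.0, {"application": "nginx", "version": "2", "datacenter": "east"}),
    ("requests_per_s", 220.0, {"application": "nginx", "version": "2", "datacenter": "west"}),
    ("requests_per_s", 999.0, {"application": "haproxy", "version": "1", "datacenter": "east"}),
]


def query(points, metric, filters, group_by):
    """Average a metric over points matching all filter tags, per group tag."""
    groups = defaultdict(list)
    for name, value, tags in points:
        if name != metric:
            continue
        if any(tags.get(key) != wanted for key, wanted in filters.items()):
            continue
        groups[tags.get(group_by)].append(value)
    return {key: mean(values) for key, values in groups.items()}


# "What's the average throughput of application:nginx per version?"
per_version = query(POINTS, "requests_per_s", {"application": "nginx"}, "version")
print(per_version)

# "... across all data centers, running my app version 2"
v2_by_dc = query(POINTS, "requests_per_s", {"application": "nginx", "version": "2"}, "datacenter")
print(v2_by_dc)
```

Because the query names tags rather than hosts, containers that start or stop are picked up without editing any configuration.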

Slide 61

Slide 61 text

Getting at the metrics…

Slide 62

Slide 62 text

Resource Metrics
Utilization:
 • CPU (user + system)
 • memory
 • i/o
 • network traffic
Saturation:
 • throttling
 • swap
Error:
 • Network Errors (receive vs transmit)

Slide 63

Slide 63 text

Container Events
 • Starting / Stopping Containers
 • Scaling Events for Underlying Instances
 • Deploying a new container build

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

How do we get at the upper layers?

Slide 66

Slide 66 text

Getting at the Metrics
                 CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files     Yes           Yes              Some          Yes, in 1.6.1+
stats command    Basic         Basic            No            Basic
API              Yes           Yes              Some          Yes

Slide 67

Slide 67 text

Pseudo-files • Provide visibility into container metrics via the file system. • Generally under: 
 /cgroup//docker/$CONTAINER_ID/ 
 or
 /sys/fs/cgroup//docker/$CONTAINER_ID/


Slide 68

Slide 68 text

Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451    # time spent running processes since boot
> system 966   # time spent executing system calls since boot
Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565              # Number of enforcement intervals that have elapsed
> nr_throttled 559            # Number of times the group has been throttled
> throttled_time 12119585961  # Total time that members of the group were throttled (12.12 seconds)
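
The throttling counters are easier to interpret as a ratio and a duration. The sketch below parses the `cpu.stat` text shown on the slide and derives both. The helper names are hypothetical, and the conversion assumes `throttled_time` is reported in nanoseconds, which is consistent with the 12.12 seconds noted on the slide.

```python
# Contents of cpu.stat as shown on the slide.
CPU_STAT = """nr_periods 565
nr_throttled 559
throttled_time 12119585961
"""


def parse_stat(text: str) -> dict:
    """Parse 'key value' lines as found in cgroup stat pseudo-files."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_stat_file(path: str) -> dict:
    """Read a pseudo-file from disk; the path depends on the host's cgroup layout."""
    with open(path) as handle:
        return parse_stat(handle.read())


def throttling_summary(cpu_stat: dict) -> dict:
    """Share of enforcement periods that were throttled, plus total throttled time."""
    periods = cpu_stat["nr_periods"]
    ratio = cpu_stat["nr_throttled"] / periods if periods else 0.0
    return {
        "throttled_ratio": ratio,
        "throttled_seconds": cpu_stat["throttled_time"] / 1e9,  # assumed nanoseconds
    }


summary = throttling_summary(parse_stat(CPU_STAT))
print(f"{summary['throttled_ratio']:.1%} of periods throttled")
print(f"{summary['throttled_seconds']:.2f} s throttled in total")
```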

Slide 69

Slide 69 text

Docker API
 • Detailed streaming metrics as JSON over an HTTP socket
 $ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
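
Each JSON document streamed by the stats endpoint carries cumulative counters, so percentages have to be computed from deltas. The sketch below applies the CPU formula described in the Docker Engine API documentation to a hand-written sample. The sample values are invented, only a handful of fields are shown, and the function names are hypothetical.

```python
import json

# Invented sample with only the fields used below; real responses carry many more.
SAMPLE_STATS = json.loads("""
{
  "cpu_stats": {
    "cpu_usage": {"total_usage": 200000000, "percpu_usage": [120000000, 80000000]},
    "system_cpu_usage": 20000000000
  },
  "precpu_stats": {
    "cpu_usage": {"total_usage": 100000000},
    "system_cpu_usage": 10000000000
  },
  "memory_stats": {"usage": 75000000, "limit": 500000000}
}
""")


def cpu_percent(stats: dict) -> float:
    """Container CPU delta relative to system CPU delta, scaled by CPU count."""
    cpu = stats["cpu_stats"]
    previous = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - previous["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - previous["system_cpu_usage"]
    # Newer API versions report online_cpus; fall back to the per-CPU list.
    cpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage", [])) or 1
    if system_delta > 0:
        return cpu_delta / system_delta * cpus * 100.0
    return 0.0


def memory_percent(stats: dict) -> float:
    """Memory usage as a percentage of the container's limit."""
    memory = stats["memory_stats"]
    return memory["usage"] / memory["limit"] * 100.0


print(f"CPU: {cpu_percent(SAMPLE_STATS):.2f}%")
print(f"MEM: {memory_percent(SAMPLE_STATS):.2f}%")
```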


Slide 70

Slide 70 text

STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

Slide 71

Slide 71 text

Side Car Containers

Slide 72

Slide 72 text

Aren’t we still missing a layer?

Slide 73

Slide 73 text

Open Questions
 • What is the capacity of my cluster?
 • What’s the total throughput of my app?
 • What’s its response time per tag? (app, version, region)
 • What’s the distribution of 5xx errors per container?
 • Where is my container running? On what port?

Slide 74

Slide 74 text

Service Discovery
[Diagram. Components: Docker API, Orchestrator, Monitoring Agent, Config Backend, containers (labeled A and O). Labeled data flows: Containers List & Metadata; Additional Metadata (Tags, etc); Integration Configurations; Host Level Metrics]
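
One way to picture service discovery is as a join: the agent takes a list of running containers with their metadata, looks up an integration configuration for each image, and fills in the address it discovered. The sketch below is purely illustrative. The container records, the template format, and `build_checks` are hypothetical and do not reflect the Docker API, any orchestrator, or the Datadog Agent's actual configuration format.

```python
from string import Template

# Hypothetical integration templates, keyed by image name, as a config
# backend might store them.
CONFIG_TEMPLATES = {
    "nginx": Template("http://$host:$port/nginx_status"),
    "redis": Template("redis://$host:$port"),
}

# Hypothetical container list with metadata and orchestrator-provided tags.
CONTAINERS = [
    {"id": "a1", "image": "nginx:1.11", "host": "10.0.0.5", "port": 80,
     "tags": {"app": "web", "version": "2"}},
    {"id": "b2", "image": "redis:3.2", "host": "10.0.0.6", "port": 6379,
     "tags": {"app": "cache"}},
    {"id": "c3", "image": "custom-worker:7", "host": "10.0.0.7", "port": 9000,
     "tags": {"app": "worker"}},
]


def build_checks(containers, templates):
    """Pair each container with a rendered config; skip images with no template."""
    checks = []
    for container in containers:
        image_name = container["image"].split(":")[0]
        template = templates.get(image_name)
        if template is None:
            continue
        checks.append({
            "container_id": container["id"],
            "target": template.substitute(host=container["host"], port=container["port"]),
            "tags": [f"{key}:{value}" for key, value in sorted(container["tags"].items())],
        })
    return checks


checks = build_checks(CONTAINERS, CONFIG_TEMPLATES)
for check in checks:
    print(check)
```

Re-running `build_checks` whenever the container list changes keeps monitoring configuration in step with the infrastructure instead of relying on static host lists.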

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Custom Metrics
 • Instrument custom applications
 • You know your key transactions best.
 • Use async protocols like Etsy’s StatsD or DogStatsD
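
StatsD-style instrumentation is asynchronous because metrics are sent as small UDP datagrams that the application never waits on. The sketch below builds datagrams in the plain-text StatsD format, with the `|#` tag suffix that DogStatsD adds. The metric names and tags are made up, the helper names are hypothetical, and 127.0.0.1:8125 is assumed as the collector address (8125 is the conventional StatsD port).

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a StatsD datagram; the '|#' tag suffix is the DogStatsD extension."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: the caller does not wait for the collector."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# Counter ("c") and timer in milliseconds ("ms") for a made-up key transaction.
print(format_metric("checkout.completed", 1, "c"))
print(format_metric("checkout.duration", 320, "ms", tags=["app:shop", "version:2"]))
```

In application code, `send_metric(format_metric(...))` would be called at the point where the key transaction completes.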

Slide 77

Slide 77 text

Source: http://bit.ly/1NoW6aj

Slide 78

Slide 78 text

Resources
Monitoring 101: Alerting
 https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data
 https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues
 https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics
 https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics
 https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption
 https://www.datadoghq.com/docker-adoption/